Merge branch 'caorunzhe' into 'master'

Caorunzhe See merge request !514

Commit 79d6f5e0 authored Nov 30, 2020 by 曹润柘
--- a/Chapter10/Figures/figure-attention-of-source-and-target-words.tex
+++ b/Chapter10/Figures/figure-attention-of-source-and-target-words.tex
@@ -47,7 +47,7 @@

 {
 \draw [<->,ublue,thick] ([xshift=0.3em]ws4.south) .. controls +(-60:1) and +(south:1) .. (wt4.south);
-\draw [<->,ublue,thick] (ws4.south) .. controls +(south:1.0) and +(south:1.5) .. (wt5.south);
+\draw [<->,ublue,thick] (ws4.south) .. controls +(south:1) and +(south:1.5) .. (wt5.south);
 }

 {

--- a/Chapter10/Figures/figure-encoder-decoder-with-attention.tex
+++ b/Chapter10/Figures/figure-encoder-decoder-with-attention.tex
@@ -80,9 +80,9 @@

 \draw[<-] ([yshift=0.1em,xshift=1em]t6.north) -- ([yshift=1.2em,xshift=1em]t6.north);

-\draw [->] ([yshift=3em]s6.north) -- ([yshift=4em]s6.north) -- ([yshift=4em]t1.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c1) {\scriptsize{表示$\vectorn{C}_1$}} -- ([yshift=3em]t1.north) ;
-\draw [->] ([yshift=3em]s5.north) -- ([yshift=5.3em]s5.north) -- ([yshift=5.3em]t2.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c2) {\scriptsize{表示$\vectorn{C}_2$}} -- ([yshift=3em]t2.north) ;
-\draw [->] ([yshift=3.5em]s3.north) -- ([yshift=6.6em]s3.north) -- ([yshift=6.6em]t4.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c3) {\scriptsize{表示$\vectorn{C}_i$}} -- ([yshift=3.5em]t4.north) ;
+\draw [->] ([yshift=3em]s6.north) -- ([yshift=4em]s6.north) -- ([yshift=4em]t1.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c1) {\scriptsize{表示$\mathbi{C}_1$}} -- ([yshift=3em]t1.north) ;
+\draw [->] ([yshift=3em]s5.north) -- ([yshift=5.3em]s5.north) -- ([yshift=5.3em]t2.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c2) {\scriptsize{表示$\mathbi{C}_2$}} -- ([yshift=3em]t2.north) ;
+\draw [->] ([yshift=3.5em]s3.north) -- ([yshift=6.6em]s3.north) -- ([yshift=6.6em]t4.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c3) {\scriptsize{表示$\mathbi{C}_i$}} -- ([yshift=3.5em]t4.north) ;
 \node [anchor=north] (smore) at ([yshift=3.5em]s3.north) {...};
 \node [anchor=north] (tmore) at ([yshift=3.5em]t4.north) {...};


--- a/Chapter10/chapter10.tex
+++ b/Chapter10/chapter10.tex
@@ -417,25 +417,25 @@ NMT                     & 21.7          & 18.7           & -13.7      \\

 \parinterval {\chapternine}已经对循环神经网络的基本知识进行过介绍，这里再回顾一下。简单来说，循环神经网络由循环单元组成。对于序列中的任意时刻，都有一个循环单元与之对应，它会融合当前时刻的输入和上一时刻循环单元的输出，生成当前时刻的输出。这样每个时刻的信息都会被传递到下一时刻，这也间接达到了记录历史信息的目的。比如，对于序列$\seq{x}=\{x_1, x_2,..., x_m\}$，循环神经网络会按顺序输出一个序列$\seq{h}=\{ \mathbi{h}_1, \mathbi{h}_2,..., \mathbi{h}_m \}$，其中$\mathbi{h}_i$表示$i$时刻循环神经网络的输出（通常为一个向量）。

-\parinterval 图\ref{fig:10-9}展示了一个循环神经网络处理序列问题的实例。当前时刻循环单元的输入由上一个时刻的输出和当前时刻的输入组成，因此也可以理解为，网络当前时刻计算得到的输出是由之前的序列共同决定的，即网络在不断地传递信息的过程中记忆了历史信息。以最后一个时刻的循环单元为例，它在对“开始”这个单词的信息进行处理时，参考了之前所有词（“<sos>\ 让\ 我们”）的信息。
+\parinterval 图\ref{fig:10-8}展示了一个循环神经网络处理序列问题的实例。当前时刻循环单元的输入由上一个时刻的输出和当前时刻的输入组成，因此也可以理解为，网络当前时刻计算得到的输出是由之前的序列共同决定的，即网络在不断地传递信息的过程中记忆了历史信息。以最后一个时刻的循环单元为例，它在对“开始”这个单词的信息进行处理时，参考了之前所有词（“<sos>\ 让\ 我们”）的信息。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-structure-of-a-recurrent-network-model}
 \caption{循环神经网络处理序列的实例}
-\label{fig:10-9}
+\label{fig:10-8}
 \end{figure}
 %----------------------------------------------

-\parinterval 在神经机器翻译里使用循环神经网络也很简单。只需要把源语言句子和目标语言句子分别看作两个序列，之后使用两个循环神经网络分别对其进行建模。这个过程如图\ref{fig:10-10}所示。图中，下半部分是编码器，上半部分是解码器。编码器利用循环神经网络对源语言序列逐词进行编码处理，同时利用循环单元的记忆能力，不断累积序列信息，遇到终止符<eos>后便得到了包含源语言句子全部信息的表示结果。解码器利用编码器的输出和起始符<sos>开始逐词地进行解码，即逐词翻译，每得到一个译文单词，便将其作为当前时刻解码端循环单元的输入，这也是一个典型的神经语言模型的序列生成过程。解码器通过循环神经网络不断地累积已经得到的译文的信息，并继续生成下一个单词，直到遇到结束符<eos>，便得到了最终完整的译文。
+\parinterval 在神经机器翻译里使用循环神经网络也很简单。只需要把源语言句子和目标语言句子分别看作两个序列，之后使用两个循环神经网络分别对其进行建模。这个过程如图\ref{fig:10-9}所示。图中，下半部分是编码器，上半部分是解码器。编码器利用循环神经网络对源语言序列逐词进行编码处理，同时利用循环单元的记忆能力，不断累积序列信息，遇到终止符<eos>后便得到了包含源语言句子全部信息的表示结果。解码器利用编码器的输出和起始符<sos>开始逐词地进行解码，即逐词翻译，每得到一个译文单词，便将其作为当前时刻解码端循环单元的输入，这也是一个典型的神经语言模型的序列生成过程。解码器通过循环神经网络不断地累积已经得到的译文的信息，并继续生成下一个单词，直到遇到结束符<eos>，便得到了最终完整的译文。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-model-structure-based-on-recurrent-neural-network-translation}
 \caption{基于循环神经网络翻译的模型结构}
-\label{fig:10-10}
+\label{fig:10-9}
 \end{figure}
 %----------------------------------------------

@@ -454,19 +454,19 @@ NMT                     & 21.7          & 18.7           & -13.7      \\
 \vspace{-0.5em}
 \noindent 其中，$ \seq{{y}}_{<j }$表示目标语言第$j$个位置之前已经生成的译文单词序列。$ \funp{P} ( y_j | \seq{{y}}_{<j }, \seq{{x}})$可以被解释为：根据源语言句子$\seq{{x}} $和已生成的目标语言译文片段$\seq{{y}}_{<j }=\{ y_1, y_2,..., y_{j-1} \}$,生成第$j$个目标语言单词$y_j$的概率。

-\parinterval 求解$\funp{P}(y_j | \seq{{y}}_{<j},\seq{{x}})$有三个关键问题（图\ref{fig:10-11}）：
+\parinterval 求解$\funp{P}(y_j | \seq{{y}}_{<j},\seq{{x}})$有三个关键问题（图\ref{fig:10-10}）：

 \vspace{-0.5em}
 \begin{itemize}
 \vspace{0.5em}
 \item	如何对$\seq{{x}}$和$\seq{{y}}_{<j }$进行分布式表示，即词嵌入。首先，将由One-hot向量表示的源语言单词，即由0和1构成的离散化向量表示，转化为实数向量。可以把这个过程记为$\textrm{e}_x (\cdot)$。类似地，可以把目标语言序列$\seq{{y}}_{<j }$中的每个单词用同样的方式进行表示，记为$\textrm{e}_y (\cdot)$。
 \vspace{0.5em}
-\item	如何在词嵌入的基础上获取整个序列的表示，即句子的表示学习。可以把词嵌入的序列作为循环神经网络的输入，循环神经网络最后一个时刻的输出向量便是整个句子的表示结果。如图\ref{fig:10-11}中，编码器最后一个循环单元的输出$\mathbi{h}_m$被看作是一种包含了源语言句子信息的表示结果，记为$\mathbi{C}$。
+\item	如何在词嵌入的基础上获取整个序列的表示，即句子的表示学习。可以把词嵌入的序列作为循环神经网络的输入，循环神经网络最后一个时刻的输出向量便是整个句子的表示结果。如图\ref{fig:10-10}中，编码器最后一个循环单元的输出$\mathbi{h}_m$被看作是一种包含了源语言句子信息的表示结果，记为$\mathbi{C}$。
 \vspace{0.5em}
 \item	如何得到每个目标语言单词的概率，即译文单词的{\small\sffamily\bfseries{生成}}\index{生成}（Generation）\index{Generation}。与神经语言模型一样，可以用一个Softmax输出层来获取当前时刻所有单词的分布，即利用Softmax 函数计算目标语言词表中每个单词的概率。令目标语言序列$j$时刻的循环神经网络的输出向量（或状态）为$\mathbi{s}_j$。根据循环神经网络的性质，$ y_j$ 的生成只依赖前一个状态$\mathbi{s}_{j-1}$和当前时刻的输入（即词嵌入$\textrm{e}_y (y_{j-1})$）。同时考虑源语言信息$\mathbi{C}$，$\funp{P}(y_j  | \seq{{y}}_{<j},\seq{{x}})$可以被重新定义为：
 \begin{eqnarray}
 \funp{P} (y_j | \seq{{y}}_{<j},\seq{{x}}) = \funp{P} ( {y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}} )
-\label{eq:10-4}
+\label{eq:10-3}
 \end{eqnarray}
 $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softmax的输入是循环神经网络$j$时刻的输出。在具体实现时，$\mathbi{C}$可以被简单地作为第一个时刻循环单元的输入，即，当$j=1$ 时，解码器的循环神经网络会读入编码器最后一个隐层状态$ \mathbi{h}_m$（也就是$\mathbi{C}$），而其他时刻的隐层状态不直接与$\mathbi{C}$相关。最终，$\funp{P} (y_j | \seq{{y}}_{<j},\seq{{x}})$ 被表示为：
 \begin{eqnarray}
@@ -475,7 +475,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \funp{P} (y_j |\mathbi{C} ,y_{j-1}) &j=1 \\
 \funp{P} (y_j|\mathbi{s}_{j-1},y_{j-1})  \quad &j>1
 \end{array} \right .
-\label{eq:10-5}
+\label{eq:10-4}
 \end{eqnarray}
 \vspace{0.5em}
 \end{itemize}
@@ -485,16 +485,15 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \centering
 \input{./Chapter10/Figures/figure-3-base-problom-of-p}
 \caption{求解$\funp{P} (y_j | \seq{{y}}_{<j},\seq{{x}})$的三个基本问题}
-\label{fig:10-11}
+\label{fig:10-10}
 \end{figure}
 %----------------------------------------------

 \parinterval 输入层（词嵌入）和输出层（Softmax）的内容已在{\chapternine}进行了介绍，因此这里的核心内容是设计循环神经网络结构，即设计循环单元的结构。至今，研究人员已经提出了很多优秀的循环单元结构。其中循环神经网络（RNN）
 是最原始的循环单元结构。在RNN中，对于序列$\seq{{x}}=\{ \mathbi{x}_1, \mathbi{x}_2,...,\mathbi{x}_m \}$，每个时刻$t$都对应一个循环单元，它的输出是一个向量$\mathbi{h}_t$，可以被描述为：
-
 \begin{eqnarray}
 \mathbi{h}_t=f(\mathbi{x}_t \mathbi{U}+\mathbi{h}_{t-1} \mathbi{W}+\mathbi{b})
-\label{eq:10-11}
+\label{eq:10-5}
 \end{eqnarray}

 \noindent 其中$\mathbi{x}_t$是当前时刻的输入，$\mathbi{h}_{t-1}$是上一时刻循环单元的输出，$f(\cdot)$是激活函数，$\mathbi{U}$和$\mathbi{W}$是参数矩阵，$\mathbi{b}$是偏置。
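The recurrence in the equation above, $\mathbi{h}_t=f(\mathbi{x}_t \mathbi{U}+\mathbi{h}_{t-1} \mathbi{W}+\mathbi{b})$, can be sketched in plain Python as follows. This is only an illustrative sketch: the helper names (`matvec_row`, `rnn_step`, `rnn_forward`), the choice of Tanh for $f(\cdot)$, and the all-zero initial state are assumptions made here, not something the chapter prescribes.

```python
import math

def matvec_row(vec, mat):
    """Row vector times matrix: returns vec @ mat as a list."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec))
            for j in range(len(mat[0]))]

def rnn_step(x_t, h_prev, U, W, b, f=math.tanh):
    """One recurrent step: h_t = f(x_t U + h_{t-1} W + b)."""
    xu = matvec_row(x_t, U)      # contribution of the current input
    hw = matvec_row(h_prev, W)   # contribution of the previous output
    return [f(a + c + d) for a, c, d in zip(xu, hw, b)]

def rnn_forward(xs, U, W, b):
    """Run the recurrent unit over a sequence, returning every h_t."""
    h = [0.0] * len(b)           # assumed initial state h_0 = 0
    hs = []
    for x_t in xs:
        h = rnn_step(x_t, h, U, W, b)
        hs.append(h)
    return hs

# Toy usage: 2-dimensional inputs and a 2-dimensional hidden state.
U = [[1.0, 0.0], [0.0, 1.0]]
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
hs = rnn_forward([[0.5, -0.5], [0.0, 0.0]], U, W, b)
```

The loop makes the "memory" explicit: each `h` is fed back into the next call, so the last element of `hs` depends on the whole input sequence whenever `W` is non-zero.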
@@ -510,7 +509,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm

 \parinterval RNN结构使得当前时刻循环单元的状态包含了之前时间步的状态信息。但是这种对历史信息的记忆并不是无损的，随着序列变长，RNN的记忆信息的损失越来越严重。在很多长序列处理任务中（如长文本生成）都观测到了类似现象。对于这个问题，研究者们提出了{\small\bfnew{长短时记忆}}\index{长短时记忆}（Long Short-term Memory）\index{Long Short-term Memory，LSTM}模型，也就是常说的LSTM模型\upcite{HochreiterLong}。

-\parinterval LSTM模型是RNN模型的一种改进。相比RNN仅传递前一时刻的状态$\mathbi{h}_{t-1}$，LSTM会同时传递两部分信息：状态信息$\mathbi{h}_{t-1}$和记忆信息$\mathbi{c}_{t-1}$。这里，$\mathbi{c}_{t-1}$是新引入的变量，它也是循环单元的一部分，用于显性地记录需要记录的历史内容，$\mathbi{h}_{t-1}$和$\mathbi{c}_{t-1}$在循环单元中会相互作用。LSTM通过“门”单元来动态地选择遗忘多少以前的信息和记忆多少当前的信息。LSTM中所使用的门单元结构如图\ref{fig:10-15}所示，包括遗忘门，输入门和输出门。图中$\sigma$代表Sigmoid函数，它将函数输入映射为0-1范围内的实数，用来充当门控信号。
+\parinterval LSTM模型是RNN模型的一种改进。相比RNN仅传递前一时刻的状态$\mathbi{h}_{t-1}$，LSTM会同时传递两部分信息：状态信息$\mathbi{h}_{t-1}$和记忆信息$\mathbi{c}_{t-1}$。这里，$\mathbi{c}_{t-1}$是新引入的变量，它也是循环单元的一部分，用于显性地记录需要记录的历史内容，$\mathbi{h}_{t-1}$和$\mathbi{c}_{t-1}$在循环单元中会相互作用。LSTM通过“门”单元来动态地选择遗忘多少以前的信息和记忆多少当前的信息。LSTM中所使用的门单元结构如图\ref{fig:10-11}所示，包括遗忘门，输入门和输出门。图中$\sigma$代表Sigmoid函数，它将函数输入映射为0-1范围内的实数，用来充当门控信号。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -520,7 +519,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \subfigure[记忆更新]{\input{./Chapter10/Figures/figure-lstm03}}
 \subfigure[输出门]{\input{./Chapter10/Figures/figure-lstm04}}
 \caption{LSTM中的门控结构}
-\label{fig:10-15}
+\label{fig:10-11}
 \end{figure}
 %----------------------------------------------

@@ -528,43 +527,43 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm

 \begin{itemize}
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{遗忘}}\index{遗忘}。顾名思义，遗忘的目的是忘记一些历史，在LSTM中通过遗忘门实现，其结构如图\ref{fig:10-15}(a)所示。$\mathbi{x}_{t}$表示时刻$t$的输入向量，$\mathbi{h}_{t-1}$是时刻$t-1$的循环单元的输出，$\mathbi{x}_{t}$和$\mathbi{h}_{t-1}$都作为$t$时刻循环单元的输入。$\sigma$将对$\mathbi{x}_{t}$和$\mathbi{h}_{t-1}$进行筛选，以决定遗忘的信息，其计算公式如下：
+\item {\small\sffamily\bfseries{遗忘}}\index{遗忘}。顾名思义，遗忘的目的是忘记一些历史，在LSTM中通过遗忘门实现，其结构如图\ref{fig:10-11}(a)所示。$\mathbi{x}_{t}$表示时刻$t$的输入向量，$\mathbi{h}_{t-1}$是时刻$t-1$的循环单元的输出，$\mathbi{x}_{t}$和$\mathbi{h}_{t-1}$都作为$t$时刻循环单元的输入。$\sigma$将对$\mathbi{x}_{t}$和$\mathbi{h}_{t-1}$进行筛选，以决定遗忘的信息，其计算公式如下：
 \begin{eqnarray}
 \mathbi{f}_t=\sigma(\mathbi{W}_f [\mathbi{h}_{t-1},\mathbi{x}_{t}] + \mathbi{b}_f )
-\label{eq:10-12}
+\label{eq:10-6}
 \end{eqnarray}

 这里，$\mathbi{W}_f$是权值，$\mathbi{b}_f$是偏置，$[\mathbi{h}_{t-1},\mathbi{x}_{t}]$表示两个向量的拼接。该公式可以解释为，对$[\mathbi{h}_{t-1},\mathbi{x}_{t}]$进行变换，并得到一个实数向量$\mathbi{f}_t$。$\mathbi{f}_t$的每一维都可以被理解为一个“门”，它决定可以有多少信息被留下（或遗忘）。
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{记忆更新}}\index{记忆更新}。首先，要生成当前时刻需要新增加的信息，该部分由输入门完成，其结构如图\ref{fig:10-15}(b)红色线部分，图中“$\bigotimes$”表示进行点乘操作。输入门的计算分为两部分，首先利用$\sigma$决定门控参数$\mathbi{i}_t$，然后通过Tanh函数得到新的信息$\hat{\mathbi{c}}_t$，具体公式如下：
+\item {\small\sffamily\bfseries{记忆更新}}\index{记忆更新}。首先，要生成当前时刻需要新增加的信息，该部分由输入门完成，其结构如图\ref{fig:10-11}(b)红色线部分，图中“$\bigotimes$”表示进行点乘操作。输入门的计算分为两部分，首先利用$\sigma$决定门控参数$\mathbi{i}_t$，然后通过Tanh函数得到新的信息$\hat{\mathbi{c}}_t$，具体公式如下：
 \begin{eqnarray}
-\mathbi{i}_t & = & \sigma (\mathbi{W}_i [\mathbi{h}_{t-1},\mathbi{x}_{t}] + \mathbi{b}_i ) \label{eq:10-13} \\
-\hat{\mathbi{c}}_t & = & \textrm{Tanh} (\mathbi{W}_c [\mathbi{h}_{t-1},\mathbi{x}_{t}] + \mathbi{b}_c ) \label{eq:10-14}
+\mathbi{i}_t & = & \sigma (\mathbi{W}_i [\mathbi{h}_{t-1},\mathbi{x}_{t}] + \mathbi{b}_i ) \label{eq:10-7} \\
+\hat{\mathbi{c}}_t & = & \textrm{Tanh} (\mathbi{W}_c [\mathbi{h}_{t-1},\mathbi{x}_{t}] + \mathbi{b}_c ) \label{eq:10-8}
 \end{eqnarray}

-之后，用$\mathbi{i}_t$点乘$\hat{\mathbi{c}}_t$，得到当前需要记忆的信息，记为$\mathbi{i}_t \cdot  \hat{\mathbi{c}}_t$。接下来需要更新旧的信息$\mathbi{c}_{t-1}$，得到新的记忆信息$\mathbi{c}_t$，更新的操作如图\ref{fig:10-15}(c)红色线部分所示，“$\bigoplus$”表示相加。具体规则是通过遗忘门选择忘记一部分上文信息$\mathbi{f}_t$，通过输入门计算新增的信息$\mathbi{i}_t \cdot  \hat{\mathbi{c}}_t$，然后根据“$\bigotimes$”门与“$\bigoplus$”门进行相应的乘法和加法计算：
+之后，用$\mathbi{i}_t$点乘$\hat{\mathbi{c}}_t$，得到当前需要记忆的信息，记为$\mathbi{i}_t \cdot  \hat{\mathbi{c}}_t$。接下来需要更新旧的信息$\mathbi{c}_{t-1}$，得到新的记忆信息$\mathbi{c}_t$，更新的操作如图\ref{fig:10-11}(c)红色线部分所示，“$\bigoplus$”表示相加。具体规则是通过遗忘门选择忘记一部分上文信息$\mathbi{f}_t$，通过输入门计算新增的信息$\mathbi{i}_t \cdot  \hat{\mathbi{c}}_t$，然后根据“$\bigotimes$”门与“$\bigoplus$”门进行相应的乘法和加法计算：
 \begin{eqnarray}
 \mathbi{c}_t = \mathbi{f}_t \cdot \mathbi{c}_{t-1} + \mathbi{i}_t  \cdot \hat{\mathbi{c}_t}
-\label{eq:10-15}
+\label{eq:10-9}
 \end{eqnarray}
 \vspace{-1.0em}
-\item {\small\sffamily\bfseries{输出}}\index{输出}。该部分使用输出门计算最终的输出信息$\mathbi{h}_t$，其结构如图\ref{fig:10-15}(d)红色线部分所示。在输出门中，首先将$\mathbi{x}_t$和$\mathbi{h}_{t-1}$通过$\sigma$函数变换得到$\mathbi{o}_t$。其次，将上一步得到的新记忆信息$\mathbi{c}_t$通过Tanh函数进行变换，得到值在[-1，1]范围的向量。最后将这两部分进行点乘，具体公式如下：
+\item {\small\sffamily\bfseries{输出}}\index{输出}。该部分使用输出门计算最终的输出信息$\mathbi{h}_t$，其结构如图\ref{fig:10-11}(d)红色线部分所示。在输出门中，首先将$\mathbi{x}_t$和$\mathbi{h}_{t-1}$通过$\sigma$函数变换得到$\mathbi{o}_t$。其次，将上一步得到的新记忆信息$\mathbi{c}_t$通过Tanh函数进行变换，得到值在[-1，1]范围的向量。最后将这两部分进行点乘，具体公式如下：
 \begin{eqnarray}
-\mathbi{o}_t & = & \sigma (\mathbi{W}_o [\mathbi{h}_{t-1},\mathbi{x}_{t}] + \mathbi{b}_o ) \label{eq:10-16} \\
-\mathbi{h}_t & = & \mathbi{o}_t \cdot \textrm{Tanh} (\mathbi{c}_t) \label{eq:6-17}
+\mathbi{o}_t & = & \sigma (\mathbi{W}_o [\mathbi{h}_{t-1},\mathbi{x}_{t}] + \mathbi{b}_o ) \label{eq:10-10} \\
+\mathbi{h}_t & = & \mathbi{o}_t \cdot \textrm{Tanh} (\mathbi{c}_t) \label{eq:10-11}
 \end{eqnarray}
 \vspace{0.5em}
 \end{itemize}


-\parinterval LSTM的完整结构如图\ref{fig:10-16}所示，模型的参数包括：参数矩阵$\mathbi{W}_f$、$\mathbi{W}_i$ 、$\mathbi{W}_c$、\\$\mathbi{W}_o$和偏置$\mathbi{b}_f$、$\mathbi{b}_i$、$\mathbi{b}_c$、$\mathbi{b}_o$。可以看出，$\mathbi{h}_t$是由$\mathbi{c}_{t-1}$、$\mathbi{h}_{t-1}$与$\mathbi{x}_t$共同决定的。此外，上述公式中激活函数的选择是根据函数各自的特点决定的。
+\parinterval LSTM的完整结构如图\ref{fig:10-12}所示，模型的参数包括：参数矩阵$\mathbi{W}_f$、$\mathbi{W}_i$ 、$\mathbi{W}_c$、\\$\mathbi{W}_o$和偏置$\mathbi{b}_f$、$\mathbi{b}_i$、$\mathbi{b}_c$、$\mathbi{b}_o$。可以看出，$\mathbi{h}_t$是由$\mathbi{c}_{t-1}$、$\mathbi{h}_{t-1}$与$\mathbi{x}_t$共同决定的。此外，上述公式中激活函数的选择是根据函数各自的特点决定的。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-the-whole-of-lstm}
 \caption{LSTM的整体结构}
-\label{fig:10-16}
+\label{fig:10-12}
 \end{figure}
 %----------------------------------------------
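As a rough sketch, one LSTM step can be written by following the forget, memory-update and output equations in this hunk term by term. The function names, the dictionary layout of the parameters, and the representation of each weight matrix as a list of rows applied to the concatenation $[\mathbi{h}_{t-1},\mathbi{x}_{t}]$ are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Compute W v + b for a matrix W (list of rows), vector v, bias b."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p holds W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]      # forget gate
    i_t = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]      # input gate
    c_hat = [math.tanh(a) for a in affine(p["W_c"], z, p["b_c"])]  # new information
    # Memory update: c_t = f_t * c_{t-1} + i_t * c_hat (element-wise).
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    o_t = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]      # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy usage: hidden size 1, input size 1, all parameters zero.
zero = {k: [[0.0, 0.0]] for k in ("W_f", "W_i", "W_c", "W_o")}
zero.update({k: [0.0] for k in ("b_f", "b_i", "b_c", "b_o")})
h_t, c_t = lstm_step([1.0], [0.5], [2.0], zero)
```

With all-zero parameters every gate equals 0.5 and the candidate $\hat{\mathbi{c}}_t$ is 0, so half of the old memory is kept: `c_t` is `[1.0]` and `h_t` is `0.5 * tanh(1.0)`.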

@@ -583,26 +582,26 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \subfigure[更新门]{\input{./Chapter10/Figures/figure-gru02}}
 \subfigure[隐藏状态更新]{\input{./Chapter10/Figures/figure-gru03}}
 \caption{GRU中的门控结构}
-\label{fig:10-17}
+\label{fig:10-13}
 \end{figure}
 %----------------------------------------------

-\parinterval GRU的输入和RNN是一样的，由输入$\mathbi{x}_t$和$t-1$时刻的状态$\mathbi{h}_{t-1}$组成。GRU只有两个门信号，分别是重置门和更新门。重置门$\mathbi{r}_t$用来控制前一时刻隐藏状态的记忆程度，其结构如图\ref{fig:10-17}(a)。更新门用来更新记忆，使用一个门同时完成遗忘和记忆两种操作，其结构如图\ref{fig:10-17}(b)。重置门和更新门的计算公式如下：
+\parinterval GRU的输入和RNN是一样的，由输入$\mathbi{x}_t$和$t-1$时刻的状态$\mathbi{h}_{t-1}$组成。GRU只有两个门信号，分别是重置门和更新门。重置门$\mathbi{r}_t$用来控制前一时刻隐藏状态的记忆程度，其结构如图\ref{fig:10-13}(a)。更新门用来更新记忆，使用一个门同时完成遗忘和记忆两种操作，其结构如图\ref{fig:10-13}(b)。重置门和更新门的计算公式如下：
 \begin{eqnarray}
-\mathbi{r}_t & = &\sigma (\mathbi{W}_r [\mathbi{h}_{t-1},\mathbi{x}_{t}] ) \label{eq:10-18} \\
-\mathbi{u}_t & = & \sigma (\mathbi{W}_u [\mathbi{h}_{t-1},\mathbi{x}_{t}]) \label{eq:10-19}
+\mathbi{r}_t & = &\sigma (\mathbi{W}_r [\mathbi{h}_{t-1},\mathbi{x}_{t}] ) \label{eq:10-12} \\
+\mathbi{u}_t & = & \sigma (\mathbi{W}_u [\mathbi{h}_{t-1},\mathbi{x}_{t}]) \label{eq:10-13}
 \end{eqnarray}

-\parinterval 当完成了重置门和更新门计算后，就需要更新当前隐藏状态，如图\ref{fig:10-17}(c)所示。在计算得到了重置门的权重$\mathbi{r}_t$后，使用其对前一时刻的状态$\mathbi{h}_{t-1}$进行重置($\mathbi{r}_t \cdot \mathbi{h}_{t-1}$)，将重置后的结果与$\mathbi{x}_t$拼接，通过Tanh激活函数将数据变换到[-1,1]范围内：
+\parinterval 当完成了重置门和更新门计算后，就需要更新当前隐藏状态，如图\ref{fig:10-13}(c)所示。在计算得到了重置门的权重$\mathbi{r}_t$后，使用其对前一时刻的状态$\mathbi{h}_{t-1}$进行重置($\mathbi{r}_t \cdot \mathbi{h}_{t-1}$)，将重置后的结果与$\mathbi{x}_t$拼接，通过Tanh激活函数将数据变换到[-1,1]范围内：
 \begin{eqnarray}
 \hat{\mathbi{h}}_t = \textrm{Tanh} (\mathbi{W}_h [\mathbi{r}_t \cdot \mathbi{h}_{t-1},\mathbi{x}_{t}])
-\label{eq:10-20}
+\label{eq:10-14}
 \end{eqnarray}

 \parinterval $\hat{\mathbi{h}}_t$在包含了输入信息$\mathbi{x}_t$的同时，引入了$\mathbi{h}_{t-1}$的信息，可以理解为，记忆了当前时刻的状态。下一步是计算更新后的隐藏状态也就是更新记忆，如下所示：
 \begin{eqnarray}
 \mathbi{h}_t = (1-\mathbi{u}_t) \cdot \mathbi{h}_{t-1} +\mathbi{u}_t \cdot \hat{\mathbi{h}}_t
-\label{eq:10-21}
+\label{eq:10-15}
 \end{eqnarray}

 \noindent 这里，$\mathbi{u}_t$是更新门中得到的权重，将$\mathbi{u}_t$作用于$\hat{\mathbi{h}}_t$表示对当前时刻的状态进行“遗忘”，舍弃一些不重要的信息，将$(1-\mathbi{u}_t)$作用于$\mathbi{h}_{t-1}$，用于对上一时刻隐藏状态进行选择性记忆。
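The GRU equations above (reset gate, update gate, candidate state, and the final interpolation) can be sketched in the same spirit. As before, the function names and the list-of-rows weight layout are illustrative assumptions; bias terms are omitted because the equations in this hunk do not include them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Compute W v for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x_t, h_prev, W_r, W_u, W_h):
    """One GRU step following the reset/update/candidate equations."""
    z = h_prev + x_t                                  # [h_{t-1}, x_t]
    r_t = [sigmoid(a) for a in matvec(W_r, z)]        # reset gate
    u_t = [sigmoid(a) for a in matvec(W_u, z)]        # update gate
    # Candidate state uses the reset previous state: [r_t * h_{t-1}, x_t].
    reset_h = [r * h for r, h in zip(r_t, h_prev)]
    h_hat = [math.tanh(a) for a in matvec(W_h, reset_h + x_t)]
    # h_t = (1 - u_t) * h_{t-1} + u_t * h_hat (element-wise).
    return [(1.0 - u) * h + u * g for u, h, g in zip(u_t, h_prev, h_hat)]

# Toy usage: hidden size 1, input size 1, all weights zero.
W0 = [[0.0, 0.0]]
h_t = gru_step([1.0], [2.0], W0, W0, W0)
```

With zero weights both gates are 0.5 and the candidate is 0, so the new state is simply half of the previous one (`[1.0]`), which shows how a single update gate trades off the old state against the candidate.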
@@ -615,14 +614,14 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm

 \subsection{双向模型}

-\parinterval 前面提到的循环神经网络都是自左向右运行的，也就是说在处理一个单词的时候只能访问它前面的序列信息。但是，只根据句子的前文来生成一个序列的表示是不全面的，因为从最后一个词来看，第一个词的信息可能已经很微弱了。为了同时考虑前文和后文的信息，一种解决办法是使用双向循环网络，其结构如图\ref{fig:10-18}所示。这里，编码器可以看作有两个循环神经网络，第一个网络，即红色虚线框里的网络，从句子的右边进行处理，第二个网络从句子左边开始处理，最终将正向和反向得到的结果都融合后传递给解码器。
+\parinterval 前面提到的循环神经网络都是自左向右运行的，也就是说在处理一个单词的时候只能访问它前面的序列信息。但是，只根据句子的前文来生成一个序列的表示是不全面的，因为从最后一个词来看，第一个词的信息可能已经很微弱了。为了同时考虑前文和后文的信息，一种解决办法是使用双向循环网络，其结构如图\ref{fig:10-14}所示。这里，编码器可以看作有两个循环神经网络，第一个网络，即红色虚线框里的网络，从句子的右边进行处理，第二个网络从句子左边开始处理，最终将正向和反向得到的结果都融合后传递给解码器。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-bi-rnn}
 \caption{基于双向循环神经网络的机器翻译模型结构}
-\label{fig:10-18}
+\label{fig:10-14}
 \end{figure}
 %----------------------------------------------

@@ -636,14 +635,14 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm

 \parinterval 实际上，对于单词序列所使用的循环神经网络是一种很“深”的网络，因为从第一个单词到最后一个单词需要经过至少句子长度相当层数的神经元。比如，一个包含几十个词的句子也会对应几十个神经元层。但是，在很多深度学习应用中，更习惯把对输入序列的同一种处理作为“一层”。比如，对于输入序列，构建一个RNN，那么这些循环单元就构成了网络的“一层”。当然，这里并不是要混淆概念。只是要明确，在随后的讨论中，“层”并不是指一组神经元的全连接，它一般指的是网络结构中逻辑上的一层。

-\parinterval 单层循环神经网络对输入序列进行了抽象，为了得到更深入的抽象能力，可以把多个循环神经网络叠在一起，构成多层循环神经网络。比如，图\ref{fig:10-19}就展示了基于两层循环神经网络的解码器和编码器结构。通常来说，层数越多模型的表示能力越强，因此在很多基于循环神经网络的机器翻译系统中一般会使用4$\sim$8层的网络。但是，过多的层也会增加模型训练的难度，甚至导致模型无法进行训练。{\chapterthirteen}还会对这个问题进行深入讨论。
+\parinterval 单层循环神经网络对输入序列进行了抽象，为了得到更深入的抽象能力，可以把多个循环神经网络叠在一起，构成多层循环神经网络。比如，图\ref{fig:10-15}就展示了基于两层循环神经网络的解码器和编码器结构。通常来说，层数越多模型的表示能力越强，因此在很多基于循环神经网络的机器翻译系统中一般会使用4$\sim$8层的网络。但是，过多的层也会增加模型训练的难度，甚至导致模型无法进行训练。{\chapterthirteen}还会对这个问题进行深入讨论。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-double-layer-rnn} \hspace{10em}
 \caption{基于双层循环神经网络的机器翻译模型结构}
-\label{fig:10-19}
+\label{fig:10-15}
 \end{figure}
 %----------------------------------------------

@@ -662,14 +661,14 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm

 \noindent 之所以能想到在横线处填“吃饭”、“吃东西”很有可能是因为看到了“没/吃饭”、 “很/饿”等关键信息。也就是这些关键的片段对预测缺失的单词起着关键性作用。而预测“吃饭”与前文中的“ 中午”、“又”之间的联系似乎不那么紧密。也就是说，在形成 “吃饭”的逻辑时，在潜意识里会更注意“没/吃饭”、“很/饿”等关键信息。也就是我们的关注度并不是均匀地分布在整个句子上的。

-\parinterval 这个现象可以用注意力机制进行解释。注意力机制的概念来源于生物学的一些现象：当待接收的信息过多时，人类会选择性地关注部分信息而忽略其他信息。它在人类的视觉、听觉、嗅觉等方面均有体现，当我们在感受事物时，大脑会自动过滤或衰减部分信息，仅关注其中少数几个部分。例如，当看到图\ref{fig:10-20}时，往往不是“均匀地”看图像中的所有区域，可能最先注意到的是小狗的嘴，然后才会关注图片中其他的部分。那注意力机制是如何解决神经机器翻译的问题呢？下面就一起来看一看。
+\parinterval 这个现象可以用注意力机制进行解释。注意力机制的概念来源于生物学的一些现象：当待接收的信息过多时，人类会选择性地关注部分信息而忽略其他信息。它在人类的视觉、听觉、嗅觉等方面均有体现，当我们在感受事物时，大脑会自动过滤或衰减部分信息，仅关注其中少数几个部分。例如，当看到图\ref{fig:10-16}时，往往不是“均匀地”看图像中的所有区域，可能最先注意到的是小狗的嘴，然后才会关注图片中其他的部分。那注意力机制是如何解决神经机器翻译的问题呢？下面就一起来看一看。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \includegraphics[scale=0.05]{./Chapter10/Figures/dog-hat-new.jpg}
 \caption{戴帽子的狗}
-\label{fig:10-20}
+\label{fig:10-16}
 \end{figure}
 %----------------------------------------------

@@ -688,27 +687,27 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \vspace{0.5em}
 \end{itemize}

-\parinterval 更直观的，如图\ref{fig:10-21}，目标语言中的“very long”仅依赖于源语言中的“很长”。这时如果将所有源语言编码成一个固定的实数向量，“很长”的信息就很可能被其他词的信息淹没掉。
+\parinterval 更直观的，如图\ref{fig:10-17}，目标语言中的“very long”仅依赖于源语言中的“很长”。这时如果将所有源语言编码成一个固定的实数向量，“很长”的信息就很可能被其他词的信息淹没掉。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-attention-of-source-and-target-words}
 \caption{源语言词和目标语言词的关注度}
-\label{fig:10-21}
+\label{fig:10-17}
 \end{figure}
 %----------------------------------------------

 \parinterval 显然，以上问题的根本原因在于所使用的表示模型还比较“弱”。因此需要一个更强大的表示模型，在生成目标语言单词时能够有选择地获取源语言句子中更有用的部分。更准确的说，对于要生成的目标语单词，相关性更高的源语言片段应该在源语言句子的表示中体现出来，而不是将所有的源语言单词一视同仁。在神经机器翻译中引入注意力机制正是为了达到这个目的\upcite{bahdanau2014neural,DBLP:journals/corr/LuongPM15}。实际上，除了机器翻译，注意力机制也被成功地应用于图像处理、语音识别、自然语言处理等其他任务。也正是注意力机制的引入，使得包括机器翻译在内很多自然语言处理系统得到了飞跃发展。

-\parinterval 神经机器翻译中的注意力机制并不复杂。对于每个目标语言单词$y_j$，系统生成一个源语言表示向量$\mathbi{C}_j$与之对应，$\mathbi{C}_j$会包含生成$y_j$所需的源语言的信息，或者说$\mathbi{C}_j$是一种包含目标语言单词与源语言单词对应关系的源语言表示。相比用一个静态的表示$\mathbi{C}$，注意机制使用的是动态的表示$\mathbi{C}_j$。$\mathbi{C}_j$也被称作对于目标语言位置$j$的{\small\bfnew{上下文向量}}\index{上下文向量}（Context Vector\index{Context Vector}）。图\ref{fig:10-22}对比了未引入注意力机制和引入了注意力机制的编码器- 解码器结构。可以看出，在注意力模型中，对于每一个目标单词的生成，都会额外引入一个单独的上下文向量参与运算。
+\parinterval 神经机器翻译中的注意力机制并不复杂。对于每个目标语言单词$y_j$，系统生成一个源语言表示向量$\mathbi{C}_j$与之对应，$\mathbi{C}_j$会包含生成$y_j$所需的源语言的信息，或者说$\mathbi{C}_j$是一种包含目标语言单词与源语言单词对应关系的源语言表示。相比用一个静态的表示$\mathbi{C}$，注意机制使用的是动态的表示$\mathbi{C}_j$。$\mathbi{C}_j$也被称作对于目标语言位置$j$的{\small\bfnew{上下文向量}}\index{上下文向量}（Context Vector\index{Context Vector}）。图\ref{fig:10-18}对比了未引入注意力机制和引入了注意力机制的编码器- 解码器结构。可以看出，在注意力模型中，对于每一个目标单词的生成，都会额外引入一个单独的上下文向量参与运算。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-encoder-decoder-with-attention}
 \caption{不使用(a)和使用(b)注意力机制的翻译模型对比}
-\label{fig:10-22}
+\label{fig:10-18}
 \end{figure}
 %----------------------------------------------

@@ -723,28 +722,28 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \parinterval 根据这种思想，上下文向量$\mathbi{C}_j$被定义为对不同时间步编码器输出的状态序列$\{ \mathbi{h}_1, \mathbi{h}_2,...,\mathbi{h}_m \}$进行加权求和，如下：
 \begin{eqnarray}
 \mathbi{C}_j=\sum_{i} \alpha_{i,j} \mathbi{h}_i
-\label{eq:10-22}
+\label{eq:10-16}
 \end{eqnarray}

-\noindent 其中，$\alpha_{i,j}$是{\small\sffamily\bfseries{注意力权重}}\index{注意力权重}（Attention Weight）\index{Attention Weight}，它表示目标语言第$j$个位置与源语言第$i$个位置之间的相关性大小。这里，将每个时间步编码器的输出$\mathbi{h}_i$ 看作源语言位置$i$的表示结果。进行翻译时，解码端可以根据当前的位置$j$，通过控制不同$\mathbi{h}_i$的权重得到$\mathbi{C}_j$，使得对目标语言位置$j$贡献大的$\mathbi{h}_i$对$\mathbi{C}_j$的影响增大。也就是说，$\mathbi{C}_j$实际上就是\{${\mathbi{h}_1, \mathbi{h}_2,...,\mathbi{h}_m}$\}的一种组合，只不过不同的$\mathbi{h}_i$会根据对目标端的贡献给予不同的权重。图\ref{fig:10-23}展示了上下文向量$\mathbi{C}_j$的计算过程。
+\noindent 其中，$\alpha_{i,j}$是{\small\sffamily\bfseries{注意力权重}}\index{注意力权重}（Attention Weight）\index{Attention Weight}，它表示目标语言第$j$个位置与源语言第$i$个位置之间的相关性大小。这里，将每个时间步编码器的输出$\mathbi{h}_i$ 看作源语言位置$i$的表示结果。进行翻译时，解码端可以根据当前的位置$j$，通过控制不同$\mathbi{h}_i$的权重得到$\mathbi{C}_j$，使得对目标语言位置$j$贡献大的$\mathbi{h}_i$对$\mathbi{C}_j$的影响增大。也就是说，$\mathbi{C}_j$实际上就是\{${\mathbi{h}_1, \mathbi{h}_2,...,\mathbi{h}_m}$\}的一种组合，只不过不同的$\mathbi{h}_i$会根据对目标端的贡献给予不同的权重。图\ref{fig:10-19}展示了上下文向量$\mathbi{C}_j$的计算过程。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-calculation-process-of-context-vector-c}
 \caption{上下文向量$\mathbi{C}_j$的计算过程}
-\label{fig:10-23}
+\label{fig:10-19}
 \end{figure}
 %----------------------------------------------

-\parinterval 如图\ref{fig:10-23}所示，注意力权重$\alpha_{i,j}$的计算分为两步：
+\parinterval 如图\ref{fig:10-19}所示，注意力权重$\alpha_{i,j}$的计算分为两步：

 \begin{itemize}
 \vspace{0.5em}
 \item	使用目标语言上一时刻循环单元的输出$\mathbi{s}_{j-1}$与源语言第$i$个位置的表示$\mathbi{h}_i$之间的相关性，其用来表示目标语言位置$j$对源语言位置$i$的关注程度，记为$\beta_{i,j}$，由函数$a(\cdot)$实现：
 \begin{eqnarray}
 \beta_{i,j} = a(\mathbi{s}_{j-1},\mathbi{h}_i)
-\label{eq:10-23}
+\label{eq:10-17}
 \end{eqnarray}

 $a(\cdot)$可以被看作是目标语言表示和源语言表示的一种“统一化”，即把源语言和目标语言表示映射在同一个语义空间，进而语义相近的内容有更大的相似性。该函数有多种计算方式，比如，向量乘、向量夹角和单层神经网络等，数学表达如下：
@@ -756,7 +755,7 @@ a (\mathbi{s},\mathbi{h}) =  \left\{ \begin{array}{ll}
    \textrm{Tanh}(\mathbi{W}[\mathbi{s},\mathbi{h}])\mathbi{v}^{\textrm{T}} & \textrm{拼接}[\mathbi{s},\mathbi{h}]+\textrm{单层网络}
    \end{array}
    \right.
-\label{eq:10-24}
+\label{eq:10-18}
 \end{eqnarray}

 其中$\mathbi{W}$和$\mathbi{v}$是可学习的参数。
@@ -765,39 +764,39 @@ a (\mathbi{s},\mathbi{h}) =  \left\{ \begin{array}{ll}
 \vspace{0.5em}
 \begin{eqnarray}
 \alpha_{i,j}=\frac{\textrm{exp}(\beta_{i,j})} {\sum_{i'} \textrm{exp}(\beta_{i',j})}
-\label{eq:10-25}
+\label{eq:10-19}
 \end{eqnarray}
 \vspace{0.5em}

-最终，\{$\alpha_{i,j}$\}可以被看作是一个矩阵，它的长为目标语言句子长度，宽为源语言句子长度，矩阵中的每一项对应一个$\alpha_{i,j}$。图\ref{fig:10-24}给出了\{$\alpha_{i,j}$\}的一个矩阵表示。图中蓝色方框的大小表示不同的注意力权重$\alpha_{i,j}$的大小，方框越大，源语言位置$i$和目标语言位置$j$的相关性越高。能够看到，对于互译的中英文句子，\{$\alpha_{i,j}$\}可以较好的反应两种语言之间不同位置的对应关系。
+最终，\{$\alpha_{i,j}$\}可以被看作是一个矩阵，它的长为目标语言句子长度，宽为源语言句子长度，矩阵中的每一项对应一个$\alpha_{i,j}$。图\ref{fig:10-20}给出了\{$\alpha_{i,j}$\}的一个矩阵表示。图中蓝色方框的大小表示不同的注意力权重$\alpha_{i,j}$的大小，方框越大，源语言位置$i$和目标语言位置$j$的相关性越高。能够看到，对于互译的中英文句子，\{$\alpha_{i,j}$\}可以较好的反应两种语言之间不同位置的对应关系。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-matrix-representation-of-attention-weights-between-chinese-english-sentence-pairs}
 \caption{一个汉英句对之间的注意力权重{$\alpha_{i,j}$}的矩阵表示}
-\label{fig:10-24}
+\label{fig:10-20}
 \end{figure}
 %----------------------------------------------

 \vspace{0.5em}
 \end{itemize}

-\parinterval 图\ref{fig:10-25}展示了一个上下文向量的计算过程实例。首先，计算目标语言第一个单词“Have”与源语言中的所有单词的相关性，即注意力权重，对应图中第一列$\alpha_{i,1}$，则当前时刻所使用的上下文向量$\mathbi{C}_1 = \sum_{i=1}^8 \alpha_{i,1} \mathbi{h}_i$；然后，计算第二个单词“you”的注意力权重对应第二列$\alpha_{i,2}$，其上下文向量$\mathbi{C}_2 = \sum_{i=1}^8 \alpha_{i,2} \mathbi{h}_i$，以此类推，可以得到任意目标语言位置$j$的上下文向量$\mathbi{C}_j$。很容易看出，不同目标语言单词的上下文向量对应的源语言词的权重$\alpha_{i,j}$是不同的，不同的注意力权重为不同位置赋予了不同的重要性。
+\parinterval 图\ref{fig:10-21}展示了一个上下文向量的计算过程实例。首先，计算目标语言第一个单词“Have”与源语言中的所有单词的相关性，即注意力权重，对应图中第一列$\alpha_{i,1}$，则当前时刻所使用的上下文向量$\mathbi{C}_1 = \sum_{i=1}^8 \alpha_{i,1} \mathbi{h}_i$；然后，计算第二个单词“you”的注意力权重对应第二列$\alpha_{i,2}$，其上下文向量$\mathbi{C}_2 = \sum_{i=1}^8 \alpha_{i,2} \mathbi{h}_i$，以此类推，可以得到任意目标语言位置$j$的上下文向量$\mathbi{C}_j$。很容易看出，不同目标语言单词的上下文向量对应的源语言词的权重$\alpha_{i,j}$是不同的，不同的注意力权重为不同位置赋予了不同的重要性。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-example-of-context-vector-calculation-process}
 \caption{上下文向量计算过程实例}
-\label{fig:10-25}
+\label{fig:10-21}
 \end{figure}
 %----------------------------------------------

-\parinterval 在\ref{sec:10.3.1}节中，公式\eqref{eq:10-5}描述了目标语言单词生成概率$ \funp{P} (y_j | \mathbi{y}_{<j},\mathbi{x})$。在引入注意力机制后，不同时刻的上下文向量$\mathbi{C}_j$替换了传统模型中固定的句子表示$\mathbi{C}$。描述如下：
+\parinterval 在\ref{sec:10.3.1}节中，公式\eqref{eq:10-4}描述了目标语言单词生成概率$ \funp{P} (y_j | \mathbi{y}_{<j},\mathbi{x})$。在引入注意力机制后，不同时刻的上下文向量$\mathbi{C}_j$替换了传统模型中固定的句子表示$\mathbi{C}$。描述如下：
 \begin{eqnarray}
 \funp{P} (y_j | \mathbi{y}_{<j},\mathbi{x}) \equiv \funp{P} (y_j | \mathbi{s}_{j-1},y_{j-1},\mathbi{C}_j )
-\label{eq:10-26}
+\label{eq:10-20}
 \end{eqnarray}

 \parinterval 这样，可以在生成每个$y_j$时动态的使用不同的源语言表示$\mathbi{C}_j$，并更准确地捕捉源语言和目标语言不同位置之间的相关性。表\ref{tab:10-7}展示了引入注意力机制前后译文单词生成公式的对比。
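The two-step computation of the attention weights and the weighted sum that yields $\mathbi{C}_j$ can be sketched as below. The sketch assumes the dot-product form of $a(\cdot)$ (one of the options listed for that function); the names `dot_score`, `softmax` and `context_vector` are hypothetical, and subtracting the maximum score inside the Softmax is a common numerical-stability trick that does not change the result.

```python
import math

def dot_score(s, h):
    """beta_{i,j} = a(s_{j-1}, h_i), here taken to be a dot product."""
    return sum(a * b for a, b in zip(s, h))

def softmax(scores):
    """alpha_{i,j} = exp(beta_{i,j}) / sum_{i'} exp(beta_{i',j})."""
    m = max(scores)                      # shift for numerical stability
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(s_prev, hs, score=dot_score):
    """Return (C_j, alphas) for decoder state s_{j-1} and encoder states hs."""
    betas = [score(s_prev, h) for h in hs]
    alphas = softmax(betas)
    dim = len(hs[0])
    c_j = [sum(a * h[k] for a, h in zip(alphas, hs)) for k in range(dim)]
    return c_j, alphas

# Toy usage: two source positions with 2-dimensional encoder states.
hs = [[1.0, 0.0], [0.0, 1.0]]
c_uniform, a_uniform = context_vector([0.0, 0.0], hs)
c_peaked, a_peaked = context_vector([10.0, 0.0], hs)
```

A decoder state that scores both positions equally gives uniform weights, while a state aligned with the first encoder state puts almost all weight on it, so the context vector moves toward that source position.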
@@ -827,47 +826,47 @@ a (\mathbi{s},\mathbi{h}) =  \left\{ \begin{array}{ll}

 \parinterval 那么，如何理解这个过程？注意力机制的本质又是什么呢？换一个角度来看，实际上，目标语言位置$j$可以被看作是一个查询，我们希望从源语言端找到与之最匹配的源语言位置，并返回相应的表示结果。为了描述这个问题，可以建立一个查询系统。假设有一个库，里面包含若干个$\mathrm{key}$-$\mathrm{value}$单元，其中$\mathrm{key}$代表这个单元的索引关键字，$\mathrm{value}$代表这个单元的值。比如，对于学生信息系统，$\mathrm{key}$可以是学号，$\mathrm{value}$可以是学生的身高。当输入一个查询$\mathrm{query}$，我们希望这个系统返回与之最匹配的结果。也就是，希望找到匹配的$\mathrm{key}$，并输出其对应的$\mathrm{value}$。比如，当查询某个学生的身高信息时，可以输入学生的学号，之后在库中查询与这个学号相匹配的记录，并把这个记录中的$\mathrm{value}$（即身高）作为结果返回。

-\parinterval 图\ref{fig:10-26}展示了一个这样的查询系统。里面包含四个$\mathrm{key}$-$\mathrm{value}$单元，当输入查询$\mathrm{query}$，就把$\mathrm{query}$与这四个$\mathrm{key}$逐个进行匹配，如果完全匹配就返回相应的$\mathrm{value}$。在图中的例子中，$\mathrm{query}$和$\mathrm{key}_3$是完全匹配的（因为都是横纹），因此系统返回第三个单元的值，即$\mathrm{value}_3$。当然，如果库中没有与$\mathrm{query}$匹配的$\mathrm{key}$，则返回一个空结果。
+\parinterval 图\ref{fig:10-22}展示了一个这样的查询系统。里面包含四个$\mathrm{key}$-$\mathrm{value}$单元，当输入查询$\mathrm{query}$，就把$\mathrm{query}$与这四个$\mathrm{key}$逐个进行匹配，如果完全匹配就返回相应的$\mathrm{value}$。在图中的例子中，$\mathrm{query}$和$\mathrm{key}_3$是完全匹配的（因为都是横纹），因此系统返回第三个单元的值，即$\mathrm{value}_3$。当然，如果库中没有与$\mathrm{query}$匹配的$\mathrm{key}$，则返回一个空结果。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-query-model-corresponding-to-traditional-query-model-vs-attention-mechanism}
 \caption{传统查询模型}
-\label{fig:10-26}
+\label{fig:10-22}
 \end{figure}
 %----------------------------------------------

 \parinterval 也可以用这个系统描述翻译中的注意力问题，其中，$\mathrm{query}$即目标语言位置$j$的某种表示，$\mathrm{key}$和$\mathrm{value}$即源语言每个位置$i$上的${\mathbi{h}_i}$（这里$\mathrm{key}$和$\mathrm{value}$是相同的）。但是，这样的系统在机器翻译问题上并不好用，因为目标语言的表示和源语言的表示都在多维实数空间上，所以无法要求两个实数向量像字符串一样进行严格匹配，或者说这种严格匹配的模型可能会导致$\mathrm{query}$几乎不会命中任何的$\mathrm{key}$。既然无法严格精确匹配，注意力机制就采用了一个“模糊”匹配的方法。这里定义每个$\mathrm{key}_i$和$\mathrm{query}$ 都有一个0～1之间的匹配度，这个匹配度描述了$\mathrm{key}_i$和$\mathrm{query}$之间的相关程度，记为$\alpha_i$。而查询的结果（记为$\overline{\mathrm{value}}$）也不再是某一个单元的$\mathrm{value}$，而是所有单元$\mathrm{value}$用$\alpha_i$的加权和：
 \begin{eqnarray}
 \overline{\mathrm{value}} = \sum_i \alpha_i \cdot {\mathrm{value}}_i
-\label{eq:10-27}
+\label{eq:10-21}
 \end{eqnarray}

 \noindent 也就是说所有的$\mathrm{value}_i$都会对查询结果有贡献，只是贡献度不同罢了。可以通过设计$\alpha_i$来捕捉$\mathrm{key}$和$\mathrm{query}$之间的相关性，以达到相关度越大的$\mathrm{key}$所对应的$\mathrm{value}$对结果的贡献越大。

-\parinterval 重新回到神经机器翻译问题上来。这种基于模糊匹配的查询模型可以很好的满足对注意力建模的要求。实际上，公式\eqref{eq:10-27}中的$\alpha_i$就是前面提到的注意力权重，它可以由注意力函数$a(\cdot)$计算得到。这样，$\overline{\mathrm{value}}$就是得到的上下文向量，它包含了所有\{$\mathbi{h}_i$\}的信息，只是不同$\mathbi{h}_i$的贡献度不同罢了。图\ref{fig:10-27}展示了将基于模糊匹配的查询模型应用于注意力机制的实例。
+\parinterval 重新回到神经机器翻译问题上来。这种基于模糊匹配的查询模型可以很好的满足对注意力建模的要求。实际上，公式\eqref{eq:10-21}中的$\alpha_i$就是前面提到的注意力权重，它可以由注意力函数$a(\cdot)$计算得到。这样，$\overline{\mathrm{value}}$就是得到的上下文向量，它包含了所有\{$\mathbi{h}_i$\}的信息，只是不同$\mathbi{h}_i$的贡献度不同罢了。图\ref{fig:10-23}展示了将基于模糊匹配的查询模型应用于注意力机制的实例。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-query-model-corresponding-to-attention-mechanism}
 \caption{注意力机制所对应的查询模型}
-\label{fig:10-27}
+\label{fig:10-23}
 \end{figure}
 %----------------------------------------------

-\parinterval 最后，从统计学的角度，如果把$\alpha_i$作为每个$\mathrm{value}_i$出现的概率的某种估计，即：$ \funp{P} (\mathrm{value}_i$) $= \alpha_i$，于是可以把公式\eqref{eq:10-27}重写为：
+\parinterval 最后，从统计学的角度，如果把$\alpha_i$作为每个$\mathrm{value}_i$出现的概率的某种估计，即：$ \funp{P} (\mathrm{value}_i$) $= \alpha_i$，于是可以把公式\eqref{eq:10-21}重写为：
 \begin{eqnarray}
 \overline{\mathrm{value}} = \sum_i \funp{P} ( {\mathrm{value}}_i) \cdot {\mathrm{value}}_i
-\label{eq:10-28}
+\label{eq:10-22}
 \end{eqnarray}

 \noindent 显然， $\overline{\mathrm{value}}$就是$\mathrm{value}_i$在分布$ \funp{P}( \mathrm{value}_i$)下的期望，即

 \begin{equation}
 \mathbb{E}_{\sim \\ \funp{P} ( {\mathrm{\mathrm{value}}}_i )} ({\mathrm{value}}_i) = \sum_i \funp{P} ({\mathrm{value}}_i) \cdot {\mathrm{value}}_i
-\label{eq:10-29}
+\label{eq:10-23}
 \end{equation}

 从这个观点看，注意力机制实际上是得到了变量$\mathrm{value}$的期望。当然，严格意义上说，$\alpha_i$并不是从概率角度定义的，在实际应用中也并不必须追求严格的统计学意义。
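The contrast between the exact-match query system and the "fuzzy" one can be made concrete with the student-height example used earlier in this section. The student IDs, heights and matching degrees below are made-up illustrative numbers, and `hard_lookup` and `soft_lookup` are hypothetical helper names.

```python
def hard_lookup(query, keys, values):
    """Traditional query: return the value of the exactly matching key."""
    for k, v in zip(keys, values):
        if k == query:
            return v
    return None  # no key matches, so the result is empty

def soft_lookup(alphas, values):
    """Fuzzy query: weighted sum of all values, sum_i alpha_i * value_i."""
    return sum(a * v for a, v in zip(alphas, values))

keys = ["s1", "s2", "s3", "s4"]          # student IDs (illustrative)
heights = [170.0, 165.0, 180.0, 175.0]   # heights in cm (illustrative)

exact = hard_lookup("s3", keys, heights)
missing = hard_lookup("s9", keys, heights)

# Matching degrees in [0, 1] that sum to 1, peaked on the third key.
alphas = [0.1, 0.1, 0.7, 0.1]
expected_height = soft_lookup(alphas, heights)
```

Because the weights sum to 1, `expected_height` is exactly the expectation of the value under the distribution given by `alphas`, which is the statistical reading given above. Every unit contributes, but the best-matching one dominates.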
@@ -880,7 +879,7 @@ a (\mathbi{s},\mathbi{h}) =  \left\{ \begin{array}{ll}

 \parinterval 循环神经网络在机器翻译中有很多成功的应用，比如：RNNSearch\upcite{bahdanau2014neural}、Nematus\upcite{DBLP:journals/corr/SennrichFCBHHJL17}等系统就被很多研究者作为实验系统。在众多基于循环神经网络的系统中，GNMT系统是非常成功的一个\upcite{Wu2016GooglesNM}。GNMT是谷歌2016年发布的神经机器翻译系统。

-\parinterval GNMT使用了编码器-解码器结构，构建了一个8层的深度网络，每层网络均由LSTM组成，且在编码器-解码器之间使用了多层注意力连接。其结构如图\ref{fig:10-35}，编码器只有最下面2层为双向LSTM。GNMT在束搜索中也加入了长度惩罚和覆盖度因子来确保输出高质量的翻译结果。
+\parinterval GNMT使用了编码器-解码器结构，构建了一个8层的深度网络，每层网络均由LSTM组成，且在编码器-解码器之间使用了多层注意力连接。其结构如图\ref{fig:10-24}，编码器只有最下面2层为双向LSTM。GNMT在束搜索中也加入了长度惩罚和覆盖度因子来确保输出高质量的翻译结果。
 \vspace{0.5em}

 %----------------------------------------------
@@ -888,17 +887,17 @@ a (\mathbi{s},\mathbi{h}) =  \left\{ \begin{array}{ll}
 \centering
 \input{./Chapter10/Figures/figure-structure-of-gnmt}
 \caption{GNMT结构}
-\label{fig:10-35}
+\label{fig:10-24}
 \end{figure}
 %----------------------------------------------

-\parinterval 实际上，GNMT的主要贡献在于集成了多种优秀的技术，而且在大规模数据上证明了神经机器翻译的有效性。在引入注意力机制之前，神经机器翻译在较大规模的任务上的性能弱于统计机器翻译。加入注意力机制和深层网络后，神经机器翻译性能有了很大的提升。在英德和英法的任务中，GNMT的BLEU值不仅超过了当时优秀的神经机器翻译系统RNNSearch和LSTM（6层），还超过了当时处于领导地位的基于短语的统计机器翻译系统（PBMT）（表\ref{tab:10-10}）。相比基于短语的统计机器翻译系统，在人工评价中，GNMT能将翻译错误平均减少60\%。这一结果也充分表明了神经机器翻译带来的巨大性能提升。
+\parinterval 实际上，GNMT的主要贡献在于集成了多种优秀的技术，而且在大规模数据上证明了神经机器翻译的有效性。在引入注意力机制之前，神经机器翻译在较大规模的任务上的性能弱于统计机器翻译。加入注意力机制和深层网络后，神经机器翻译性能有了很大的提升。在英德和英法的任务中，GNMT的BLEU值不仅超过了当时优秀的神经机器翻译系统RNNSearch和LSTM（6层），还超过了当时处于领导地位的基于短语的统计机器翻译系统（PBMT）（表\ref{tab:10-8}）。相比基于短语的统计机器翻译系统，在人工评价中，GNMT能将翻译错误平均减少60\%。这一结果也充分表明了神经机器翻译带来的巨大性能提升。

 %----------------------------------------------
 \begin{table}[htp]
 \centering
 \caption{GNMT与其他翻译模型对比\upcite{Wu2016GooglesNM}}
-\label{tab:10-10}
+\label{tab:10-8}
 \begin{tabular}{l l l}
 \multicolumn{1}{l|}{\multirow{3}{*}{\#}} & \multicolumn{2}{c}{BLEU[\%]} \\
 \multicolumn{1}{l|}{}                    & 英德  & 英法                                               \\
@@ -925,16 +924,16 @@ a (\mathbi{s},\mathbi{h}) =  \left\{ \begin{array}{ll}
 %----------------------------------------------------------------------------------------
 \subsection{训练}

-\parinterval 在基于梯度的方法中，模型参数可以通过损失函数$L$对于参数的梯度进行不断更新。对于第$\textrm{step}$步参数更新，首先进行神经网络的前向计算，之后进行反向计算，并得到所有参数的梯度信息，再使用下面的规则进行参数更新：
+\parinterval 在基于梯度的方法中，模型参数可以通过损失函数$L$对参数的梯度进行不断更新。对于第$\textrm{step}$步参数更新，首先进行神经网络的前向计算，之后进行反向计算，并得到所有参数的梯度信息，再使用下面的规则进行参数更新：

 \begin{eqnarray}
 \mathbi{w}_{\textrm{step}+1} = \mathbi{w}_{\textrm{step}} - \alpha \cdot \frac{ \partial L(\mathbi{w}_{\textrm{step}})} {\partial \mathbi{w}_{\textrm{step}} }
-\label{eq:10-30}
+\label{eq:10-24}
 \end{eqnarray}

-\noindent 其中，$\mathbi{w}_{\textrm{step}}$表示更新前的模型参数，$\mathbi{w}_{\textrm{step}+1}$表示更新后的模型参数，$L(\mathbi{w}_{\textrm{step}})$表示模型相对于$\mathbi{w}_{\textrm{step}}$ 的损失，$\frac{\partial L(\mathbi{w}_{\textrm{step}})} {\partial \mathbi{w}_{\textrm{step}} }$表示损失函数的梯度，$\alpha$是更新的步进值。也就是说，给定一定量的训练数据，不断执行公式\eqref{eq:10-30}的过程。反复使用训练数据，直至模型参数达到收敛或者损失函数不再变化。通常，把公式的一次执行称为“一步”更新/训练，把访问完所有样本的训练称为“一轮”训练。
+\noindent 其中，$\mathbi{w}_{\textrm{step}}$表示更新前的模型参数，$\mathbi{w}_{\textrm{step}+1}$表示更新后的模型参数，$L(\mathbi{w}_{\textrm{step}})$表示模型相对于$\mathbi{w}_{\textrm{step}}$ 的损失，$\frac{\partial L(\mathbi{w}_{\textrm{step}})} {\partial \mathbi{w}_{\textrm{step}} }$表示损失函数的梯度，$\alpha$是更新的步长。也就是说，给定一定量的训练数据，不断执行公式\eqref{eq:10-24}的过程。反复使用训练数据，直至模型参数达到收敛或者损失函数不再变化。通常，把公式的一次执行称为“一步”更新/训练，把访问完所有样本的训练称为“一轮”训练。

-\parinterval 将公式\eqref{eq:10-30}应用于神经机器翻译有几个基本问题需要考虑：1）损失函数的选择；2）参数初始化的策略，也就是如何设置$\mathbi{w}_0$；3）优化策略和学习率调整策略；4）训练加速。下面对这些问题进行讨论。
+\parinterval 将公式\eqref{eq:10-24}应用于神经机器翻译有几个基本问题需要考虑：1）损失函数的选择；2）参数初始化的策略，也就是如何设置$\mathbi{w}_0$；3）优化策略和学习率调整策略；4）训练加速。下面对这些问题进行讨论。
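
The update rule can be tried on a toy problem. The sketch below assumes a hypothetical quadratic loss whose minimum lies at 3 in every coordinate; the names `sgd_step`, `loss` and `grad` are illustrative and do not come from any NMT toolkit.

```python
def sgd_step(w, grad_w, alpha):
    """One update: w_{step+1} = w_step - alpha * dL/dw."""
    return [wi - alpha * gi for wi, gi in zip(w, grad_w)]

def loss(w):
    """Toy loss (an assumption): squared distance to the point (3, 3)."""
    return sum((wi - 3.0) ** 2 for wi in w)

def grad(w):
    """Analytic gradient of the toy loss."""
    return [2.0 * (wi - 3.0) for wi in w]

w = [0.0, 10.0]          # plays the role of the initial parameters w_0
history = [loss(w)]
for step in range(50):   # each iteration is one "step" of training
    w = sgd_step(w, grad(w), alpha=0.1)
    history.append(loss(w))
```

With this step size the distance to the minimum shrinks by a constant factor at every step; a much larger `alpha` would make the iterates oscillate or diverge.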

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -945,16 +944,16 @@ a (\mathbi{s},\mathbi{h}) =  \left\{ \begin{array}{ll}
 \parinterval 神经机器翻译在目标端的每个位置都会输出一个概率分布，表示这个位置上不同单词出现的可能性。设计损失函数时，需要知道当前位置输出的分布相比于标准答案的“差异”。对于这个问题，常用的是交叉熵损失函数。令$\mathbi{y}$表示机器翻译模型输出的分布，$\hat{\mathbi{y}}$ 表示标准答案，则交叉熵损失可以被定义为：
 \begin{eqnarray}
 L_{\textrm{ce}}(\mathbi{y},\hat{\mathbi{y}}) = - \sum_{k=1}^{|V|} \hat{\mathbi{y}}[k] \textrm{log} (\mathbi{y}[k])
-\label{eq:10-3222}
+\label{eq:10-25}
 \end{eqnarray}

 \noindent 其中$\mathbi{y}[k]$ 和$\hat{\mathbi{y}}[k]$分别表示向量$\mathbi{y}$和$\hat{\mathbi{y}}$的第$k$维，$|V|$表示输出向量的维度（等于词表大小）。假设有$n$个训练样本，模型输出的概率分布为$\mathbi{Y} = \{ \mathbi{y}_1,\mathbi{y}_2,..., \mathbi{y}_n \}$，标准答案的分布$\widehat{\mathbi{Y}}=\{ \hat{\mathbi{y}}_1, \hat{\mathbi{y}}_2,...,\hat{\mathbi{y}}_n \}$。这个训练样本集合上的损失函数可以被定义为：
 \begin{eqnarray}
 L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\hat{\mathbi{y}}_j)
-\label{eq:10-31}
+\label{eq:10-26}
 \end{eqnarray}

-\parinterval 公式\eqref{eq:10-31}是一种非常通用的损失函数形式，除了交叉熵，也可以使用其他的损失函数，这时只需要替换$L_{\textrm{ce}} (\cdot)$即可。这里使用交叉熵损失函数的好处在于，它非常容易优化，特别是与Softmax组合，其反向传播的实现非常高效。此外，交叉熵损失（在一定条件下）也对应了极大似然的思想，这种方法在自然语言处理中已经被证明是非常有效的。
+\parinterval 公式\eqref{eq:10-26}是一种非常通用的损失函数形式，除了交叉熵，也可以使用其他的损失函数，这时只需要替换$L_{\textrm{ce}} (\cdot)$即可。这里使用交叉熵损失函数的好处在于，它非常容易优化，特别是与Softmax组合，其反向传播的实现非常高效。此外，交叉熵损失（在一定条件下）也对应了极大似然的思想，这种方法在自然语言处理中已经被证明是非常有效的。
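
A minimal sketch of the two loss definitions, assuming the usual reading in which the gold distribution $\hat{\mathbi{y}}$ weights the logarithm of the model output $\mathbi{y}$. The probabilities are made-up numbers over a three-word vocabulary.

```python
import math

def cross_entropy(y, y_hat):
    """L_ce for one position: y is the model output, y_hat the gold distribution."""
    # Terms with zero gold probability contribute nothing, so they are skipped.
    return -sum(g * math.log(p) for p, g in zip(y, y_hat) if g > 0.0)

def corpus_loss(Y, Y_hat):
    """Sum of the per-sample losses over the whole training set."""
    return sum(cross_entropy(y, y_hat) for y, y_hat in zip(Y, Y_hat))

Y = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]        # model distributions
Y_hat = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # one-hot gold answers
total = corpus_loss(Y, Y_hat)
```

With one-hot gold answers the loss reduces to the negative log-probability of the correct word, which is where the link to maximum likelihood comes from.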

 \parinterval 除了交叉熵，很多系统也使用了面向评价的损失函数，比如，直接利用评价指标BLEU定义损失函数\upcite{DBLP:conf/acl/ShenCHHWSL16}。不过这类损失函数往往不可微分，因此无法直接获取梯度。这时可以引入强化学习技术，通过策略梯度等方法进行优化。不过这类方法需要采样等手段，这里不做重点讨论，相关内容会在后面技术部分进行介绍。

@@ -977,7 +976,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
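 \item The weight matrix $\mathbi{w}$ of the network is usually initialized with the Xavier method\upcite{pmlr-v9-glorot10a}, which helps stabilize training, especially for relatively ``deep'' networks. Let $d_{\textrm{in}}$ and $d_{\textrm{out}}$ denote the input and output dimensions of $\mathbi{w}$\footnote{For the transformation $\mathbi{y} = \mathbi{x} \mathbi{w}$ with a row vector $\mathbi{x}$, $\mathbi{w}$ has $d_{\textrm{in}}$ rows and $d_{\textrm{out}}$ columns.}; the method is then implemented as follows: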
 \begin{eqnarray}
 \mathbi{w} \sim U(-\sqrt{ \frac{6} { d_{\textrm{in}} + d_{\textrm{out}} } } , \sqrt{ \frac{6} { d_{\textrm{in}} + d_{\textrm{out}} } })
-\label{eq:10-32}
+\label{eq:10-27}
 \vspace{0.5em}
 \end{eqnarray}

@@ -991,7 +990,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \subsubsection{3. 优化策略}

 %\vspace{0.5em}
-\parinterval 公式\eqref{eq:10-30}展示了最基本的优化策略，也被称为标准的SGD优化器。实际上，训练神经机器翻译模型时，还有非常多的优化器可以选择，在{\chapternine}也有详细介绍，这里考虑Adam优化器\upcite{kingma2014adam}。 Adam 通过对梯度的{\small\bfnew{一阶矩估计}}\index{一阶矩估计}（First Moment Estimation）\index{First Moment Estimation}和{\small\bfnew{二阶矩估计}}\index{二阶矩估计}（Second Moment Estimation）\index{Second Moment Estimation}进行综合考虑，计算出更新步长。
+\parinterval 公式\eqref{eq:10-24}展示了最基本的优化策略，也被称为标准的SGD优化器。实际上，训练神经机器翻译模型时，还有非常多的优化器可以选择，在{\chapternine}也有详细介绍，这里考虑Adam优化器\upcite{kingma2014adam}。 Adam 通过对梯度的{\small\bfnew{一阶矩估计}}\index{一阶矩估计}（First Moment Estimation）\index{First Moment Estimation}和{\small\bfnew{二阶矩估计}}\index{二阶矩估计}（Second Moment Estimation）\index{Second Moment Estimation}进行综合考虑，计算出更新步长。

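 \parinterval In general, Adam converges relatively fast, and a single configuration works for most tasks; its results are rarely poor, but it seldom reaches the best achievable performance. SGD, in contrast, can reach the best results when tuned for each dataset, but it converges slowly. The optimizer should therefore be chosen according to the need: Adam is the better fit when preliminary results are needed quickly, and SGD is the better fit when the best possible result on one task is the goal.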

@@ -1006,7 +1005,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \vspace{-0.5em}
 \begin{eqnarray}
 \mathbi{w}' = \mathbi{w} \cdot \frac{\gamma} {\textrm{max}(\gamma,\| \mathbi{w} \|_2)}
-\label{eq:10-33}
+\label{eq:10-28}
 \end{eqnarray}
 %\vspace{0.5em}

@@ -1019,14 +1018,14 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \subsubsection{5. 学习率策略}
 \vspace{0.5em}

-\parinterval 在公式\eqref{eq:10-30}中， $\alpha$决定了每次参数更新时更新的步幅大小，称之为学习率。学习率作为基于梯度方法中的重要超参数，它决定目标函数能否收敛到较好的局部最优点以及收敛的速度。合理的学习率能够使模型快速、稳定地达到较好的状态。但是，如果学习率太小，收敛过程会很慢；而学习率太大，则模型的状态可能会出现震荡，很难达到稳定，甚至使模型无法收敛。图\ref{fig:10-28} 对比了不同学习率对优化过程的影响。
+\parinterval 在公式\eqref{eq:10-24}中， $\alpha$决定了每次参数更新时更新的步幅大小，称之为学习率。学习率作为基于梯度方法中的重要超参数，它决定目标函数能否收敛到较好的局部最优点以及收敛的速度。合理的学习率能够使模型快速、稳定地达到较好的状态。但是，如果学习率太小，收敛过程会很慢；而学习率太大，则模型的状态可能会出现震荡，很难达到稳定，甚至使模型无法收敛。图\ref{fig:10-25} 对比了不同学习率对优化过程的影响。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-convergence&lr}
 \caption{学习率过小（左） vs 学习率过大（右） }
-\label{fig:10-28}
+\label{fig:10-25}
 \end{figure}
 %----------------------------------------------

@@ -1034,22 +1033,22 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \vspace{0.5em}


-\parinterval 图\ref{fig:10-29}展示了一种常用的学习率调整策略。它分为两个阶段：预热阶段和衰减阶段。模型训练初期梯度通常很大，如果直接使用较大的学习率很容易让模型陷入局部最优。学习率的预热阶段便是通过在训练初期使学习率从小到大逐渐增加来减缓在初始阶段模型“跑偏”的现象。一般来说，初始学习率太高会使得模型进入一种损失函数曲面非常不平滑的区域，进而使得模型进入一种混乱状态，后续的优化过程很难取得很好的效果。一个常用的学习率预热方法是{\small\bfnew{逐渐预热}}\index{逐渐预热}（Gradual Warmup）\index{Gradual Warmup}。假设预热的更新次数为$N$，初始学习率为$\alpha_0$，则预热阶段第$\textrm{step}$次更新的学习率为：
+\parinterval 图\ref{fig:10-26}展示了一种常用的学习率调整策略。它分为两个阶段：预热阶段和衰减阶段。模型训练初期梯度通常很大，如果直接使用较大的学习率很容易让模型陷入局部最优。学习率的预热阶段便是通过在训练初期使学习率从小到大逐渐增加来减缓在初始阶段模型“跑偏”的现象。一般来说，初始学习率太高会使得模型进入一种损失函数曲面非常不平滑的区域，进而使得模型进入一种混乱状态，后续的优化过程很难取得很好的效果。一个常用的学习率预热方法是{\small\bfnew{逐渐预热}}\index{逐渐预热}（Gradual Warmup）\index{Gradual Warmup}。假设预热的更新次数为$N$，初始学习率为$\alpha_0$，则预热阶段第$\textrm{step}$次更新的学习率为：
 %\vspace{0.5em}
 \begin{eqnarray}
 \alpha_{\textrm{step}} = \frac{\textrm{step}}{N} \alpha_0 \quad,\quad 1 \leq \textrm{step} \leq N
-\label{eq:10-34}
+\label{eq:10-29}
 \end{eqnarray}
 %-------

-\noindent 另一方面，当模型训练逐渐接近收敛的时候，使用太大学习率会很容易让模型在局部最优解附近震荡，从而错过局部极小，因此需要通过减小学习率来调整更新的步长，以此来不断地逼近局部最优，这一阶段也称为学习率的衰减阶段。学习率衰减的方法有很多，比如指数衰减以及余弦衰减等，图\ref{fig:10-29}右侧展示的是{\small\bfnew{分段常数衰减}}\index{分段常数衰减}（Piecewise Constant Decay）\index{Piecewise Constant Decay}，即每经过$m$次更新，学习率衰减为原来的$\beta_m$（$\beta_m<1$）倍，其中$m$和$\beta_m$为经验设置的超参。
+\noindent 另一方面，当模型训练逐渐接近收敛的时候，使用太大学习率会很容易让模型在局部最优解附近震荡，从而错过局部极小，因此需要通过减小学习率来调整更新的步长，以此来不断地逼近局部最优，这一阶段也称为学习率的衰减阶段。学习率衰减的方法有很多，比如指数衰减以及余弦衰减等，图\ref{fig:10-26}右侧展示的是{\small\bfnew{分段常数衰减}}\index{分段常数衰减}（Piecewise Constant Decay）\index{Piecewise Constant Decay}，即每经过$m$次更新，学习率衰减为原来的$\beta_m$（$\beta_m<1$）倍，其中$m$和$\beta_m$为经验设置的超参。
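
The two phases can be combined into a single schedule function. This is a sketch under stated assumptions: the function name and its parameters (`warmup_steps` for $N$, `decay_every` for $m$, `beta` for $\beta_m$) are chosen here for illustration, steps are counted from 1, and decay is counted from the end of warmup.

```python
def learning_rate(step, alpha0, warmup_steps, decay_every, beta):
    """Gradual warmup followed by piecewise constant decay (step is 1-based)."""
    if step <= warmup_steps:
        # Warmup: grow linearly from alpha0 / N up to alpha0.
        return alpha0 * step / warmup_steps
    # Decay: multiply by beta once for every `decay_every` updates after warmup.
    n_decays = (step - warmup_steps) // decay_every
    return alpha0 * beta ** n_decays

schedule = [
    learning_rate(s, alpha0=1.0, warmup_steps=4, decay_every=3, beta=0.5)
    for s in range(1, 12)
]
```

In this toy setting the rate rises over the first four updates, stays at its peak briefly, and is then halved every three updates.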

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-relationship-between-learning-rate-and-number-of-updates}
 \caption{学习率与更新次数的变化关系}
-\label{fig:10-29}
+\label{fig:10-26}
 \end{figure}
 %----------------------------------------------

@@ -1081,19 +1080,19 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \begin{itemize}
 \vspace{0.5em}

-\item {\small\bfnew{数据并行}}。如果一台设备能完整放下一个神经机器翻译模型，那么数据并行可以把一个大批次均匀切分成$n$个小批次，然后分发到$n$个设备上并行计算，最后把结果汇总，相当于把运算时间变为原来的${1}/{n}$，数据并行的过程如图\ref{fig:10-30}所示。不过，需要注意的是，多设备并行需要对数据在不同设备间传输，特别是多个GPU的情况，设备间传输的带宽十分有限，设备间传输数据往往会造成额外的时间消耗\upcite{xiao2017fast}。通常，数据并行的训练速度无法随着设备数量增加呈线性增长。不过这个问题也有很多优秀的解决方案，比如采用多个设备的异步训练，但是这些内容已经超出本章的内容，因此这里不做过多讨论。
+\item {\small\bfnew{数据并行}}。如果一台设备能完整放下一个神经机器翻译模型，那么数据并行可以把一个大批次均匀切分成$n$个小批次，然后分发到$n$个设备上并行计算，最后把结果汇总，相当于把运算时间变为原来的${1}/{n}$，数据并行的过程如图\ref{fig:10-27}所示。不过，需要注意的是，多设备并行需要对数据在不同设备间传输，特别是多个GPU的情况，设备间传输的带宽十分有限，设备间传输数据往往会造成额外的时间消耗\upcite{xiao2017fast}。通常，数据并行的训练速度无法随着设备数量增加呈线性增长。不过这个问题也有很多优秀的解决方案，比如采用多个设备的异步训练，但是这些内容已经超出本章的内容，因此这里不做过多讨论。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-data-parallel-process}
 \caption{数据并行过程}
-\label{fig:10-30}
+\label{fig:10-27}
 \end{figure}
 %----------------------------------------------

 \vspace{0.5em}
-\item {\small\bfnew{模型并行}}\index{模型并行}。另一种思路是，把较大的模型分成若干小模型，之后在不同设备上训练小模型。对于循环神经网络，不同层的网络天然就是一个相对独立的模型，因此非常适合使用这种方法。比如，对于$l$层的循环神经网络，把每层都看做一个小模型，然后分发到$l$个设备上并行计算。在序列较长的时候，该方法使其运算时间变为原来的${1}/{l}$。图\ref{fig:10-31}以三层循环网络为例展示了对句子“你\ 很\ 不错\ 。”进行模型并行的过程。其中，每一层网络都被放到了一个设备上。当模型根据已经生成的第一个词“你”，并预测下一个词时（图\ref{fig:10-31}(a)），同层的下一个时刻的计算和对“你”的第二层的计算就可以同时开展（图\ref{fig:10-31}(b)）。以此类推，就完成了模型的并行计算。
+\item {\small\bfnew{模型并行}}\index{模型并行}。另一种思路是，把较大的模型分成若干小模型，之后在不同设备上训练小模型。对于循环神经网络，不同层的网络天然就是一个相对独立的模型，因此非常适合使用这种方法。比如，对于$l$层的循环神经网络，把每层都看做一个小模型，然后分发到$l$个设备上并行计算。在序列较长的时候，该方法使其运算时间变为原来的${1}/{l}$。图\ref{fig:10-28}以三层循环网络为例展示了对句子“你\ 很\ 不错\ 。”进行模型并行的过程。其中，每一层网络都被放到了一个设备上。当模型根据已经生成的第一个词“你”，并预测下一个词时（图\ref{fig:10-28}(a)），同层的下一个时刻的计算和对“你”的第二层的计算就可以同时开展（图\ref{fig:10-28}(b)）。以此类推，就完成了模型的并行计算。
 \vspace{0.5em}
 \end{itemize}
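
The idea behind data parallelism can be checked on a toy model. The sketch below assumes a hypothetical scalar model $y = w x$ with a mean squared error loss and a batch size divisible by the number of devices; the per-shard gradients are computed in a plain loop here, whereas on real hardware each would run on its own device.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the scalar model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_grad(w, batch, n_devices):
    """Split the batch evenly, compute one gradient per shard, then average."""
    size = len(batch) // n_devices
    shards = [batch[i * size:(i + 1) * size] for i in range(n_devices)]
    local = [grad_mse(w, shard) for shard in shards]  # one entry per device
    return sum(local) / n_devices                      # the gather/average step

batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = grad_mse(0.5, batch)
parallel = data_parallel_grad(0.5, batch, n_devices=2)
```

Because the shards have equal size, the averaged gradient matches the full-batch gradient, so the only cost of splitting is the communication needed to collect the results.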

@@ -1106,7 +1105,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 %\subfigure[]{\input{./Chapter10/Figures/figure-process05}}  &\subfigure[]{\input{./Chapter10/Figures/figure-process06}}\\
 \end{tabular}
 %\caption{一个三层循环神经网络的模型并行过程}
-%\label{fig:10-31}
+%\label{fig:10-28}
 \end{figure}
 %----------------------------------------------
 %-------------------------------------------
@@ -1118,7 +1117,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \subfigure[]{\input{./Chapter10/Figures/figure-process05}}  &\subfigure[]{\input{./Chapter10/Figures/figure-process06}}
 \end{tabular}
 \caption{一个三层循环神经网络的模型并行过程}
-\label{fig:10-31}
+\label{fig:10-28}
 \end{figure}
 %----------------------------------------------

@@ -1131,20 +1130,20 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \begin{eqnarray}
 \hat{\seq{{y}}} & = & \argmax_{\seq{{y}}} \funp{P}(\seq{{y}} | \seq{{x}}) \nonumber \\
                 & = & \argmax_{\seq{{y}}} \prod_{j=1}^n \funp{P}(y_j | \seq{{y}}_{<j},\seq{{x}})
-\label{eq:10-35}
+\label{eq:10-30}
 \end{eqnarray}

 \parinterval 在具体实现时，由于当前目标语言单词的生成需要依赖前面单词的生成，因此无法同时生成所有的目标语言单词。理论上，可以枚举所有的$\seq{{y}}$，之后利用$\funp{P}(\seq{{y}} | \seq{{x}})$ 的定义对每个$\seq{{y}}$进行评价，然后找出最好的$\seq{{y}}$。这也被称作{\small\bfnew{全搜索}}\index{全搜索}（Full Search）\index{Full Search}。但是，枚举所有的译文单词序列显然是不现实的。因此，在具体实现时，并不会访问所有可能的译文单词序列，而是用某种策略进行有效的搜索。常用的做法是自左向右逐词生成。比如，对于每一个目标语言位置$j$，可以执行
 \begin{eqnarray}
 \hat{y}_j = \argmax_{y_j} \funp{P}(y_j | \hat{\seq{{y}}}_{<j} , \seq{{x}})
-\label{eq:10-36}
+\label{eq:10-31}
 \end{eqnarray}

 \noindent 其中，$\hat{y}_j$表示位置$j$概率最高的单词，$\hat{\seq{{y}}}_{<j} = \{ \hat{y}_1,...,\hat{y}_{j-1} \}$表示已经生成的最优译文单词序列。也就是，把最优的译文看作是所有位置上最优单词的组合。显然，这是一种贪婪搜索，因为无法保证$\{ \hat{y}_1,...,\hat{y}_{n} \}$是全局最优解。一种缓解这个问题的方法是，在每步中引入更多的候选。这里定义$\hat{y}_{jk} $ 表示在目标语言第$j$个位置排名在第$k$位的单词。在每一个位置$j$，可以生成$k$个最可能的单词，而不是1个，这个过程可以被描述为
 \begin{eqnarray}
 \{ \hat{y}_{j1},...,\hat{y}_{jk} \} = \argmax_{ \{ \hat{y}_{j1},...,\hat{y}_{jk} \} }
 \funp{P}(y_j | \{ \hat{\seq{{y}}}_{<{j\ast}} \},\seq{{x}})
-\label{eq:10-37}
+\label{eq:10-32}
 \end{eqnarray}

 \noindent 这里，$\{ \hat{y}_{j1},...,\hat{y}_{jk} \}$表示对于位置$j$翻译概率最大的前$k$个单词，$\{ \hat{\seq{{y}}}_{<j\ast} \}$表示前$j-1$步top-$k$单词组成的所有历史。${\hat{\seq{{y}}}_{<j\ast}}$可以被看作是一个集合，里面每一个元素都是一个目标语言单词序列，这个序列是前面生成的一系列top-$k$单词的某种组成。$\funp{P}(y_j | \{ \hat{\seq{{y}}}_{<{j\ast}} \},\seq{{x}})$表示基于\{$ \hat{\seq{{y}}}_{<j\ast} $\}的某一条路径生成$y_j$的概率\footnote{严格来说，$ \funp{P} (y_j | {\hat{\seq{{y}}}_{<j\ast} })$不是一个准确的数学表达，这里通过这种写法强调$y_j$是由\{$ \hat{\seq{{y}}}_{<j\ast} $\}中的某个译文单词序列作为条件生成的。} 。这种方法也被称为束搜索，意思是搜索时始终考虑一个集束内的候选。
@@ -1157,7 +1156,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \vspace{1.0em}
 \subsubsection{1. 贪婪搜索}
 \vspace{0.6em}
-\parinterval 图\ref{fig:10-32}展示了一个基于贪婪方法的神经机器翻译解码过程。每一个时间步的单词预测都依赖于其前一步单词的生成。在解码第一个单词时，由于没有之前的单词信息，会用<sos>进行填充作为起始的单词，且会用一个零向量（可以理解为没有之前时间步的信息）表示第0步的中间层状态。
+\parinterval 图\ref{fig:10-29}展示了一个基于贪婪方法的神经机器翻译解码过程。每一个时间步的单词预测都依赖于其前一步单词的生成。在解码第一个单词时，由于没有之前的单词信息，会用<sos>进行填充作为起始的单词，且会用一个零向量（可以理解为没有之前时间步的信息）表示第0步的中间层状态。
 \vspace{0.8em}

 %----------------------------------------------
@@ -1165,12 +1164,12 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \centering
 \input{./Chapter10/Figures/figure-decoding-process-based-on-greedy-method}
 \caption{基于贪婪方法的解码过程}
-\label{fig:10-32}
+\label{fig:10-29}
 \end{figure}
 %----------------------------------------------

 \vspace{0.2em}
-\parinterval 解码端的每一步Softmax层会输出所有单词的概率，由于是基于贪心的方法，这里会选择概率最大（top-1）的单词作为输出。这个过程可以参考图\ref{fig:10-33}的内容。选择分布中概率最大的单词“Have”作为得到的第一个单词，并再次送入解码器，作为第二步的输入同时预测下一个单词。以此类推，直到生成句子的终止符为止，就得到了完整的译文。
+\parinterval 解码端的每一步Softmax层会输出所有单词的概率，由于是基于贪心的方法，这里会选择概率最大（top-1）的单词作为输出。这个过程可以参考图\ref{fig:10-30}的内容。选择分布中概率最大的单词“Have”作为得到的第一个单词，并再次送入解码器，作为第二步的输入同时预测下一个单词。以此类推，直到生成句子的终止符为止，就得到了完整的译文。

 \parinterval 贪婪搜索的优点在于速度快。在对翻译速度有较高要求的场景中，贪婪搜索是一种十分有效的系统加速方法。而且贪婪搜索的原理非常简单，易于快速实现。不过，由于每一步只保留一个最好的局部结果，贪婪搜索往往会带来翻译品质上的损失。
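
Greedy decoding can be sketched with a toy stand-in for the decoder. `TOY_MODEL` below is a hypothetical lookup table that conditions only on the previous word, which a real NMT decoder does not do; it is used only to make the loop runnable.

```python
# Hypothetical next-word distributions, indexed by the previous word.
TOY_MODEL = {
    "<sos>": {"have": 0.5, "has": 0.3, "it": 0.2},
    "have":  {"you": 0.6, "it": 0.3, "<eos>": 0.1},
    "has":   {"it": 0.9, "<eos>": 0.1},
    "it":    {"<eos>": 1.0},
    "you":   {"<eos>": 1.0},
}

def greedy_decode(model, max_len=10):
    """Pick the top-1 word at every step until the end symbol is produced."""
    prev, output = "<sos>", []
    for _ in range(max_len):
        dist = model[prev]
        word = max(dist, key=dist.get)   # top-1 choice
        if word == "<eos>":
            break
        output.append(word)
        prev = word                      # feed the choice back in as the next input
    return output

translation = greedy_decode(TOY_MODEL)
```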

@@ -1179,7 +1178,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \centering
 \input{./Chapter10/Figures/figure-decode-the-word-probability-distribution-at-the-first-position}
 \caption{解码第一个位置输出的单词概率分布（“Have”的概率最高）}
-\label{fig:10-33}
+\label{fig:10-30}
 \end{figure}
 %----------------------------------------------

@@ -1190,14 +1189,14 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \subsubsection{2. 束搜索}
 \vspace{0.5em}

-\parinterval 束搜索是一种启发式图搜索算法。相比于全搜索，它可以减少搜索所占用的空间和时间，在每一步扩展的时候，剪掉一些质量比较差的结点，保留下一些质量较高的结点。具体到机器翻译任务，对于每一个目标语言位置，束搜索选择了概率最大的前$K$个单词进行扩展（其中$k$叫做束宽度，或简称为束宽）。如图\ref{fig:10-34}所示，假设\{$y_1, y_2,..., y_n$\}表示生成的目标语言序列，且$k=3$，则束搜索的具体过程为：在预测第一个位置时，可以通过模型得到$y_1$的概率分布，选取概率最大的前3个单词作为候选结果（假设分别为“have”, “has”, “it”）。在预测第二个位置的单词时，模型针对已经得到的三个候选结果（“have”, “has”, “it”）计算第二个单词的概率分布。因为$y_2$对应$|V|$种可能，总共可以得到$3 \times |V|$种结果。然后从中选取使序列概率$\funp{P}(y_2,y_1| \seq{{x}})$最大的前三个$y_2$作为新的输出结果，这样便得到了前两个位置的top-3译文。在预测其他位置时也是如此，不断重复此过程直到推断结束。可以看到，束搜索的搜索空间大小与束宽度有关，也就是：束宽度越大，搜索空间越大，更有可能搜索到质量更高的译文，但同时搜索会更慢。束宽度等于3，意味着每次只考虑三个最有可能的结果，贪婪搜索实际上便是束宽度为1的情况。在神经机器翻译系统实现中，一般束宽度设置在4～8之间。
+\parinterval 束搜索是一种启发式图搜索算法。相比于全搜索，它可以减少搜索所占用的空间和时间，在每一步扩展的时候，剪掉一些质量比较差的结点，保留下一些质量较高的结点。具体到机器翻译任务，对于每一个目标语言位置，束搜索选择了概率最大的前$k$个单词进行扩展（其中$k$叫做束宽度，或简称为束宽）。如图\ref{fig:10-31}所示，假设\{$y_1, y_2,..., y_n$\}表示生成的目标语言序列，且$k=3$，则束搜索的具体过程为：在预测第一个位置时，可以通过模型得到$y_1$的概率分布，选取概率最大的前3个单词作为候选结果（假设分别为“have”, “has”, “it”）。在预测第二个位置的单词时，模型针对已经得到的三个候选结果（“have”, “has”, “it”）计算第二个单词的概率分布。因为$y_2$对应$|V|$种可能，总共可以得到$3 \times |V|$种结果。然后从中选取使序列概率$\funp{P}(y_2,y_1| \seq{{x}})$最大的前三个$y_2$作为新的输出结果，这样便得到了前两个位置的top-3译文。在预测其他位置时也是如此，不断重复此过程直到推断结束。可以看到，束搜索的搜索空间大小与束宽度有关，也就是：束宽度越大，搜索空间越大，更有可能搜索到质量更高的译文，但同时搜索会更慢。束宽度等于3，意味着每次只考虑三个最有可能的结果，贪婪搜索实际上便是束宽度为1的情况。在神经机器翻译系统实现中，一般束宽度设置在4～8之间。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter10/Figures/figure-beam-search-process}
 \caption{束搜索过程}
-\label{fig:10-34}
+\label{fig:10-31}
 \end{figure}
 %----------------------------------------------
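
A compact beam search sketch follows. It assumes a hypothetical table-based model (`TOY_MODEL`) that conditions only on the previous word, and it simplifies pruning so that finished hypotheses also take up beam slots. Scores are accumulated as log-probabilities.

```python
import math

# Hypothetical next-word distributions, indexed by the previous word.
TOY_MODEL = {
    "<sos>": {"have": 0.5, "has": 0.3, "it": 0.2},
    "have":  {"you": 0.4, "it": 0.35, "<eos>": 0.25},
    "has":   {"it": 0.9, "<eos>": 0.1},
    "it":    {"<eos>": 1.0},
    "you":   {"<eos>": 1.0},
}

def beam_search(model, k, max_len=10):
    """Keep the k best partial translations at each target position."""
    beams = [(0.0, ["<sos>"])]           # (log-probability, words so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, words in beams:        # expand every hypothesis in the beam
            for word, p in model[words[-1]].items():
                candidates.append((logp + math.log(p), words + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:k]:      # prune to the top k
            (finished if cand[1][-1] == "<eos>" else beams).append(cand)
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[0])
    words = [w for w in best[1] if w not in ("<sos>", "<eos>")]
    return words, math.exp(best[0])

greedy_words, greedy_prob = beam_search(TOY_MODEL, k=1)
beam_words, beam_prob = beam_search(TOY_MODEL, k=2)
```

With a beam width of 1 the procedure is greedy search and commits to "have" at the first position. With a width of 2 it keeps "has" alive and finds a sequence with higher overall probability.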

@@ -1222,13 +1221,13 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \parinterval 为了解决上面提到的问题，可以使用其他特征与$\textrm{log } \funp{P} (\seq{{y}} | \seq{{x}})$一起组成新的模型得分$\textrm{score} ( \seq{{y}} , \seq{{x}})$。针对模型倾向于生成短句子的问题，常用的做法是引入惩罚机制。比如，可以定义一个惩罚因子，形式如下：
 \begin{eqnarray}
 \textrm{lp}(\seq{{y}}) = \frac {(5+ |\seq{{y}}|)^{\alpha}} {(5+1)^{\alpha}}
-\label{eq:10-39}
+\label{eq:10-33}
 \end{eqnarray}

 \noindent 其中，$|\seq{{y}}|$代表已经得到的译文长度，$\alpha$是一个固定的常数，用于控制惩罚的强度。同时在计算句子得分时，额外引入表示覆盖度的因子，如下：
 \begin{eqnarray}
 \textrm{cp}(\seq{{y}} , \seq{{x}}) = \beta \cdot \sum_{i=1}^{|\seq{{x}}|} \textrm{log} \big(\textrm{min}(\sum_{j=1}^{|\seq{{y}}|} \alpha_{ij},1 ) \big)
-\label{eq:10-40}
+\label{eq:10-34}
 \end{eqnarray}

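 \noindent $\textrm{cp}(\cdot)$ penalizes translations that leave some source-language words insufficiently covered. The degree to which source word $i$ is covered is measured by $\sum_{j=1}^{|\seq{{y}}|} \alpha_{ij}$, and every word whose total attention falls below 1 contributes a negative term. $\beta$ is also a hyperparameter that has to be set empirically; it controls the strength of the coverage penalty.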
@@ -1236,7 +1235,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \parinterval 最终，模型得分定义如下：
 \begin{eqnarray}
 \textrm{score} ( \seq{{y}} , \seq{{x}}) = \frac{\textrm{log} \funp{P}(\seq{{y}} | \seq{{x}})} {\textrm{lp}(\seq{{y}})} + \textrm{cp}(\seq{{y}} , \seq{{x}})
-\label{eq:10-41}
+\label{eq:10-35}
 \end{eqnarray}

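 \noindent Clearly, the shorter the target-language sequence $\seq{{y}}$ is, the smaller $\textrm{lp}(\seq{{y}})$ becomes; since $\textrm{log } \funp{P}(\seq{{y}} | \seq{{x}})$ is negative, the sentence score $\textrm{score} ( \seq{{y}} , \seq{{x}})$ becomes smaller as well. In other words, the model penalizes translations that are too short. When coverage is low, that is, when some source words receive little attention, the score is lowered in the same way. This penalty mechanism makes the model score more reasonable and helps the model select translations of higher quality.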
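
The three formulas can be put together in a few lines. The sketch assumes that `attention[i][j]` holds $\alpha_{ij}$ and that every source word receives some non-zero attention; the default values of `alpha` and `beta` are arbitrary example settings, not recommendations.

```python
import math

def length_penalty(length, alpha):
    """lp(y) = (5 + |y|)^alpha / (5 + 1)^alpha."""
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

def coverage_penalty(attention, beta):
    """cp(y, x): attention[i][j] is the weight on source word i at target step j."""
    return beta * sum(math.log(min(sum(row), 1.0)) for row in attention)

def score(log_prob, length, attention, alpha=0.6, beta=0.2):
    """Length-normalised log-probability plus the coverage term."""
    return log_prob / length_penalty(length, alpha) + coverage_penalty(attention, beta)

covered = [[0.5, 0.5], [0.6, 0.6]]    # every source word has total attention >= 1
neglected = [[0.9, 0.9], [0.1, 0.1]]  # the second source word is barely attended
```

For the same log-probability, `score` ranks a longer hypothesis above a shorter one, and ranks `covered` above `neglected`.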
@@ -1245,7 +1244,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 %    NEW SECTION
 %----------------------------------------------------------------------------------------
 \sectionnewpage
-\section{小节及拓展阅读}
+\section{小结及拓展阅读}

 \parinterval 神经机器翻译是近几年的热门方向。无论是前沿性的技术探索，还是面向应用落地的系统研发，神经机器翻译已经成为当下最好的选择之一。研究人员对神经机器翻译的热情使得这个领域得到了快速的发展。本章作为神经机器翻译的入门章节，对神经机器翻译的建模思想和基础框架进行了描述。同时，对常用的神经机器翻译架构\ \dash \ 循环神经网络进行了讨论与分析。


--- a/Chapter11/Figures/figure-single-glu.tex
+++ b/Chapter11/Figures/figure-single-glu.tex
@@ -64,8 +64,8 @@ $\otimes$： & 按位乘运算 \\
 	\draw[-latex,thick] (c2.east) -- ([xshift=0.4cm]c2.east); 
 	
 	\node[inner sep=0pt, font=\tiny] at (0.75cm, -0.4cm) {$\mathbi{x}$};
-	\node[inner sep=0pt, font=\tiny] at ([yshift=-0.8cm]a.south) {$\mathbi{B}=\mathbi{x} * \mathbi{V} + \mathbi{b}_{\mathbi{W}}$};
-	\node[inner sep=0pt, font=\tiny] at ([yshift=-0.8cm]b.south) {$\mathbi{A}=\mathbi{x} * \mathbi{W} + \mathbi{b}_{\mathbi{V}}$};
+	\node[inner sep=0pt, font=\tiny] at ([yshift=-0.8cm]a.south) {$\mathbi{A}=\mathbi{x} * \mathbi{W} + \mathbi{b}_{\mathbi{W}}$};
+	\node[inner sep=0pt, font=\tiny] at ([yshift=-0.8cm]b.south) {$\mathbi{B}=\mathbi{x} * \mathbi{V} + \mathbi{b}_{\mathbi{V}}$};
 	\node[inner sep=0pt, font=\tiny] at (8.2cm, -0.4cm) {$\mathbi{y}=\mathbi{A} \otimes \sigma(\mathbi{B})$};
 	
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter11/chapter11.tex
+++ b/Chapter11/chapter11.tex
@@ -88,10 +88,10 @@
 \parinterval 在卷积计算中，不同深度下卷积核不同但是执行操作相同，这里以二维卷积核为例展示具体卷积计算。若设输入矩阵为$\mathbi{x}$，输出矩阵为$\mathbi{y}$，卷积滑动步幅为$\textrm{stride}$，卷积核为$\mathbi{w}$，且$\mathbi{w} \in \mathbb{R}^{Q \times U} $，那么卷积计算的公式为：
 \begin{eqnarray}
 \mathbi{y}_{i,j} = \sum \sum ( \mathbi{x}_{[i\times \textrm{stride}:i\times \textrm{stride}+Q-1,j\times \textrm{stride}:j\times \textrm{stride}+U-1]} \odot \mathbi{w} )
-\label{eq:11-1-new}
+\label{eq:11-1}
 \end{eqnarray}

-\noindent 其中$i$是输出矩阵的行下标，$j$是输出矩阵的列下标。图\ref{fig:11-4}展示了一个简单的卷积操作示例，其中$Q$为2，$U$为2，$\textrm{stride}$为1，根据公式\eqref{eq:11-1-new}，图中蓝色位置$\mathbi{y}_{0,0}$的计算为：
+\noindent 其中$i$是输出矩阵的行下标，$j$是输出矩阵的列下标，$\odot$表示矩阵点乘，具体见{\chapternine}。图\ref{fig:11-4}展示了一个简单的卷积操作示例，其中$Q$为2，$U$为2，$\textrm{stride}$为1，根据公式\eqref{eq:11-1}，图中蓝色位置$\mathbi{y}_{0,0}$的计算为：
 \begin{eqnarray}
 \mathbi{y}_{0,0} &=& \sum \sum ( \mathbi{x}_{[0\times 1:0\times 1+2-1,0\times 1:0\times 1+2-1]} \odot \mathbi{w}) \nonumber \\
 			 &=& \sum \sum ( \mathbi{x}_{[0:1,0:1]} \odot \mathbi{w} ) \nonumber \\
@@ -101,7 +101,7 @@
 \end{pmatrix} \nonumber \\
 			 &=& 0 \times 0 + 1 \times 1 + 3 \times 2 + 4 \times 3 \nonumber \\
 			 &=& 19
-\label{eq:11-2-new}
+\label{eq:11-2}
 \end{eqnarray}
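
The same computation can be written as a short function. The sketch takes $i$ as the row index and $j$ as the column index of the output, with a kernel of $Q$ rows and $U$ columns. The $3 \times 3$ input is an assumption chosen so that its top-left window matches the worked example.

```python
def conv2d(x, w, stride=1):
    """Slide the kernel over x and sum the element-wise products of each window."""
    q, u = len(w), len(w[0])                  # kernel: Q rows, U columns
    out_rows = (len(x) - q) // stride + 1
    out_cols = (len(x[0]) - u) // stride + 1
    y = [[0] * out_cols for _ in range(out_rows)]
    for i in range(out_rows):
        for j in range(out_cols):
            y[i][j] = sum(
                x[i * stride + a][j * stride + b] * w[a][b]
                for a in range(q)
                for b in range(u)
            )
    return y

x = [[0, 1, 2],
     [3, 4, 5],
     [6, 7, 8]]       # assumed input
w = [[0, 1],
     [2, 3]]          # 2 x 2 kernel
y = conv2d(x, w)      # y[0][0] is 0*0 + 1*1 + 3*2 + 4*3 = 19
```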

 \parinterval 卷积计算的作用是提取特征，用不同的卷积核计算可以获取不同的特征，比如图\ref{fig:11-5}，通过设计的特定卷积核就可以获取图像边缘信息。在卷积神经网络中，不需要手动设计卷积核，只需要指定卷积层中卷积核的数量及大小，模型就可以自己学习卷积核具体的参数。
@@ -307,20 +307,19 @@

 \parinterval 如图所示，形式上，卷积操作可以分成两部分，分别使用两个卷积核来得到两个卷积结果：
 \begin{eqnarray}
-\mathbi{A} & = & \mathbi{x} * \mathbi{W} + \mathbi{b}_\mathbi{W} \\
-\mathbi{B} & = & \mathbi{x} * \mathbi{V} + \mathbi{b}_\mathbi{V} \ \
-\label{eq:11-1}
+\mathbi{A} & = & \mathbi{x} * \mathbi{W} + \mathbi{b}_\mathbi{W} \label{eq:11-3} \\
+\mathbi{B} & = & \mathbi{x} * \mathbi{V} + \mathbi{b}_\mathbi{V} \label{eq:11-4}
 \end{eqnarray}

 \noindent 其中，$\mathbi{A},\mathbi{B}\in \mathbb{R}^d$，$\mathbi{W}\in \mathbb{R}^{K\times d \times d}$、$\mathbi{V}\in \mathbb{R}^{K\times d \times d}$、$\mathbi{b}_\mathbi{W}$，$\mathbi{b}_\mathbi{V} \in \mathbb{R}^d $，$\mathbi{W}$、$\mathbi{V}$在此表示卷积核，$\mathbi{b}_\mathbi{W}$，$\mathbi{b}_\mathbi{V}$为偏置矩阵。在卷积操作之后，引入非线性变换：
 \begin{eqnarray}
 \mathbi{y} & = & \mathbi{A} \otimes \sigma ( \mathbi{B} )
-\label{eq:11-2}
+\label{eq:11-5}
 \end{eqnarray}

 \noindent 其中，$\sigma$为Sigmoid函数，$\otimes$为按位乘运算。Sigmoid将$\mathbi{B}$映射为0-1范围内的实数，用来充当门控。可以看到，门控卷积神经网络中核心部分就是$\sigma ( \mathbi{B} )$，通过这个门控单元来对卷积输出进行控制，确定保留哪些信息。同时，在梯度反向传播的过程中，这种机制使得不同层之间存在线性的通道，梯度传导更加简单，利于深层网络的训练。这种思想和\ref{sec:11.2.3}节将要介绍的残差网络也很类似。
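
The gating step itself is a one-liner. The sketch below starts from two already computed convolution outputs $\mathbi{A}$ and $\mathbi{B}$ (made-up numbers) and applies only the non-linear part.

```python
import math

def sigmoid(v):
    """Map a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def glu(a, b):
    """y = A (element-wise product) sigmoid(B): B decides how much of A passes."""
    return [ai * sigmoid(bi) for ai, bi in zip(a, b)]

A = [2.0, -1.0, 4.0]     # assumed output of the first convolution
B = [0.0, 10.0, -10.0]   # assumed output of the second convolution
y = glu(A, B)
```

A gate value near 1 lets the corresponding entry of $\mathbi{A}$ through almost unchanged, a value near 0 suppresses it, and a gate input of 0 passes exactly half.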

-\parinterval 在ConvS2S模型中，为了保证卷积操作之后的序列长度不变，需要对输入进行填充，这一点已经在之前的章节中讨论过了。因此，在编码端每一次卷积操作前，需要对序列的头部和尾部分别做相应的填充（如图\ref{fig:11-14}左侧部分）。而在解码端中，由于需要训练和解码的一致性，模型在训练过程中不能使用未来的信息，需要对未来信息进行屏蔽，也就是屏蔽掉当前译文单词右侧的译文信息。从实践角度来看，只需要对解码端输入序列的头部填充$K-1$ 个空元素，其中$K$为卷积核的宽度（图\ref{fig:11-14-2}展示了卷积核宽度$K$=3时，解码端对输入序列的填充情况，图中三角形表示卷积操作）。
+\parinterval 在ConvS2S模型中，为了保证卷积操作之后的序列长度不变，需要对输入进行填充，这一点已经在之前的章节中讨论过了。因此，在编码端每一次卷积操作前，需要对序列的头部和尾部分别做相应的填充（如图\ref{fig:11-14}左侧部分）。而在解码端中，由于需要训练和解码的一致性，模型在训练过程中不能使用未来的信息，需要对未来信息进行屏蔽，也就是屏蔽掉当前译文单词右侧的译文信息。从实践角度来看，只需要对解码端输入序列的头部填充$K-1$ 个空元素，其中$K$为卷积核的宽度（图\ref{fig:11-15}展示了卷积核宽度$K$=3时，解码端对输入序列的填充情况，图中三角形表示卷积操作）。

 %----------------------------------------------
 % 图14-2.
@@ -328,7 +327,7 @@
 \centering
 \input{./Chapter11/Figures/figure-padding-method}
 \caption{解码端的填充方法}
-\label{fig:11-14-2}
+\label{fig:11-15}
 \end{figure}
 %----------------------------------------------
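
The effect of padding only the head of the decoder input can be verified with a one-dimensional toy convolution. The scalar sequence and kernel weights are made-up numbers, and `causal_conv1d` is a name introduced here for illustration.

```python
def causal_conv1d(seq, kernel, pad=0.0):
    """Prepend K-1 padding elements so that output i only sees inputs up to i."""
    k = len(kernel)
    padded = [pad] * (k - 1) + list(seq)
    return [
        sum(kernel[t] * padded[i + t] for t in range(k))
        for i in range(len(seq))
    ]

kernel = [0.5, 0.25, 0.25]                       # kernel width K = 3
out = causal_conv1d([1.0, 2.0, 3.0, 4.0], kernel)
changed = causal_conv1d([1.0, 2.0, 3.0, 99.0], kernel)
```

The output has the same length as the input, and changing the last input affects only the last output, so no position depends on words to its right.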

@@ -342,16 +341,16 @@
 \parinterval 残差连接是一种训练深层网络的技术，其内容在{\chapternine}已经进行了介绍，即在多层神经网络之间通过增加直接连接的方式，从而将底层信息直接传递给上层。通过增加这样的直接连接，可以让不同层之间的信息传递更加高效，有利于深层神经网络的训练，其计算公式为：
 \begin{eqnarray}
 \mathbi{h}^{l+1} = F (\mathbi{h}^l) + \mathbi{h}^l
-\label{eq:11-3}
+\label{eq:11-6}
 \end{eqnarray}

-\noindent 其中，$\mathbi{h}^l$表示$l$层神经网络的输入向量，${F} (\mathbi{h}^l)$是$l$层神经网络的运算。如果$l=2$，那么公式\eqref{eq:11-3}可以解释为：第3层的输入$\mathbi{h}^3$等于第2层的输出${F}(\mathbi{h}^2)$加上第2层的输入$\mathbi{h}^2$。
+\noindent 其中，$\mathbi{h}^l$表示$l$层神经网络的输入向量，${F} (\mathbi{h}^l)$是$l$层神经网络的运算。如果$l=2$，那么公式\eqref{eq:11-6}可以解释为：第3层的输入$\mathbi{h}^3$等于第2层的输出${F}(\mathbi{h}^2)$加上第2层的输入$\mathbi{h}^2$。

 \parinterval 在ConvS2S中残差连接主要应用于门控卷积神经网络和多跳自注意力机制中，比如在编码器的多层门控卷积神经网络中，在每一层的输入和输出之间增加残差连接，具体的数学描述如下：
 \begin{eqnarray}
 %\mathbi{h}_i^l = \funp{v} (\mathbi{W}^l [\mathbi{h}_{i-\frac{k}{2}}^{l-1},...,\mathbi{h}_{i+\frac{k}{2}}^{l-1}] + b_{\mathbi{W}}^l ) + \mathbi{h}_i^{l-1}
 \mathbi{h}^{l+1} = \mathbi{A}^{l} \otimes \sigma ( \mathbi{B}^{l} ) + \mathbi{h}^{l}
-\label{eq:11-4}
+\label{eq:11-7}
 \end{eqnarray}


@@ -376,37 +375,34 @@
 \parinterval 在基于循环神经网络的翻译模型中，注意力机制已经被广泛使用\upcite{bahdanau2014neural}，并用于避免循环神经网络将源语言序列压缩成一个固定维度的向量表示带来的信息损失。另一方面，注意力同样能够帮助解码端区分源语言中不同位置对当前目标语言位置的贡献度，其具体的计算过程如下：

 \begin{eqnarray}
-\mathbi{C}_j &=& \sum_i \alpha_{i,j} \mathbi{h}_i \\
-\alpha_{i,j} &=& \frac{ \textrm{exp}(\funp{a} (\mathbi{s}_{j-1},\mathbi{h}_i))  }{\sum_{i'} \textrm{exp}( \funp{a} (\mathbi{s}_{j-1},\mathbi{h}_{i'}))}
-\label{eq:11-5}
+\mathbi{C}_j &=& \sum_i \alpha_{i,j} \mathbi{h}_i \label{eq:11-8} \\
+\alpha_{i,j} &=& \frac{ \textrm{exp}(\funp{a} (\mathbi{s}_{j-1},\mathbi{h}_i))  }{\sum_{i'} \textrm{exp}( \funp{a} (\mathbi{s}_{j-1},\mathbi{h}_{i'}))} \label{eq:11-9}
 \end{eqnarray}

 \noindent 其中，$\mathbi{h}_i$表示源语端第$i$个位置的隐层状态，即编码器在第$i$个位置的输出。$\mathbi{s}_j$表示目标端第$j$个位置的隐层状态。给定$\mathbi{s}_j$和$\mathbi{h}_i$，注意力机制通过函数$\funp{a}(\cdot)$计算目标语言表示$\mathbi{s}_j$与源语言表示$\mathbi{h}_i$之间的注意力权重$\alpha_{i,j}$，通过加权平均得到当前目标端位置所需的上下文表示$\mathbi{C}_j$。其中$\funp{a}(\cdot)$的具体计算方式在{\chapterten}已经详细讨论。

 \parinterval 在ConvS2S模型中，解码器同样采用堆叠的多层门控卷积网络来对目标语言进行序列建模。区别于编码器，解码器在每一层卷积网络之后引入了注意力机制，用来参考源语言信息。ConvS2S选用了点乘注意力，并且通过类似残差连接的方式将注意力操作的输入与输出同时作用于下一层计算，称为多跳注意力。其具体计算方式如下：
-
 \begin{eqnarray}
 \alpha_{ij}^l = \frac{ \textrm{exp} (\mathbi{d}_{j}^l\mathbi{h}_i) }{\sum_{i^{'}=1}^m \textrm{exp} (\mathbi{d}_{j}^l\mathbi{h}_{i^{'}})}
-\label{eq:11-6-1}
+\label{eq:11-10}
 \end{eqnarray}

-\noindent 不同于公式\eqref{eq:11-5}中使用的目标语端隐层表示$\mathbi{s}_{j-1}$，公式\eqref{eq:11-6-1}中的$\mathbi{d}_{j}^l$同时结合了$\mathbi{s}_{j}$的卷积计算结果和目标语端的词嵌入$\mathbi{g}_j$，其具体计算公式如下：
+\noindent 不同于公式\eqref{eq:11-9}中使用的目标语端隐层表示$\mathbi{s}_{j-1}$，公式\eqref{eq:11-10}中的$\mathbi{d}_{j}^l$同时结合了$\mathbi{s}_{j}$的卷积计算结果和目标语端的词嵌入$\mathbi{g}_j$，其具体计算公式如下：
 \begin{eqnarray}
-\mathbi{d}_{j}^l &=& \mathbi{W}_{d}^{l} \mathbi{z}_{j}^{l} + \mathbi{b}_{d}^{l} + \mathbi{g}_j \\
-\mathbi{z}_j^l &=& \textrm{Conv}(\mathbi{s}_j^l) 
-\label{eq:11-6-2}
+\mathbi{d}_{j}^l &=& \mathbi{W}_{d}^{l} \mathbi{z}_{j}^{l} + \mathbi{b}_{d}^{l} + \mathbi{g}_j \label{eq:11-11} \\
+\mathbi{z}_j^l &=& \textrm{Conv}(\mathbi{s}_j^l) \label{eq:11-12}
 \end{eqnarray}

 \noindent 其中，$\mathbi{z}_j^l$表示第$l$层卷积网络输出中第$j$个位置的表示，$\mathbi{W}_{d}^{l}$和$\mathbi{b}_{d}^{l}$是模型可学习的参数，$\textrm{Conv}(\cdot)$表示卷积操作。在获得第$l$层的注意力权重之后，就可以得到对应的一个上下文表示$\mathbi{C}_j^l$：
 \begin{eqnarray}
 \mathbi{C}_j^l = \sum_i \alpha_{ij}^l (\mathbi{h}_i + \mathbi{e}_i)
-\label{eq:11-7}
+\label{eq:11-13}
 \end{eqnarray}

 \noindent 模型使用了更全面的源语言信息，同时考虑了源语言端编码表示$\mathbi{h}_i$以及词嵌入表示$\mathbi{e}_i$。在获得第$l$层的上下文向量$\mathbi{C}_j^l$后，模型将其与$\mathbi{z}_j^l$相加后送入下一层网络，这个过程可以被描述为：
 \begin{eqnarray}
 \mathbi{s}_j^{l+1} = \mathbi{C}_j^l + \mathbi{z}_j^l
-\label{eq:11-8}
+\label{eq:11-14}
 \end{eqnarray}

 \noindent 与循环网络中的注意力机制相比，该机制能够帮助模型甄别已经考虑了哪些先前的输入。也就是说，多跳的注意力机制会考虑模型之前更关注哪些单词，并且之后层中执行多次注意力的“跳跃”。
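
One hop of this attention can be sketched for a single target position. Vectors are plain Python lists, and the identity projection and all numeric values are assumptions for illustration only. The helper `attention_hop` follows the equations in order: the decoder summary $\mathbi{d}_j^l$, the weights $\alpha_{ij}^l$, the context $\mathbi{C}_j^l$ and the input to the next layer.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_hop(z_j, g_j, h, e, W_d, b_d):
    """One attention hop for target position j at a single decoder layer."""
    # d_j = W_d z_j + b_d + g_j: conv output combined with the target embedding
    d_j = add(add([dot(row, z_j) for row in W_d], b_d), g_j)
    # dot-product attention weights over the source positions
    alpha = softmax([dot(d_j, h_i) for h_i in h])
    # context: weighted sum of encoder output plus source embedding
    c_j = [0.0] * len(z_j)
    for a, h_i, e_i in zip(alpha, h, e):
        c_j = add(c_j, [a * v for v in add(h_i, e_i)])
    # residual-style combination that feeds the next layer
    return add(c_j, z_j), alpha

h = [[1.0, 0.0], [0.0, 1.0]]      # encoder outputs (assumed)
e = [[0.1, 0.1], [0.2, 0.2]]      # source embeddings (assumed)
W_d = [[1.0, 0.0], [0.0, 1.0]]    # identity projection (assumed)
s_next, alpha = attention_hop([1.0, 0.0], [0.0, 0.0], h, e, W_d, [0.0, 0.0])
```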
@@ -431,10 +427,9 @@
 \end{itemize}

 \parinterval Nesterov加速梯度下降法和{\chapternine}介绍的Momentum梯度下降法类似，都使用了历史梯度信息，首先回忆一下Momentum梯度下降法，公式如下：
-
 \begin{eqnarray}
-\mathbi{w}_{t+1} & = &  \mathbi{w}_t - \alpha \mathbi{v}_t \label{eq:11-9-update} \\
-\mathbi{v}_t & = & \beta \mathbi{v}_{t-1} + (1-\beta)\frac{\partial J(\mathbi{w}_t)}{\partial \mathbi{w}_t}  \label{eq:11-9-momentum}
+\mathbi{w}_{t+1} & = &  \mathbi{w}_t - \alpha \mathbi{v}_t \label{eq:11-15} \\
+\mathbi{v}_t & = & \beta \mathbi{v}_{t-1} + (1-\beta)\frac{\partial J(\mathbi{w}_t)}{\partial \mathbi{w}_t}  \label{eq:11-16}
 \end{eqnarray}

 \noindent 其中，$\mathbi{w}_t$表示第$t$步更新时的模型参数；$J(\mathbi{w}_t)$表示损失函数均值期望的估计；$\frac{\partial J(\mathbi{w}_t)}{\partial \mathbi{w}_t}$将指向$J(\mathbi{w}_t)$在$\mathbi{w}_t$处变化最大的方向，即梯度方向；$\alpha$ 为学习率；$\mathbi{v}_t$为损失函数在前$t-1$步更新中累积的梯度动量，利用超参数$\beta$控制累积的范围。
@@ -442,7 +437,7 @@
 \parinterval 而在Nesterov加速梯度下降法中，使用的梯度不是来自于当前参数位置，而是按照之前梯度方向更新一小步的位置，以便于更好地“预测未来”，提前调整更新速率，因此，其动量的更新方式为：
 \begin{eqnarray}
 \mathbi{v}_t & = & \beta \mathbi{v}_{t-1} + (1-\beta)\frac{\partial J(\mathbi{w}_t)}{\partial (\mathbi{w}_{t} -\alpha \beta \mathbi{v}_{t-1} )}
-\label{eq:11-10}
+\label{eq:11-17}
 \end{eqnarray}
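
The difference between the two update rules is where the gradient is evaluated. The sketch below assumes a hypothetical one-dimensional loss $J(w) = w^2$ and arbitrary hyperparameters; it does not claim that either method is faster in general.

```python
def grad(w):
    """Gradient of the toy loss J(w) = w^2 (an assumption for illustration)."""
    return 2.0 * w

def momentum_step(w, v_prev, alpha, beta):
    """Momentum: the gradient is taken at the current parameters."""
    v = beta * v_prev + (1 - beta) * grad(w)
    return w - alpha * v, v

def nesterov_step(w, v_prev, alpha, beta):
    """Nesterov: the gradient is taken at the look-ahead point."""
    lookahead = w - alpha * beta * v_prev
    v = beta * v_prev + (1 - beta) * grad(lookahead)
    return w - alpha * v, v

w_m, v_m = 5.0, 0.0
w_n, v_n = 5.0, 0.0
for _ in range(200):
    w_m, v_m = momentum_step(w_m, v_m, alpha=0.1, beta=0.9)
    w_n, v_n = nesterov_step(w_n, v_n, alpha=0.1, beta=0.9)
```

Both runs approach the minimum at 0 on this toy loss.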

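 \parinterval Nesterov accelerated gradient in effect exploits second-order derivative information, which lets it ``look ahead'' and speeds up convergence\upcite{Bengio2013AdvancesIO}. To keep training stable, the ConvS2S model also adopts several regularization and parameter initialization strategies, so that the variance stays as consistent as possible across the forward and backward computations.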
@@ -484,7 +479,7 @@
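 In standard convolution, let $N$ denote the number of kernels, that is, the number of channels of the output sequence. For the $n$-th channel at the $i$-th position, $ \mathbi{z}_{i,n}^\textrm{\,std}$, standard convolution is computed as follows: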
 \begin{eqnarray}
 \mathbi{z}_{i,n}^\textrm{\,std} = \sum_{o=1}^{O} \sum_{k=0}^{K-1} \mathbi{W}_{k,o,n}^\textrm{\,std} \mathbi{x}_{i+k,o}
-\label{eq:11-11}
+\label{eq:11-18}
 \end{eqnarray}

 %在标准卷积中，$ \mathbi{z}^\textrm{\,std}$表示标准卷积的输出，$ \mathbi{z}_i^\textrm{\,std} \in \mathbb{R}^N$ ，N为卷积核的个数，也就是标准卷积输出序列的通道数。针对$ \mathbi{z}_i^\textrm{\,std} $ 中的第$n$个通道$ \mathbi{z}_{i,n}^\textrm{\,std}$，标准卷积具体计算方式如下：
@@ -494,7 +489,7 @@
 \parinterval 相应的，深度卷积只考虑不同词之间的依赖性，而不考虑不同通道之间的关系，相当于使用$O$个卷积核逐个通道对不同的词进行卷积操作。因此深度卷积不改变输出的表示维度，输出序列表示的通道数与输入序列一致，其计算方式如下：
 \begin{eqnarray}
 \mathbi{z}_{i,o}^\textrm{\,dw} = \sum_{k=0}^{K-1} \mathbi{W}_{k,o}^\textrm{\,dw} \mathbi{x}_{i+k,o}
-\label{eq:11-12}
+\label{eq:11-19}
 \end{eqnarray}

 \noindent 其中，$\mathbi{z}^\textrm{\,dw}$表示深度卷积的输出，$\mathbi{z}_i^\textrm{\,dw} \in \mathbb{R}^{O}$ ，$\mathbi{W}^\textrm{\,dw} \in \mathbb{R}^{K \times O}$为深度卷积的参数，参数量只涉及卷积核大小及输入表示维度。
@@ -503,7 +498,7 @@
 \begin{eqnarray}
 \mathbi{z}_{i,n}^\textrm{\,pw} &=& \sum\limits_{o=1}^{O} \mathbi{x}_{i,o} \mathbi{W}_{o,n}^\textrm{\,pw} \nonumber \\
                      &=& \mathbi{x}_i \mathbi{W}^\textrm{\,pw}
-\label{eq:11-13}
+\label{eq:11-20}
 \end{eqnarray}

 \noindent 其中$\mathbi{z}^\textrm{\,pw}$表示逐点卷积的输出，$\mathbi{z}_{i}^\textrm{\,pw} \in  \mathbb{R}^{N}$，$\mathbi{W}^\textrm{\,pw} \in \mathbb{R}^{O \times N}$为逐点卷积的参数。
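
Depthwise and pointwise convolution can be sketched on nested lists. The example assumes no padding, so only full windows are computed, and all numbers are made up. It also compares the parameter counts with a standard convolution of the same shape.

```python
def depthwise_conv(x, w_dw):
    """x: positions x O channels, w_dw: K x O. Each channel is convolved separately."""
    K, O = len(w_dw), len(w_dw[0])
    n = len(x) - K + 1
    return [
        [sum(w_dw[k][o] * x[i + k][o] for k in range(K)) for o in range(O)]
        for i in range(n)
    ]

def pointwise_conv(x, w_pw):
    """x: positions x O, w_pw: O x N. Channels are mixed at each position."""
    O, N = len(w_pw), len(w_pw[0])
    return [
        [sum(row[o] * w_pw[o][n] for o in range(O)) for n in range(N)]
        for row in x
    ]

x = [[1, 10], [2, 20], [3, 30], [4, 40]]   # 4 positions, O = 2 channels
w_dw = [[1, 0], [1, 1]]                     # K = 2
w_pw = [[1, 0, 1], [0, 1, 1]]               # O = 2, N = 3
z_dw = depthwise_conv(x, w_dw)
z_pw = pointwise_conv(z_dw, w_pw)

K, O, N = 2, 2, 3
separable_params = K * O + O * N            # depthwise plus pointwise
standard_params = K * O * N                 # one K x O filter per output channel
```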
@@ -555,7 +550,7 @@
 \parinterval 此外，和标准卷积不同的是，轻量卷积之前需要先对卷积参数进行归一化，具体计算过程如下：
 \begin{eqnarray}
 \mathbi{z}_{i,o}^\textrm{\,lw} &=& \sum_{k=0}^{K-1} \textrm{Softmax}(\mathbi{W}^\textrm{\,lw})_{k,[\frac{oa}{d}]} \mathbi{x}_{i+k,o}
-\label{eq:11-14}
+\label{eq:11-21}
 \end{eqnarray}

 \noindent 其中，$\mathbi{z}^\textrm{\,lw}$表示轻量卷积的输出，$\mathbi{z}_i^\textrm{\,lw} \in \mathbb{R}^d $，$\mathbi{W}^\textrm{\,lw} \in \mathbb{R}^{K\times a}$为轻量卷积的参数。在这里，轻量卷积用来捕捉相邻词的特征，通过Softmax可以在保证关注到不同词的同时，对输出大小进行限制。
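
A minimal sketch of lightweight convolution follows. It assumes 0-based channel indices, so that channel $o$ uses head $\lfloor oa/d \rfloor$, and no padding. The weights are made-up numbers and are normalised with Softmax over the kernel width.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def lightweight_conv(x, w_lw):
    """x: positions x d channels, w_lw: K x a. Groups of d/a channels share weights."""
    K, a = len(w_lw), len(w_lw[0])
    d = len(x[0])
    # Normalise the K weights of each head so that they sum to 1.
    heads = [softmax([w_lw[k][h] for k in range(K)]) for h in range(a)]
    n = len(x) - K + 1
    return [
        [sum(heads[o * a // d][k] * x[i + k][o] for k in range(K)) for o in range(d)]
        for i in range(n)
    ]

x = [[0.0, 0.0, 0.0, 0.0],
     [2.0, 4.0, 6.0, 8.0],
     [4.0, 8.0, 12.0, 16.0]]       # 3 positions, d = 4 channels
w_lw = [[0.0, 0.0],
        [0.0, 1.0]]                # K = 2, a = 2 heads
z = lightweight_conv(x, w_lw)
```

Channels 0 and 1 share the first head, whose equal weights average neighbouring positions; channels 2 and 3 share the second head.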
@@ -570,7 +565,7 @@
 \parinterval 在轻量卷积中，模型使用的卷积参数是静态的，与序列位置无关， 维度大小为$K\times a$；而在动态卷积中，为了增强模型的表示能力，卷积参数来自于当前位置输入的变换，具体如下：
 \begin{eqnarray}
 \funp{f} (\mathbi{x}_{i}) = \sum_{c=1}^d \mathbi{W}_{:,:,c} \odot \mathbi{x}_{i,c}
-\label{eq:11-15}
+\label{eq:11-22}
 \end{eqnarray}

 \parinterval 这里采用了最简单的线性变换，其中$\odot$表示矩阵的点乘（详见第九章介绍），$d$为通道数，$\mathbi{x}_i$是序列第$i$个位置的表示，$c$表示某个通道，$\mathbi{W} \in \mathbb{R}^{K \times a \times d}$为变换矩阵，$\mathbi{W}_{:,:,c}$表示其只在$d$这一维进行计算，最后生成的$\funp{f} (\mathbi{x}_i)\in \mathbb{R}^{K \times a}$就是与输入相关的卷积核参数。通过这种方式，模型可以根据不同位置的表示来确定如何关注其他位置信息的“权重”，更好地提取序列信息。同时，相比于注意力机制中两两位置确定出来的注意力权重，动态卷积线性复杂度的做法具有更高的计算效率。
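
The kernel generation step can be sketched as a linear map from the current input to a $K \times a$ weight table. `W` is stored here as a nested list indexed `[k][h][c]`, and all numbers are assumptions for illustration.

```python
def dynamic_kernel(x_i, W):
    """f(x_i) = sum_c W[:, :, c] * x_i[c], a K x a kernel for position i."""
    K, a, d = len(W), len(W[0]), len(W[0][0])
    return [
        [sum(W[k][h][c] * x_i[c] for c in range(d)) for h in range(a)]
        for k in range(K)
    ]

# K = 2 kernel positions, a = 1 head, d = 2 channels
W = [[[1.0, 0.0]],
     [[0.0, 1.0]]]
kernel_a = dynamic_kernel([3.0, 5.0], W)
kernel_b = dynamic_kernel([1.0, 1.0], W)
```

Different positions produce different kernels. The cost per position depends on $K$, $a$ and $d$ but not on the sequence length, which is the source of the linear complexity mentioned above.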
@@ -579,7 +574,7 @@
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------

-\section{小节及拓展阅读}
+\section{小结及拓展阅读}

 \parinterval 卷积是一种高效的神经网络结构，在图像、语音处理等领域取得了令人瞩目的成绩。本章介绍了卷积的概念及其特性，并对池化、填充等操作进行了讨论。本章介绍了具有高并行计算能力的机器翻译范式，即基于卷积神经网络的编码器-解码器框架。其在机器翻译任务上表现出色，并大幅度缩短了模型的训练周期。除了基础部分，本章还针对卷积计算进行了延伸，内容涉及逐通道卷积、逐点卷积、轻量卷积和动态卷积等。除了上述提及的内容，卷积神经网络及其变种在文本分类、命名实体识别、关系分类、事件抽取等其他自然语言处理任务上也有许多应用\upcite{Kim2014ConvolutionalNN,2011Natural,DBLP:conf/cncl/ZhouZXQBX17,DBLP:conf/acl/ChenXLZ015,DBLP:conf/coling/ZengLLZZ14}。


--- a/Chapter12/Figures/figure-different-regularization-methods.tex
+++ b/Chapter12/Figures/figure-different-regularization-methods.tex
@@ -7,15 +7,15 @@
 \tikzstyle{standard} = [rounded corners=3pt]

 \node [lnode,anchor=west] (l1) at (0,0) {\scriptsize{子层}};
-\node [lnode,anchor=west] (l2) at ([xshift=3em]l1.east) {\scriptsize{层正则化}};
-\node [lnode,anchor=west] (l3) at ([xshift=4em]l2.east) {\scriptsize{层正则化}};
+\node [lnode,anchor=west] (l2) at ([xshift=3em]l1.east) {\scriptsize{层标准化}};
+\node [lnode,anchor=west] (l3) at ([xshift=4em]l2.east) {\scriptsize{层标准化}};
 \node [lnode,anchor=west] (l4) at ([xshift=1.5em]l3.east) {\scriptsize{子层}};

 \node [anchor=west] (plus1) at ([xshift=0.9em]l1.east) {\scriptsize{$\mathbf{\oplus}$}};
 \node [anchor=west] (plus2) at ([xshift=0.9em]l4.east) {\scriptsize{$\mathbf{\oplus}$}};

-\node [anchor=north] (label1) at ([xshift=3em,yshift=-0.5em]l1.south) {\scriptsize{(a) Post-regularization}};
-\node [anchor=north] (label2) at ([xshift=3em,yshift=-0.5em]l3.south) {\scriptsize{(b) Pre-regularization}};
+\node [anchor=north] (label1) at ([xshift=3em,yshift=-0.5em]l1.south) {\scriptsize{(a) Post-norm}};
+\node [anchor=north] (label2) at ([xshift=3em,yshift=-0.5em]l3.south) {\scriptsize{(b) Pre-norm}};

 \draw [->,thick] ([xshift=-1.5em]l1.west) -- ([xshift=-0.1em]l1.west);
 \draw [->,thick] ([xshift=0.1em]l1.east) -- ([xshift=0.2em]plus1.west);

--- a/Chapter12/chapter12.tex
+++ b/Chapter12/chapter12.tex
@@ -34,32 +34,32 @@
 \vspace{0.5em}
 \label{sec:12.1}

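-\parinterval First, recall how a recurrent neural network processes a word sequence. As shown in Figure \ref{fig:12-36}, for a word sequence $\{ w_1,...,w_m \}$, processing the $m$-th word $w_m$ (the green box) requires the information from the previous time step (that is, from processing $w_{m-1}$), and $w_{m-1}$ in turn depends on $w_{m-2}$, and so on. In other words, relating $w_m$ to $w_1$ takes $m-1$ steps of information passing. For long sequences, such long paths between words cause information to be lost along the way, and modeling in strict order also makes processing the sequence slow.
+\parinterval First, recall how a recurrent neural network processes a word sequence. As shown in Figure \ref{fig:12-1}, for a word sequence $\{ w_1,...,w_m \}$, processing the $m$-th word $w_m$ (the green box) requires the information from the previous time step (that is, from processing $w_{m-1}$), and $w_{m-1}$ in turn depends on $w_{m-2}$, and so on. In other words, relating $w_m$ to $w_1$ takes $m-1$ steps of information passing. For long sequences, such long paths between words cause information to be lost along the way, and modeling in strict order also makes processing the sequence slow.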

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter12/Figures/figure-dependencies-between-words-in-a-recurrent-neural-network}
 \caption{Dependencies between words in a recurrent neural network}
-\label{fig:12-36}
+\label{fig:12-1}
 \end{figure}
 %----------------------------------------------

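-\parinterval Can we drop this sequential passing of information and model the relation between words at different positions directly, reducing the information-passing distance to 1? The self-attention mechanism effectively solves this problem\upcite{DBLP:journals/corr/LinFSYXZB17}. Figure \ref{fig:12-37} shows an example of modeling a sequence with self-attention. For the word $w_m$, self-attention directly relates it to the preceding $m-1$ words, so the distance between $w_m$ and every other word in the sequence is 1. This handles long-distance dependencies well, and because the connections between words are computed independently of one another, it also greatly increases the model's parallelism.
+\parinterval Can we drop this sequential passing of information and model the relation between words at different positions directly, reducing the information-passing distance to 1? The self-attention mechanism effectively solves this problem\upcite{DBLP:journals/corr/LinFSYXZB17}. Figure \ref{fig:12-2} shows an example of modeling a sequence with self-attention. For the word $w_m$, self-attention directly relates it to the preceding $m-1$ words, so the distance between $w_m$ and every other word in the sequence is 1. This handles long-distance dependencies well, and because the connections between words are computed independently of one another, it also greatly increases the model's parallelism.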

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter12/Figures/figure-dependencies-between-words-of-attention}
 \caption{Dependencies between words in the self-attention mechanism}
-\label{fig:12-37}
+\label{fig:12-2}
 \end{figure}
 %----------------------------------------------

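 \parinterval Self-attention can also be viewed as a sequence representation model. For example, for each target position $j$, it produces a corresponding representation of the source sentence of the form: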
 \begin{eqnarray}
 \mathbi{C}_j & = & \sum_i \alpha_{i,j}\mathbi{h}_i
-\label{eq:12-4201}
+\label{eq:12-1}
 \end{eqnarray}

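 \noindent where $\mathbi{h}_i$ is the representation at each position of the source sentence and $\alpha_{i,j}$ is the attention weight of target position $j$ on $\mathbi{h}_i$. Taking the source sentence as an example, self-attention treats the representation $\mathbi{h}_i$ at each position as the $\mathrm{query}$ and the representations at all positions as the $\mathrm{key}$ and $\mathrm{value}$. The model computes how well the current position matches every position, which gives the attention weights described for the attention mechanism, and uses them to take a weighted sum of the $\mathrm{value}$ at each position. The result can be seen as an abstract representation of the current position within the sentence. This process can be stacked several times to form a multi-layer attention model that builds deeper representations of every position in the input sequence.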
@@ -69,16 +69,16 @@
 \centering
 \input{./Chapter12/Figures/figure-example-of-self-attention-mechanism-calculation}
 \caption{An example of self-attention computation}
-\label{fig:12-38}
+\label{fig:12-3}
 \end{figure}
 %----------------------------------------------

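-\parinterval As an example, Figure \ref{fig:12-38} shows a Chinese sentence of five words. Here $h$(他) denotes the current representation of ``他'' (``he''), where $h(\cdot)$ is a function that returns the representation (a vector) at the position of the input word. If ``他'' is the target, the $\mathrm{query}$ is $h$(他), and the $\mathrm{key}$ and $\mathrm{value}$ are the representations at all positions in the figure, namely {$h$(他), $h$(什么), $h$(也), $h$(没), $h$(学)}. The self-attention model first computes the relevance between the $\mathrm{query}$ and each $\mathrm{key}$; let $\alpha_i$ denote the relevance between $h$(他) and the representation at position $i$. It then uses $\alpha_i$ as weights to take a weighted sum of the $\mathrm{value}$ at each position, which yields the new representation $\tilde{h}$(他):
+\parinterval As an example, Figure \ref{fig:12-3} shows a Chinese sentence of five words. Here $h$(他) denotes the current representation of ``他'' (``he''), where $h(\cdot)$ is a function that returns the representation (a vector) at the position of the input word. If ``他'' is the target, the $\mathrm{query}$ is $h$(他), and the $\mathrm{key}$ and $\mathrm{value}$ are the representations at all positions in the figure, namely {$h$(他), $h$(什么), $h$(也), $h$(没), $h$(学)}. The self-attention model first computes the relevance between the $\mathrm{query}$ and each $\mathrm{key}$; let $\alpha_i$ denote the relevance between $h$(他) and the representation at position $i$. It then uses $\alpha_i$ as weights to take a weighted sum of the $\mathrm{value}$ at each position, which yields the new representation $\tilde{h}$(他):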

 \begin{eqnarray}
 \tilde{h} (\textrm{他} ) & = & \alpha_1 {h} (\textrm{他} ) + \alpha_2 {h} (\textrm{什么}) + \alpha_3 {h} (\textrm{也} ) + \nonumber \\
                         &   & \alpha_4 {h} (\textrm{没} ) +\alpha_5 {h} (\textrm{学} )
-\label{eq:12-42}
+\label{eq:12-2}
 \end{eqnarray}


@@ -102,13 +102,13 @@

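 \parinterval First, recall the recurrent neural network introduced in {\chapterten}. Powerful as it is, it has drawbacks. The most prominent is that every recurrent unit depends on the one before it: processing the current time step requires the result of the previous time step. This property lets the ``history'' of the sequence be passed along, but it also lowers the model's efficiency. In natural language processing in particular, sequences are often long, and both the vanilla RNN and the more complex LSTM need many recurrent steps to capture long-distance dependencies between words. Because so many recurrent units are involved, passing information between two distant words becomes complicated.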

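-\parinterval To address these problems, researchers proposed a new model$\ \dash\ $Transformer\index{Transformer}\upcite{vaswani2017attention}. Unlike recurrent neural networks and other conventional models, the Transformer uses only self-attention and standard feed-forward networks, with no recurrent units or convolutions at all. The advantage of self-attention is that it directly models the relation between any two units in a sequence, which makes problems such as long-distance dependency easier to handle. Self-attention is also well suited to parallelization on GPUs, so the model trains faster. Table \ref{tab:12-11} compares the per-layer complexity of RNN, CNN, and Transformer\footnote{The number of sequential operations is the number of operations that must be performed one after another to process a sequence; since the Transformer and CNN can compute all positions in parallel, it is 1. Path length is the distance in the network between any two words of the sequence.}.
+\parinterval To address these problems, researchers proposed a new model$\ \dash\ $Transformer\index{Transformer}\upcite{vaswani2017attention}. Unlike recurrent neural networks and other conventional models, the Transformer uses only self-attention and standard feed-forward networks, with no recurrent units or convolutions at all. The advantage of self-attention is that it directly models the relation between any two units in a sequence, which makes problems such as long-distance dependency easier to handle. Self-attention is also well suited to parallelization on GPUs, so the model trains faster. Table \ref{tab:12-1} compares the per-layer complexity of RNN, CNN, and Transformer\footnote{The number of sequential operations is the number of operations that must be performed one after another to process a sequence; since the Transformer and CNN can compute all positions in parallel, it is 1. Path length is the distance in the network between any two words of the sequence.}.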

 %----------------------------------------------
 \begin{table}[htp]
 \centering
 \caption{Per-layer complexity of RNN, CNN, and Transformer\upcite{vaswani2017attention} ($n$ is the sequence length, $d$ the hidden size, and $k$ the kernel size)}
-\label{tab:12-11}
+\label{tab:12-1}
 \begin{tabular}{c | c c c c}
 \rule{0pt}{20pt} Model & Layer type & \begin{tabular}[l]{@{}l@{}}Complexity\end{tabular} & \begin{tabular}[l]{@{}l@{}}Min. sequential \\ operations\end{tabular} & \begin{tabular}[l]{@{}l@{}}Max. path\\ length\end{tabular} \\ \hline
 \rule{0pt}{13pt} Transformer & Self-attention &$O(n^2\cdot d)$ &$O(1)$ &$O(1)$       \\
@@ -118,13 +118,13 @@
 \end{table}
 %----------------------------------------------

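-\parinterval After it was proposed, the Transformer quickly swept through the whole field of natural language processing. The Transformer can also be regarded as a representation model, so it is widely used in other areas of natural language processing, and it even appears in image processing\upcite{DBLP:journals/corr/abs-1802-05751} and speech processing\upcite{DBLP:conf/icassp/DongXX18,DBLP:conf/interspeech/GulatiQCPZYHWZW20}. For instance, popular pre-trained models such as BERT are built on the Transformer. Table \ref{tab:12-12} shows the performance of the Transformer on the WMT English-German and English-French translation tasks. It achieves better translation quality than other models with less computation (FLOPs)\footnote{FLOPs = floating-point operations, that is, the total number of floating-point operations, a common measure of the amount of computation. It should not be confused with FLOPS (floating-point operations per second), which measures hardware speed.}.
+\parinterval After it was proposed, the Transformer quickly swept through the whole field of natural language processing. The Transformer can also be regarded as a representation model, so it is widely used in other areas of natural language processing, and it even appears in image processing\upcite{DBLP:journals/corr/abs-1802-05751} and speech processing\upcite{DBLP:conf/icassp/DongXX18,DBLP:conf/interspeech/GulatiQCPZYHWZW20}. For instance, popular pre-trained models such as BERT are built on the Transformer. Table \ref{tab:12-2} shows the performance of the Transformer on the WMT English-German and English-French translation tasks. It achieves better translation quality than other models with less computation (FLOPs)\footnote{FLOPs = floating-point operations, that is, the total number of floating-point operations, a common measure of the amount of computation. It should not be confused with FLOPS (floating-point operations per second), which measures hardware speed.}.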

 %----------------------------------------------
 \begin{table}[htp]
 \centering
 \caption{Performance of different translation models\upcite{vaswani2017attention}}
-\label{tab:12-12}
+\label{tab:12-2}
 \begin{tabular}{l l l l}
 \multicolumn{1}{l|}{\multirow{2}{*}{System}} & \multicolumn{2}{c}{BLEU[\%]} & \multirow{2}{*}{\parbox{6em}{Training cost (FLOPs)}} \\
 \multicolumn{1}{l|}{}                    & EN-DE  & EN-FR  &                                       \\ \hline
@@ -149,11 +149,11 @@
 \centering
 \input{./Chapter12/Figures/figure-transformer}
 \caption{Transformer architecture}
-\label{fig:12-39}
+\label{fig:12-4}
 \end{figure}
 %----------------------------------------------

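-\parinterval Figure \ref{fig:12-39} shows the architecture of the Transformer. The encoder consists of several layers (the green dashed box marks one layer). The input of each layer is a sequence of vectors and the output is a sequence of vectors of the same size; the role of a Transformer layer is to abstract the input further and produce a new representation. A layer here is not a single neural network structure, however. It is made up of several modules, including:
+\parinterval Figure \ref{fig:12-4} shows the architecture of the Transformer. The encoder consists of several layers (the green dashed box marks one layer). The input of each layer is a sequence of vectors and the output is a sequence of vectors of the same size; the role of a Transformer layer is to abstract the input further and produce a new representation. A layer here is not a single neural network structure, however. It is made up of several modules, including: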

 \begin{itemize}
 \vspace{0.5em}
@@ -163,28 +163,28 @@
 \vspace{0.5em}
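 \item {\small\sffamily\bfseries{Residual connection}} (marked ``Add''): each self-attention sub-layer and feed-forward sub-layer has an extra connection straight from its input to its output, that is, a shortcut across the sub-layer. Residual connections make information flow more effectively through deep networks;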
 \vspace{0.5em}
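-\item {\small\sffamily\bfseries{Layer regularization}}\index{Layer regularization} (Layer Normalization): before the self-attention and feed-forward sub-layers produce their final output, layer regularization is applied to the output vector to constrain the range of its values, which makes later processing easier.
+\item {\small\sffamily\bfseries{Layer normalization}}\index{Layer normalization} (Layer Normalization): before the self-attention and feed-forward sub-layers produce their final output, layer normalization is applied to the output vector to constrain the range of its values, which makes later processing easier.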
 \vspace{0.5em}
 \end{itemize}

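 \parinterval These operations make up one Transformer layer. The modules are executed in the order: Self-Attention $\to$ Residual Connection $\to$ Layer Normalization $\to$ Feed Forward Network $\to$ Residual Connection $\to$ Layer Normalization. The encoder can contain several such layers; for example, a six-layer encoder can be built in which every layer performs the operations above. The output of the top layer is the result of encoding and is passed to the decoder.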

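-\parinterval The decoder is structured much like the encoder. It also consists of several layers, each containing all the components of an encoder layer: the self-attention sub-layer, the feed-forward sub-layer, residual connections, and layer regularization. In addition, to capture information from the source language, the decoder introduces an extra {\small\sffamily\bfseries{encoder-decoder attention sub-layer}}\index{Encoder-Decoder Attention Sub-layer}. This sub-layer lets the model use the representation of the source sentence when producing the representation at each target position. The encoder-decoder attention sub-layer is built on the same attention mechanism, so its structure is identical to that of the self-attention sub-layer; only the definitions of $\mathrm{query}$, $\mathrm{key}$, and $\mathrm{value}$ differ. On the decoder side, for example, the $\mathrm{query}$, $\mathrm{key}$, and $\mathrm{value}$ of the self-attention sub-layer are the same: all equal the representation at each decoder position. In the encoder-decoder attention sub-layer, the $\mathrm{query}$ is the representation at each decoder position, while the $\mathrm{key}$ and $\mathrm{value}$ are the same and equal the representation at each encoder position. Figure \ref{fig:12-40} illustrates the difference between the inputs of the two attention sub-layers.
+\parinterval The decoder is structured much like the encoder. It also consists of several layers, each containing all the components of an encoder layer: the self-attention sub-layer, the feed-forward sub-layer, residual connections, and layer normalization. In addition, to capture information from the source language, the decoder introduces an extra {\small\sffamily\bfseries{encoder-decoder attention sub-layer}}\index{Encoder-Decoder Attention Sub-layer}. This sub-layer lets the model use the representation of the source sentence when producing the representation at each target position. The encoder-decoder attention sub-layer is built on the same attention mechanism, so its structure is identical to that of the self-attention sub-layer; only the definitions of $\mathrm{query}$, $\mathrm{key}$, and $\mathrm{value}$ differ. On the decoder side, for example, the $\mathrm{query}$, $\mathrm{key}$, and $\mathrm{value}$ of the self-attention sub-layer are the same: all equal the representation at each decoder position. In the encoder-decoder attention sub-layer, the $\mathrm{query}$ is the representation at each decoder position, while the $\mathrm{key}$ and $\mathrm{value}$ are the same and equal the representation at each encoder position. Figure \ref{fig:12-5} illustrates the difference between the inputs of the two attention sub-layers.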

 %----------------------------------------------
 \begin{figure}[htp]
    \centering
   \input{./Chapter12/Figures/figure-self-att-vs-enco-deco-att}
     \caption{Inputs of the attention model (self-attention sub-layer vs. encoder-decoder attention sub-layer)}
-    \label{fig:12-40}
+    \label{fig:12-5}
 \end{figure}
 %----------------------------------------------

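 \parinterval In addition, both the encoder and the decoder take a word sequence as input. The encoder's input sequence is there to be represented, so that the decoder can access all the information of the source sentence through the encoder. The decoder's input sequence drives generation of the target language; in essence it works like a language model, outputting the $n$-th word given the first $n-1$ words. Besides the word embeddings of the input sequence, the Transformer also introduces positional embeddings to represent the position of each word. The reason is that self-attention does not represent position explicitly and therefore cannot take word order into account. Adding position information to the input lets self-attention perceive each word's position indirectly, which keeps the sequence representation sound. Finally, the output of the whole model is produced by a Softmax layer, exactly as in the output layer of a recurrent neural network.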

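-\parinterval Before going into more detail, Figure \ref{fig:12-39} gives a quick view of how the Transformer translates. First, the Transformer takes as input the word embeddings of the source sentence ``我/很/好'' combined with the positional encoding. The encoder then abstracts the source sentence layer by layer, producing a source representation rich in context, and passes it to the decoder. Each decoder layer uses its self-attention sub-layer to process the decoder-side representation and then uses the encoder-decoder attention sub-layer to incorporate the representation of the source sentence. In this way the target word sequence is generated one word at a time. The input at each decoder position is the current word (for example, ``I''), and the output at that position is the next word (for example, ``am''), exactly as in a standard neural language model.
+\parinterval Before going into more detail, Figure \ref{fig:12-4} gives a quick view of how the Transformer translates. First, the Transformer takes as input the word embeddings of the source sentence ``我/很/好'' combined with the positional encoding. The encoder then abstracts the source sentence layer by layer, producing a source representation rich in context, and passes it to the decoder. Each decoder layer uses its self-attention sub-layer to process the decoder-side representation and then uses the encoder-decoder attention sub-layer to incorporate the representation of the source sentence. In this way the target word sequence is generated one word at a time. The input at each decoder position is the current word (for example, ``I''), and the output at that position is the next word (for example, ``am''), exactly as in a standard neural language model.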

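-\parinterval Of course, many questions may remain at this point. What is positional encoding? How exactly is self-attention computed in the Transformer, and what is its structure? What is layer regularization? The following sections address these one by one.
+\parinterval Of course, many questions may remain at this point. What is positional encoding? How exactly is self-attention computed in the Transformer, and what is its structure? What is layer normalization? The following sections address these one by one.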

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -192,59 +192,59 @@

 \section{Positional Encoding}

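-\parinterval When a recurrent neural network extracts information from a sequence, the computation at each time step depends on the output of the previous step, so it is inherently sequential, which matches the ordered nature of language. When self-attention processes the source and target sequences, however, it models the current position against any position in the sequence directly and ignores the order of the words. For example, for the two sentences with different meanings in Figure \ref{fig:12-41}, the representation $\tilde{h}$(机票) obtained through self-attention is the same.
+\parinterval When a recurrent neural network extracts information from a sequence, the computation at each time step depends on the output of the previous step, so it is inherently sequential, which matches the ordered nature of language. When self-attention processes the source and target sequences, however, it models the current position against any position in the sequence directly and ignores the order of the words. For example, for the two sentences with different meanings in Figure \ref{fig:12-6}, the representation $\tilde{h}$(机票) obtained through self-attention is the same.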

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter12/Figures/figure-calculation-of-context-vector-c}
 \caption{Computing the further abstracted representation $\tilde{\mathbi{h}}$ of ``机票'' (``air ticket'')}
-\label{fig:12-41}
+\label{fig:12-6}
 \end{figure}
 %----------------------------------------------

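-\parinterval To solve this problem, the Transformer adds a positional encoding to the original word-vector input to represent the order of the words. Figure \ref{fig:12-42} shows where positional encoding sits in the Transformer architecture; it is an important factor in the Transformer's success.
+\parinterval To solve this problem, the Transformer adds a positional encoding to the original word-vector input to represent the order of the words. Figure \ref{fig:12-7} shows where positional encoding sits in the Transformer architecture; it is an important factor in the Transformer's success.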

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter12/Figures/figure-transformer-input-and-position-encoding}
 \caption{Transformer input and positional encoding}
-\label{fig:12-42}
+\label{fig:12-7}
 \end{figure}
 %----------------------------------------------

 \parinterval There are many ways to compute a positional encoding. The Transformer uses sine and cosine functions of different frequencies:
 \begin{eqnarray}
-\textrm{PE}(\textrm{pos},2i) & = & \textrm{sin} (\frac{\textrm{pos}}{10000^{2i/d_{\textrm{model}}}}) \label{eq:12-43} \\
-\textrm{PE}(\textrm{pos},2i+1) & = & \textrm{cos} (\frac{\textrm{pos}}{10000^{2i/d_{\textrm{model}}}}) \label{eq:12-44}
+\textrm{PE}(\textrm{pos},2i) & = & \textrm{sin} (\frac{\textrm{pos}}{10000^{2i/d_{\textrm{model}}}}) \label{eq:12-3} \\
+\textrm{PE}(\textrm{pos},2i+1) & = & \textrm{cos} (\frac{\textrm{pos}}{10000^{2i/d_{\textrm{model}}}}) \label{eq:12-4}
 \end{eqnarray}

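-\noindent Here PE($\cdot$) is the positional encoding function, $\textrm{pos}$ is the position of the word, $i$ indexes a sine/cosine pair so that $2i$ and $2i+1$ are dimensions of the positional encoding vector, and $d_{\textrm{model}}$ is a basic parameter of the Transformer, the hidden size at each position. Because sine and cosine each account for half of the dimensions, $i$ ranges over 0-255 when the positional encoding has 512 dimensions. In the Transformer, the positional encoding has the same dimension as the word embedding (both are $d_{\textrm{model}}$), and the model adds the two to form its input, as shown in Figure \ref{fig:12-43}.
+\noindent Here PE($\cdot$) is the positional encoding function, $\textrm{pos}$ is the position of the word, $i$ indexes a sine/cosine pair so that $2i$ and $2i+1$ are dimensions of the positional encoding vector, and $d_{\textrm{model}}$ is a basic parameter of the Transformer, the hidden size at each position. Because sine and cosine each account for half of the dimensions, $i$ ranges over 0-255 when the positional encoding has 512 dimensions. In the Transformer, the positional encoding has the same dimension as the word embedding (both are $d_{\textrm{model}}$), and the model adds the two to form its input, as shown in Figure \ref{fig:12-8}.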
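
 \parinterval The two formulas translate almost directly into code. The sketch below is a minimal illustration, assuming an even $d_{\textrm{model}}$; the function name \texttt{positional\_encoding} is a hypothetical choice, not an identifier from any particular toolkit.

```python
import math


def positional_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding of position `pos`.

    Assumes d_model is even, so sine and cosine each fill half the dimensions.
    """
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)      # even dimensions use sine
        pe[2 * i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe


pe0 = positional_encoding(0, 8)  # position 0: sin(0)=0 and cos(0)=1 in every pair
pe3 = positional_encoding(3, 8)
```

 \parinterval The model input at a position would then be the element-wise sum of this vector and the word embedding at that position.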

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter12/Figures/figure-a-combination-of-position-encoding-and-word-encoding}
 \caption{Combining positional encoding and word embedding}
-\label{fig:12-43}
+\label{fig:12-8}
 \end{figure}
 %----------------------------------------------

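 \parinterval Why does this way of computing represent position well? There are several reasons. First, sine and cosine are periodic functions with upper and lower bounds, so they keep the positional encodings of sequences of any length within $[-1,1]$, and adding them to the word embeddings does not produce values of very different scale. Second, different dimensions of the encoding correspond to sine and cosine curves of different frequencies, which gives the multi-dimensional representation space some structure. Finally, by the trigonometric identities: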
 \begin{eqnarray}
-\textrm{sin}(\alpha + \beta) &=& \textrm{sin}\alpha \cdot \textrm{cos} \beta + \textrm{cos} \alpha \cdot \textrm{sin} \beta \nonumber  \\
+\textrm{sin}(\alpha + \beta) &=& \textrm{sin}\alpha \cdot \textrm{cos} \beta + \textrm{cos} \alpha \cdot \textrm{sin} \beta \label{eq:12-5}  \\
 \textrm{cos}(\alpha + \beta) &=&  \textrm{cos} \alpha  \cdot \textrm{cos} \beta - \textrm{sin} \alpha \cdot \textrm{sin} \beta
-\label{eq:12-45}
+\label{eq:12-6}
 \end{eqnarray}

 \parinterval the encoding of position $\textrm{pos}+k$ is:
 \begin{eqnarray}
 \textrm{PE}(\textrm{pos}+k,2i) &=& \textrm{PE}(\textrm{pos},2i) \cdot \textrm{PE}(k,2i+1) + \nonumber \\
-                      & & \textrm{PE}(\textrm{pos},2i+1) \cdot \textrm{PE}(k,2i)\\
+                      & & \textrm{PE}(\textrm{pos},2i+1) \cdot \textrm{PE}(k,2i)  \label{eq:12-7} \\
 \textrm{PE}(\textrm{pos}+k ,2i+1) &=& \textrm{PE}(\textrm{pos},2i+1) \cdot \textrm{PE}(k,2i+1) - \nonumber \\
                         & & \textrm{PE}(\textrm{pos},2i) \cdot \textrm{PE}(k,2i)
-\label{eq:12-46}
+\label{eq:12-8}
 \end{eqnarray}

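 \noindent That is, for any fixed offset $k$, $\textrm{PE}(\textrm{pos}+k)$ can be expressed as a linear function of $\textrm{PE}(\textrm{pos})$; in other words, the positional encoding can represent the distance between words. In practice, positional encoding has a large effect on the performance of Transformer systems, and improving it can bring further gains\upcite{Shaw2018SelfAttentionWR}.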
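
 \parinterval The shift property can be checked numerically. The sketch below computes $\textrm{PE}(\textrm{pos}+k)$ for one sine/cosine pair in two ways: directly, and from $\textrm{PE}(\textrm{pos})$ and $\textrm{PE}(k)$ using the angle-addition identities. The helper names and the test values (pos=7, k=5, i=3, $d_{\textrm{model}}=16$) are arbitrary choices made for illustration.

```python
import math


def pe_pair(pos, i, d_model):
    """Return (PE(pos, 2i), PE(pos, 2i+1)) for one frequency index i."""
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle), math.cos(angle)


def shifted_pair(pos, k, i, d_model):
    """Compute PE(pos+k, 2i) and PE(pos+k, 2i+1) from PE(pos) and PE(k) only."""
    s_p, c_p = pe_pair(pos, i, d_model)
    s_k, c_k = pe_pair(k, i, d_model)
    # sin(a+b) = sin a cos b + cos a sin b ; cos(a+b) = cos a cos b - sin a sin b
    return s_p * c_k + c_p * s_k, c_p * c_k - s_p * s_k


direct = pe_pair(7 + 5, 3, 16)
via_identity = shifted_pair(7, 5, 3, 16)
```

 \parinterval For a fixed $k$, the factors coming from $\textrm{PE}(k)$ are constants, so the map from $\textrm{PE}(\textrm{pos})$ to $\textrm{PE}(\textrm{pos}+k)$ is linear, as stated above.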
@@ -255,14 +255,14 @@

 \section{Dot-Product Multi-Head Attention}

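-\parinterval The Transformer abandons recurrent units, convolutions, and similar structures and builds the model entirely on attention, so it involves a large amount of attention computation. For example, self-attention extracts information from the source and target sequences, and encoder-decoder attention models the relation between the two sides of a bilingual sentence pair. The red boxes in Figure \ref{fig:12-44} mark the modules of the Transformer that use attention. All of these modules are implemented with dot-product multi-head attention.
+\parinterval The Transformer abandons recurrent units, convolutions, and similar structures and builds the model entirely on attention, so it involves a large amount of attention computation. For example, self-attention extracts information from the source and target sequences, and encoder-decoder attention models the relation between the two sides of a bilingual sentence pair. The red boxes in Figure \ref{fig:12-9} mark the modules of the Transformer that use attention. All of these modules are implemented with dot-product multi-head attention.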

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter12/Figures/figure-position-of-self-attention-mechanism-in-the-model}
 \caption{Position of the self-attention mechanism in the model}
-\label{fig:12-44}
+\label{fig:12-9}
 \end{figure}
 %----------------------------------------------

@@ -282,7 +282,7 @@
 \begin{eqnarray}
 \textrm{Attention}(\mathbi{Q},\mathbi{K},\mathbi{V}) = \textrm{Softmax}
 ( \frac{\mathbi{Q}\mathbi{K}^{\textrm{T}}} {\sqrt{d_k}} + \mathbi{Mask} ) \mathbi{V}
-\label{eq:12-47}
+\label{eq:12-9}
 \end{eqnarray}

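 \noindent First, multiplying $\mathbi{Q}$ by the transpose of $\mathbi{K}$ gives a relevance matrix of size $L \times L$, namely $\mathbi{Q}\mathbi{K}^{\textrm{T}}$, which holds the relevance between any two positions of a sequence. It is then scaled by the factor 1/$\sqrt{d_k}$. Scaling reduces the variance of the relevance matrix, so that its entries do not grow too large during computation, which helps training.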
@@ -290,25 +290,25 @@
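 \parinterval On top of this, a mask matrix $\mathbi{Mask}$ is added to the relevance matrix to screen out useless information. On the encoder side, for example, when several sentences are processed together they must be padded because their lengths differ, and the padded positions need to be masked. On the decoder side, the words to the right of the current target position must be masked during training, because they are not visible at inference time.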

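 \parinterval Next, the Softmax function normalizes the relevance matrix along each row. This can be understood as normalizing row $i$, whose result gives the attention weights over the vectors at the different positions of $\mathbi{V}$. The weighted sum of the $\mathrm{value}$ can be obtained directly as a matrix product of the weights and $\mathbi{V}$, that is, the product of $\textrm{Softmax}
- ( \frac{\mathbi{Q}\mathbi{K}^{\textrm{T}}} {\sqrt{d_k}} + \mathbi{Mask} )$ and $\mathbi{V}$. This gives the output of self-attention, which has exactly the same size as the input $\mathbi{V}$. Figure \ref{fig:12-45} shows the whole computation of dot-product attention.
+ ( \frac{\mathbi{Q}\mathbi{K}^{\textrm{T}}} {\sqrt{d_k}} + \mathbi{Mask} )$ and $\mathbi{V}$. This gives the output of self-attention, which has exactly the same size as the input $\mathbi{V}$. Figure \ref{fig:12-10} shows the whole computation of dot-product attention.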

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter12/Figures/figure-point-product-attention-model}
 \caption{Dot-product attention model}
-\label{fig:12-45}
+\label{fig:12-10}
 \end{figure}
 %----------------------------------------------

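-\parinterval The following simple example walks through the computation of dot-product attention. In Figure \ref{fig:12-46}, the yellow, blue, and orange matrices represent $\mathbi{Q}$, $\mathbi{K}$, and $\mathbi{V}$. Each cell of $\mathbi{Q}$, $\mathbi{K}$, and $\mathbi{V}$ corresponds to the model's representation of one word (a vector). First, the dot product, scaling, and masking produce the relevance matrix (pink). Second, each row of this intermediate matrix (pink) is normalized with the Softmax function, giving the final weight matrix (red). Each row of the red matrix is an attention distribution. Finally, $\mathbi{V}$ is weighted and summed row by row, which gives the representation of each word computed by dot-product attention. The main cost is two matrix multiplications: $\mathbi{Q}$ by $\mathbi{K}^{\textrm{T}}$, and the weight matrix by $\mathbi{V}$. Both run efficiently on a GPU, so the attention weights between all words of the sequence can be computed at once, along with the weighted sums for all positions, which greatly increases the parallelism of the model.
+\parinterval The following simple example walks through the computation of dot-product attention. In Figure \ref{fig:12-11}, the yellow, blue, and orange matrices represent $\mathbi{Q}$, $\mathbi{K}$, and $\mathbi{V}$. Each cell of $\mathbi{Q}$, $\mathbi{K}$, and $\mathbi{V}$ corresponds to the model's representation of one word (a vector). First, the dot product, scaling, and masking produce the relevance matrix (pink). Second, each row of this intermediate matrix (pink) is normalized with the Softmax function, giving the final weight matrix (red). Each row of the red matrix is an attention distribution. Finally, $\mathbi{V}$ is weighted and summed row by row, which gives the representation of each word computed by dot-product attention. The main cost is two matrix multiplications: $\mathbi{Q}$ by $\mathbi{K}^{\textrm{T}}$, and the weight matrix by $\mathbi{V}$. Both run efficiently on a GPU, so the attention weights between all words of the sequence can be computed at once, along with the weighted sums for all positions, which greatly increases the parallelism of the model.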
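
 \parinterval The steps above (dot product, scaling, masking, row-wise Softmax, weighted sum) can be sketched with plain Python lists. This is an illustrative toy, not an efficient implementation: real systems use batched matrix multiplications on a GPU. The helper names and the use of $-$1e9 as the stand-in for negative infinity are assumptions of this sketch.

```python
import math


def softmax(row):
    m = max(row)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, mask=None):
    """Softmax(Q K^T / sqrt(d_k) + Mask) V, with matrices as lists of row vectors."""
    d_k = len(K[0])
    out = []
    for qi, q in enumerate(Q):
        # One row of the relevance matrix: scaled dot products with every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if mask is not None:
            scores = [s + m for s, m in zip(scores, mask[qi])]
        weights = softmax(scores)             # attention distribution for this row
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


def future_mask(L, neg=-1e9):
    """0 on and below the diagonal, a large negative value above it."""
    return [[0.0 if j <= i else neg for j in range(L)] for i in range(L)]


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # three positions, d_k = 2
out = attention(X, X, X, future_mask(3))      # masked self-attention
```

 \parinterval With the future mask, the first position can attend only to itself, so its output is just its own value vector; later positions mix the values of all positions up to and including their own.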

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter12/Figures/figure-process-of-5}
-\caption{Example of executing Eq. \ref{eq:12-47}}
-\label{fig:12-46}
+\caption{Example of executing Equation \eqref{eq:12-9}}
+\label{fig:12-11}
 \end{figure}
 %----------------------------------------------

@@ -320,7 +320,7 @@

 \parinterval Another important technique in the Transformer is {\small\sffamily\bfseries{multi-head attention}}\index{Multi-head Attention}. ``Multi-head'' can be understood as splitting the original $\mathbi{Q}$, $\mathbi{K}$, and $\mathbi{V}$ evenly along the hidden dimension into several parts. With $h$ parts, this gives $\mathbi{Q} = \{ \mathbi{Q}_1, \mathbi{Q}_2,...,\mathbi{Q}_h \}$, $\mathbi{K}=\{ \mathbi{K}_1,\mathbi{K}_2,...,\mathbi{K}_h \}$, and $\mathbi{V}=\{ \mathbi{V}_1, \mathbi{V}_2,...,\mathbi{V}_h \}$. Multi-head attention runs the attention computation independently on each of the resulting slices of $\mathbi{Q}$, $\mathbi{K}$, and $\mathbi{V}$; that is, the result of the $i$-th head is $\mathbi{head}_i = \textrm{Attention}(\mathbi{Q}_i,\mathbi{K}_i, \mathbi{V}_i)$.

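-\parinterval The computation of multi-head attention is described in detail below with reference to Figure \ref{fig:12-48}:
+\parinterval The computation of multi-head attention is described in detail below with reference to Figure \ref{fig:12-12}: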

 \begin{itemize}
 \vspace{0.5em}
@@ -337,15 +337,15 @@
 \centering
 \input{./Chapter12/Figures/figure-multi-head-attention-model}
 \caption{Multi-head attention model}
-\label{fig:12-48}
+\label{fig:12-12}
 \end{figure}
 %----------------------------------------------

 \parinterval The multi-head mechanism can be written formally as:
 \begin{eqnarray}
-\textrm{MultiHead}(\mathbi{Q}, \mathbi{K} , \mathbi{V})& = & \textrm{Concat} (\mathbi{head}_1, ... , \mathbi{head}_h ) \mathbi{W}^{\,o} \label{eq:12-48} \\
+\textrm{MultiHead}(\mathbi{Q}, \mathbi{K} , \mathbi{V})& = & \textrm{Concat} (\mathbi{head}_1, ... , \mathbi{head}_h ) \mathbi{W}^{\,o} \label{eq:12-10} \\
 \mathbi{head}_i & = &\textrm{Attention} (\mathbi{Q}\mathbi{W}_i^{\,Q} , \mathbi{K}\mathbi{W}_i^{\,K}  , \mathbi{V}\mathbi{W}_i^{\,V} )
-\label{eq:12-49}
+\label{eq:12-11}
 \end{eqnarray}

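 \parinterval The benefit of the multi-head mechanism is that it lets the model learn in different representation subspaces. Many experiments have found that heads in different subspaces capture different information; for example, when the Transformer processes natural language, some heads capture syntactic information and others capture lexical information.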
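
 \parinterval A compact sketch of the multi-head computation follows. It makes one simplifying assumption that should be read as such: instead of separate per-head matrices $\mathbi{W}_i^{\,Q}$, $\mathbi{W}_i^{\,K}$, $\mathbi{W}_i^{\,V}$, it applies one full-width projection and then splits the result into $h$ slices along the hidden dimension, which is equivalent to using the corresponding column blocks as per-head matrices. Masking is omitted, and all function names are hypothetical.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Unmasked scaled dot-product attention."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]                 # transpose of K
    scores = matmul(Q, K_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def split_heads(X, h):
    """Split each row of X into h equal slices along the hidden dimension."""
    d_head = len(X[0]) // h
    return [[row[i * d_head:(i + 1) * d_head] for row in X] for i in range(h)]


def multi_head(Q, K, V, W_q, W_k, W_v, W_o, h):
    Qs = split_heads(matmul(Q, W_q), h)
    Ks = split_heads(matmul(K, W_k), h)
    Vs = split_heads(matmul(V, W_v), h)
    heads = [attention(q, k, v) for q, k, v in zip(Qs, Ks, Vs)]
    # Concat(head_1, ..., head_h): join the head outputs position by position.
    concat = [[v for head in heads for v in head[t]] for t in range(len(Q))]
    return matmul(concat, W_o)


def eye(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


X = [[1.0, 2.0, 3.0, 4.0]]                               # one position, hidden size 4
I = eye(4)                                               # identity stands in for learned weights
out = multi_head(X, X, X, I, I, I, I, h=2)
```

 \parinterval With a single position and identity weights, each head attends only to that position, so the output equals the input; with learned weights, each head would attend within its own projected subspace.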
@@ -356,13 +356,13 @@

 \subsection{Masking}

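-\parinterval Equation \eqref{eq:12-47} mentions the {\small\bfnew{mask}}\index{Mask}, whose purpose is to hide certain values in a vector so that values at irrelevant positions do not affect the computation. In the Transformer, masks are applied mainly when computing the relevance scores in attention: a mask matrix is added to the relevance matrix. The mask is negative infinity ($-$inf) at the positions to be masked (in practice a very large negative number such as $-$1e9) and 0 elsewhere. After Softmax normalization, the weights at the masked positions are approximately 0, so useless information receives essentially no weight and does not affect the result. The Transformer uses two kinds of mask:
+\parinterval Equation \eqref{eq:12-9} mentions the {\small\bfnew{mask}}\index{Mask}, whose purpose is to hide certain values in a vector so that values at irrelevant positions do not affect the computation. In the Transformer, masks are applied mainly when computing the relevance scores in attention: a mask matrix is added to the relevance matrix. The mask is negative infinity ($-$inf) at the positions to be masked (in practice a very large negative number such as $-$1e9) and 0 elsewhere. After Softmax normalization, the weights at the masked positions are approximately 0, so useless information receives essentially no weight and does not affect the result. The Transformer uses two kinds of mask: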

 \begin{itemize}
 \vspace{0.5em}
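 \item {\small\bfnew{Padding mask}}\index{Padding Mask}. When several samples are processed as a batch (in training or decoding), the source and target inputs are grouped into batches, and sequences within a batch differ in length. To represent the batch as a matrix, the sequences must be aligned by filling the shorter ones with 0 at the end as placeholders (the padding operation). These padded positions carry no meaning and should not take part in the attention computation, so they are masked out.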
 \vspace{0.5em}
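-\item {\small\bfnew{Future mask}}\index{Future Mask}. The decoder predicts from left to right, so its output at time step $t$ can depend only on the outputs before $t$. To keep training consistent with decoding and to avoid seeing, during training, information that lies in the future of each target position, future information must be masked. Concretely, a Mask matrix is constructed whose entries above the diagonal are all $-$inf. In the decoder-side computation, the future mask thus hides everything after the current position and prevents positions after $t$ from affecting the current computation. Figure \ref{fig:12-47} gives a concrete example.
+\item {\small\bfnew{Future mask}}\index{Future Mask}. The decoder predicts from left to right, so its output at time step $t$ can depend only on the outputs before $t$. To keep training consistent with decoding and to avoid seeing, during training, information that lies in the future of each target position, future information must be masked. Concretely, a Mask matrix is constructed whose entries above the diagonal are all $-$inf. In the decoder-side computation, the future mask thus hides everything after the current position and prevents positions after $t$ from affecting the current computation. Figure \ref{fig:12-13} gives a concrete example.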

 %----------------------------------------------
 % 图3.10
@@ -370,7 +370,7 @@
 \centering
 \input{./Chapter12/Figures/figure-mask-instance-for-future-positions-in-transformer}
 \caption{Example of a mask that hides future positions in the Transformer}
-\label{fig:12-47}
+\label{fig:12-13}
 \end{figure}
 %----------------------------------------------

@@ -381,7 +381,7 @@
 %    NEW SECTION
 %----------------------------------------------------------------------------------------

-\section{Residual Networks and Layer Regularization}
+\section{Residual Networks and Layer Normalization}

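 \parinterval The Transformer encoder and decoder each consist of multiple layers (usually 6), and each layer contains several sub-layers (self-attention and feed-forward networks). The Transformer is therefore a very deep network. In addition, dot-product attention involves many linear and nonlinear transformations, and the computation of the attention function Attention($\cdot$) itself spans several network layers, so information flow through the whole network is complex. From the viewpoint of backpropagation, each gradient passes through many steps on its way back, which easily leads to exploding or vanishing gradients. One remedy is the residual connection\upcite{DBLP:journals/corr/HeZRS15}, which was introduced in {\chapternine} and is not repeated here.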

@@ -408,8 +408,8 @@
 \begin{figure}[htp]
 \centering
 \input{./Chapter12/Figures/figure-position-of-difference-and-layer-regularization-in-the-model}
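-\caption{Position of residual connections and layer regularization in the model}
-\label{fig:12-50}
+\caption{Position of residual connections and layer normalization in the model}
+\label{fig:12-14}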
 \end{figure}
 %----------------------------------------------

@@ -417,25 +417,25 @@
 \begin{eqnarray}
 %x_{l+1} = x_l + F (x_l)
 \mathbi{x}^{l+1} = F (\mathbi{x}^l) + \mathbi{x}^l
-\label{eq:12-50}
+\label{eq:12-12}
 \end{eqnarray}

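-\noindent where $\mathbi{x}^l$ is the input vector of layer $l$ and $F (\mathbi{x}^l)$ is the sub-layer operation. This can make the outputs of different layers (or sub-layers) differ greatly, which makes training unstable and slow. To avoid this, a layer regularization operation\upcite{Ba2016LayerN} is added to each layer. The red boxes in Figure \ref{fig:12-50} show where residual connections and layer regularization sit in the Transformer. Layer regularization is computed as:
+\noindent where $\mathbi{x}^l$ is the input vector of layer $l$ and $F (\mathbi{x}^l)$ is the sub-layer operation. This can make the outputs of different layers (or sub-layers) differ greatly, which makes training unstable and slow. To avoid this, a layer normalization operation\upcite{Ba2016LayerN} is added to each layer. The red boxes in Figure \ref{fig:12-14} show where residual connections and layer normalization sit in the Transformer. Layer normalization is computed as: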
 \begin{eqnarray}
 \textrm{LN}(\mathbi{x}) = g \cdot \frac{\mathbi{x}- \mu} {\sigma} + b
-\label{eq:12-51}
+\label{eq:12-13}
 \end{eqnarray}

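 \noindent The formula uses the mean $\mu$ and the standard deviation $\sigma$ to shift and scale the sample, normalizing the data to mean 0 and variance 1. $g$ and $b$ are learnable parameters.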
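
 \parinterval As a minimal sketch of this formula, the function below normalizes one vector over its features. The small constant \texttt{eps} is an assumption added here to avoid division by zero; it does not appear in the formula above. $g$ and $b$ are kept as scalars to mirror the formula, although implementations commonly use one pair per feature.

```python
import math


def layer_norm(x, g=1.0, b=0.0, eps=1e-6):
    """LN(x) = g * (x - mu) / sigma + b, computed over the features of one vector."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    sigma = math.sqrt(var + eps)   # eps is an added safeguard, not part of the formula
    return [g * (v - mu) / sigma + b for v in x]


y = layer_norm([1.0, 2.0, 3.0, 4.0])
y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
```

 \parinterval With $g=1$ and $b=0$, the output has mean approximately 0 and variance approximately 1, whatever the scale of the input.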

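-\parinterval Two layer regularization structures are commonly used in the Transformer: {\small\bfnew{post-regularization}}\index{Post-regularization} (Post-norm)\index{Post-norm} and {\small\bfnew{pre-regularization}}\index{Pre-regularization} (Pre-norm)\index{Pre-norm}, shown in Figure \ref{fig:12-51}. Post-regularization applies the residual connection first and then layer regularization, whereas pre-regularization applies layer regularization to the input of the sub-layer. Practice has shown that pre-regularization is better for information flow and is therefore suited to training deep Transformer models\upcite{WangLearning}.
+\parinterval Two layer normalization structures are commonly used in the Transformer: {\small\bfnew{post-norm}}\index{Post-norm} and {\small\bfnew{pre-norm}}\index{Pre-norm}, shown in Figure \ref{fig:12-15}. Post-norm applies the residual connection first and then layer normalization, whereas pre-norm applies layer normalization to the input of the sub-layer. Practice has shown that pre-norm is better for information flow and is therefore suited to training deep Transformer models\upcite{WangLearning}.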

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter12/Figures/figure-different-regularization-methods}
-\caption{Different regularization schemes}
-\label{fig:12-51}
+\caption{Different normalization schemes}
+\label{fig:12-15}
 \end{figure}
 %----------------------------------------------
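
 \parinterval The difference between the two structures is only the order of operations around the residual connection. The sketch below contrasts them; \texttt{double} is a hypothetical stand-in for a real sub-layer $F$ (self-attention or feed-forward), and the learnable $g$ and $b$ of layer normalization are omitted for brevity.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x) + eps)
    return [(v - mu) / sigma for v in x]


def post_norm(x, sublayer):
    """Post-norm: x^{l+1} = LN(x^l + F(x^l))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm(x, sublayer):
    """Pre-norm: x^{l+1} = x^l + F(LN(x^l))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def double(v):
    # Hypothetical sub-layer used only for illustration.
    return [2.0 * t for t in v]


x = [1.0, 2.0, 3.0, 4.0]
post = post_norm(x, double)
pre = pre_norm(x, double)
```

 \parinterval In the pre-norm form the input $\mathbi{x}^l$ reaches the output through an identity path that is not rescaled, which is one way to see why it eases information flow in deep stacks.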

@@ -445,21 +445,21 @@

 \section{Feed-Forward Fully Connected Sub-layer}

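-\parinterval In the Transformer architecture, every encoder or decoder layer contains a feed-forward network; the red boxes in Figure \ref{fig:12-52} show its position in the model.
+\parinterval In the Transformer architecture, every encoder or decoder layer contains a feed-forward network; the red boxes in Figure \ref{fig:12-16} show its position in the model.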

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter12/Figures/figure-position-of-feedforward-neural-network-in-the-model}
 \caption{Position of the feed-forward network in the model}
-\label{fig:12-52}
+\label{fig:12-16}
 \end{figure}
 %----------------------------------------------

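 \parinterval The Transformer uses a fully connected network. Its main role is to map the representation produced by attention into a new space that is better suited to the subsequent nonlinear transformation and other operations. Experiments show that removing the fully connected network significantly hurts model performance. The Transformer's fully connected feed-forward network consists of two linear transformations and one nonlinear transformation (the ReLU activation: ReLU$(\mathbi{x})=\textrm{max}(0,\mathbi{x})$). The parameters of the feed-forward network are not shared across layers. It is computed as: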
 \begin{eqnarray}
 \textrm{FFN}(\mathbi{x}) = \textrm{max} (0,\mathbi{x}\mathbi{W}_1 + \mathbi{b}_1)\mathbi{W}_2 + \mathbi{b}_2
-\label{eq:12-52}
+\label{eq:12-14}
 \end{eqnarray}

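 \noindent where $\mathbi{W}_1$, $\mathbi{W}_2$, $\mathbi{b}_1$, and $\mathbi{b}_2$ are model parameters. The hidden size of the feed-forward network is usually larger than that of the attention part, and researchers have found this setting to be crucial for the Transformer. For example, the attention part may have hidden size 512 and the feed-forward part 2048. Increasing the feed-forward hidden size further, say to 4096 or even 8192, can still improve performance, but the feed-forward part then consumes a lot of memory and requires larger GPU devices. In practice one therefore has to balance translation accuracy against memory and speed.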
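
 \parinterval The feed-forward formula can be sketched for a single position as follows. The toy sizes (input size 2, hidden size 4) and the weight values are made up for illustration; in the setting described above they would be 512 and 2048.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one row vector x."""
    # First linear transformation followed by ReLU.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    # Second linear transformation back to the model dimension.
    return [sum(hj * w for hj, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]


x = [1.0, -2.0]
W1 = [[1.0, 0.0, -1.0, 0.0],
      [0.0, 1.0, 0.0, -1.0]]          # 2 x 4: expands to the larger hidden size
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 0.0],
      [0.0, 1.0]]                     # 4 x 2: projects back down
b2 = [0.5, 0.5]
y = ffn(x, W1, b1, W2, b2)
```

 \parinterval The same function is applied independently at every position of the sequence, which is why this sub-layer parallelizes trivially.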
@@ -489,11 +489,11 @@
 \item The Transformer also applies a learning-rate {\small\bfnew{warmup}}\index{Warmup} strategy, computed as:
 \begin{eqnarray}
 lrate = d_{\textrm{model}}^{-0.5} \cdot \textrm{min} (\textrm{step}^{-0.5} , \textrm{step} \cdot \textrm{warmup\_steps}^{-1.5})
-\label{eq:12-53}
+\label{eq:12-15}
 \end{eqnarray}

 \vspace{0.5em}
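-where $\textrm{step}$ is the number of updates (steps). The first 4000 updates are usually set as the warmup phase, i.e. $\textrm{warmup\_steps}=4000$. Figure \ref{fig:12-54} shows the learning-rate curve of the Transformer. Early in training the learning rate grows linearly from a small initial value; after a certain number of steps it gradually decreases. This reduces instability early in training, and once the model is relatively stable, the decreasing learning rate lets the model make finer adjustments. This learning-rate schedule is a major engineering contribution of the Transformer system.
+where $\textrm{step}$ is the number of updates (steps). The first 4000 updates are usually set as the warmup phase, i.e. $\textrm{warmup\_steps}=4000$. Figure \ref{fig:12-17} shows the learning-rate curve of the Transformer. Early in training the learning rate grows linearly from a small initial value; after a certain number of steps it gradually decreases. This reduces instability early in training, and once the model is relatively stable, the decreasing learning rate lets the model make finer adjustments. This learning-rate schedule is a major engineering contribution of the Transformer system.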
 \vspace{0.5em}
 \end{itemize}
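
 \parinterval The warmup schedule in the formula above is a one-line function. The sketch below uses $d_{\textrm{model}}=512$ and $\textrm{warmup\_steps}=4000$ as defaults, matching the values mentioned in the text; the function name \texttt{lrate} simply mirrors the symbol in the formula, and steps are assumed to be counted from 1.

```python
def lrate(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given update step (step >= 1)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


early = lrate(2000)    # linear warmup phase
peak = lrate(4000)     # the two branches meet at step == warmup_steps
late = lrate(16000)    # inverse-square-root decay phase
```

 \parinterval During warmup the second argument of \texttt{min} is the smaller one, so the rate grows linearly with the step; afterwards the first argument takes over and the rate decays as the inverse square root of the step.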

@@ -502,7 +502,7 @@ lrate = d_{\textrm{model}}^{-0.5} \cdot \textrm{min} (\textrm{step}^{-0.5} , \te
 \centering
 \input{./Chapter12/Figures/figure-lrate-of-transformer}
 \caption{Learning-rate curve of the Transformer model}
-\label{fig:12-54}
+\label{fig:12-17}
 \end{figure}
 %----------------------------------------------

@@ -510,14 +510,14 @@ lrate = d_{\textrm{model}}^{-0.5} \cdot \textrm{min} (\textrm{step}^{-0.5} , \te

 \begin{itemize}
 \vspace{0.5em}
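-\item {\small\bfnew{Mini-batch training}}\index{Mini-batch Training}: each update uses a fixed number of samples, that is, a small portion of the data selected from the training set. This converges quickly and makes it easy to keep the device well utilized. The batch size is usually set to 2048 or 4096 (counted in tokens, the number of words in each batch). The sentences in a batch are not chosen at random: they are usually sorted by length, and sentences of similar length are grouped into a batch. This reduces the amount of padding and improves training efficiency, as shown in Figure \ref{fig:12-55}.
+\item {\small\bfnew{Mini-batch training}}\index{Mini-batch Training}: each update uses a fixed number of samples, that is, a small portion of the data selected from the training set. This converges quickly and makes it easy to keep the device well utilized. The batch size is usually set to 2048 or 4096 (counted in tokens, the number of words in each batch). The sentences in a batch are not chosen at random: they are usually sorted by length, and sentences of similar length are grouped into a batch. This reduces the amount of padding and improves training efficiency, as shown in Figure \ref{fig:12-18}.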

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter12/Figures/figure-comparison-of-the-number-of-padding-in-batch}
 \caption{Comparison of batching methods (white areas are padding)}
-\label{fig:12-55}
+\label{fig:12-18}
 \end{figure}
 %----------------------------------------------
 \vspace{0.5em}
@@ -535,17 +535,17 @@ lrate = d_{\textrm{model}}^{-0.5} \cdot \textrm{min} (\textrm{step}^{-0.5} , \te
 \vspace{0.5em}
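 \item  Transformer Big: to increase network capacity, a wider network is used. Relative to Base, the hidden size is increased to 1024, the feed-forward size to 4096, the number of attention heads to 16, and Dropout to 0.3.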
 \vspace{0.5em}
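-\item Transformer Deep: deepening the encoder can further improve performance. Its settings are basically the same as those of Transformer Base, but the number of layers is increased to 48 and Pre-Norm is used as the layer regularization structure.
+\item Transformer Deep: deepening the encoder can further improve performance. Its settings are basically the same as those of Transformer Base, but the number of layers is increased to 48 and Pre-Norm is used as the layer normalization structure.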
 \vspace{0.5em}
 \end{itemize}

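-\parinterval Table \ref{tab:12-13} compares the three models on the WMT'16 data. Transformer Base scores lower in BLEU than the other two models but has the fewest parameters, and Transformer Deep performs better overall than Transformer Big.
+\parinterval Table \ref{tab:12-3} compares the three models on the WMT'16 data. Transformer Base scores lower in BLEU than the other two models but has the fewest parameters, and Transformer Deep performs better overall than Transformer Big.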

 %----------------------------------------------
 \begin{table}[htp]
 \centering
 \caption{Comparison of three Transformer models}
-\label{tab:12-13}
+\label{tab:12-3}
 \begin{tabular}{l | l l l}
 \multirow{2}{*}{System}   & \multicolumn{2}{c}{BLEU[\%]} & Parameters \\
                      & EN-DE  & EN-FR  &                                  \\ \hline
@@ -562,7 +562,7 @@ Transformer Deep（48层） & 30.2            & 43.1            & 194$\times 10^

 \section{Inference}

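-\parinterval The Transformer decoder generates the target word sequence much as other neural machine translation systems do: from left to right, with the prediction of the next word depending on the words already generated. Figure \ref{fig:12-56} shows the inference process, where $\mathbi{C}_i$ is the result of encoder-decoder attention. The decoder first generates the first word ``how'' from ``<sos>'' and $\mathbi{C}_1$, then the second word ``are'' from ``how'' and $\mathbi{C}_2$, and so on; inference ends when the decoder generates ``<eos>''.
+\parinterval The Transformer decoder generates the target word sequence much as other neural machine translation systems do: from left to right, with the prediction of the next word depending on the words already generated. Figure \ref{fig:12-19} shows the inference process, where $\mathbi{C}_i$ is the result of encoder-decoder attention. The decoder first generates the first word ``how'' from ``<sos>'' and $\mathbi{C}_1$, then the second word ``are'' from ``how'' and $\mathbi{C}_2$, and so on; inference ends when the decoder generates ``<eos>''.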

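 \parinterval However, the Transformer cannot parallelize over all positions at inference time, because each target word must attend to all preceding words, so inference is slow. Possible speed-ups include caching (storing variables that would otherwise be recomputed)\upcite{Vaswani2018Tensor2TensorFN}, low-precision computation\upcite{DBLP:journals/corr/CourbariauxB16,Lin2020TowardsF8}, and shared attention networks\upcite{Xiao2019SharingAW}. Methods for accelerating Transformer inference are discussed further in {\chapterfourteen}.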

@@ -571,7 +571,7 @@ Transformer Deep（48层） & 30.2            & 43.1            & 194$\times 10^
 \centering
 \input{./Chapter12/Figures/figure-decode-of-transformer}
 \caption{Example of the inference process of the Transformer model}
-\label{fig:12-56}
+\label{fig:12-19}
 \end{figure}
 %----------------------------------------------


--- a/Chapter15/Figures/figure-activation-function-swish-structure-diagram.png
+++ b/Chapter15/Figures/figure-activation-function-swish-structure-diagram.png
--- a/Chapter15/Figures/figure-dynamic-linear-aggregation-network-structure.tex
+++ b/Chapter15/Figures/figure-dynamic-linear-aggregation-network-structure.tex
@@ -4,21 +4,21 @@

 \node [anchor=north,rectangle, inner sep=0mm,minimum height=1.2em,minimum width=2em,rounded corners=5pt,thick] (n1) at (0, 0) {Encoder};

-\node [anchor=west,rectangle, inner sep=0mm,minimum height=1.2em,minimum width=0em,rounded corners=5pt,thick] (n2) at ([xshift=3.5em,yshift=-0.5em]n1.east) {$z_0$};
+\node [anchor=west,rectangle, inner sep=0mm,minimum height=1.2em,minimum width=0em,rounded corners=5pt,thick] (n2) at ([xshift=3.5em,yshift=-0.5em]n1.east) {$\mathbi{X}$};

-\node [anchor=west,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=3em,fill=orange!20,rounded corners=5pt,thick] (n3) at ([xshift=3.5em,yshift=0em]n2.east) {$z_1$};
+\node [anchor=west,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=3em,fill=orange!20,rounded corners=5pt,thick] (n3) at ([xshift=3.5em,yshift=0em]n2.east) {$\mathbi{x}_1$};

-\node [anchor=west,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=3em,fill=orange!20,rounded corners=5pt,thick] (n4) at ([xshift=3.5em,yshift=0em]n3.east) {$z_2$};
+\node [anchor=west,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=3em,fill=orange!20,rounded corners=5pt,thick] (n4) at ([xshift=3.5em,yshift=0em]n3.east) {$\mathbi{x}_2$};

 \node [anchor=west,rectangle, inner sep=0mm,minimum height=1.2em,minimum width=1em,rounded corners=5pt,thick] (n6) at ([xshift=1.5em,yshift=0em]n4.east) {$\ldots$};

-\node [anchor=west,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=3em,fill=orange!20,rounded corners=5pt,thick] (n5) at ([xshift=3.5em,yshift=0em]n6.east) {$z_{l}$};
+\node [anchor=west,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=3em,fill=orange!20,rounded corners=5pt,thick] (n5) at ([xshift=3.5em,yshift=0em]n6.east) {$\mathbi{x}_l$};

-\node [anchor=west,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=3em,fill=orange!20,rounded corners=5pt,thick] (n7) at ([xshift=1.5em,yshift=0em]n5.east) {$z_{l+1}$};
+\node [anchor=west,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=3em,fill=orange!20,rounded corners=5pt,thick] (n7) at ([xshift=1.5em,yshift=0em]n5.east) {$\mathbi{x}_{l+1}$};

 \node [anchor=north,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=15em,fill=teal!17,rounded corners=5pt,thick] (n8) at ([xshift=0em,yshift=-3em]n4.south) {Layer normalization};

-\node [anchor=north,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=15em,fill=purple!17,rounded corners=5pt,thick] (n9) at ([xshift=0em,yshift=-1em]n8.south) {$L_0\ \quad L_1\ \quad L_2\quad \ldots \quad\ L_l$};
+\node [anchor=north,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=15em,fill=purple!17,rounded corners=5pt,thick] (n9) at ([xshift=0em,yshift=-1em]n8.south) {$\mathbi{X}\ \quad \mathbi{h}_1\ \quad \mathbi{h}_2\quad \ldots \quad\ \mathbi{h}_l$};

 \node [anchor=north,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=15em,fill=teal!17,rounded corners=5pt,thick] (n10) at ([xshift=0em,yshift=-2em]n9.south) {Weighted sum};


--- a/Chapter15/Figures/figure-encoder-structure-of-transformer-model-optimized-by-nas.jpg
+++ b/Chapter15/Figures/figure-encoder-structure-of-transformer-model-optimized-by-nas.jpg
--- a/Chapter15/Figures/figure-evolution-and-change-of-ml-methods.jpg
+++ b/Chapter15/Figures/figure-evolution-and-change-of-ml-methods.jpg
--- a/Chapter15/Figures/figure-layer-fusion-method-2d.png
+++ b/Chapter15/Figures/figure-layer-fusion-method-2d.png
--- a/Chapter15/Figures/figure-layer-fusion-method.png
+++ b/Chapter15/Figures/figure-layer-fusion-method.png
--- a/Chapter15/Figures/figure-learning-of-local-structure-combination.png
+++ b/Chapter15/Figures/figure-learning-of-local-structure-combination.png
--- a/Chapter15/Figures/figure-linear-layer-aggregation-network.png
+++ b/Chapter15/Figures/figure-linear-layer-aggregation-network.png
--- a/Chapter15/Figures/figure-main-flow-of-neural-network-structure-search.png
+++ b/Chapter15/Figures/figure-main-flow-of-neural-network-structure-search.png
--- a/Chapter15/Figures/figure-model-structure-optimization-framework-for-specific-equipment.tex
+++ b/Chapter15/Figures/figure-model-structure-optimization-framework-for-specific-equipment.tex
+
+%%% outline
+%-------------------------------------------------------------------------
+\begin{tikzpicture}[scale=0.6]
+	\tikzstyle{every node}=[scale=0.6]
+	\tikzstyle{cell} = [draw,circle,inner sep=0pt,minimum size=1.4em]
+	\tikzstyle{point} = [circle,fill=white,minimum size=0.46em,inner sep=0pt]
+	\tikzstyle{point_line} = [-,white,line width=1.4pt]
+	\tikzstyle{cell_line} = [-,thick]
+	\tikzstyle{background} = [rounded corners=4pt,minimum width=44em,fill=gray!20]
+
+%figure 1
+\node[inner sep=0pt,minimum size=4em,fill=black,rounded corners=6pt] (n1) at (-1.6em,0) {};
+\node[circle,fill=white,minimum size=2em] at (-1.6em,0){};
+\node[circle,fill=black,minimum size=1.4em] (n2) at (1.8em,0){};
+\draw[line width=4pt,-] ([xshift=-0.1em]n1.east) -- ([xshift=0.1em]n2.west);
+\draw[line width=4pt,-, out=-45,in=45] ([xshift=0.2em,yshift=1em]n2.east) to ([xshift=0.2em,yshift=-1em]n2.east);
+\draw[line width=4pt,-, out=-45,in=45] ([xshift=0.8em,yshift=1.4em]n2.east) to ([xshift=0.8em,yshift=-1.4em]n2.east);
+\draw[line width=4pt,-, out=-45,in=45] ([xshift=1.4em,yshift=1.8em]n2.east) to ([xshift=1.4em,yshift=-1.8em]n2.east);
+
+
+%figure 2
+\node[inner sep=0pt,minimum size=6em,fill=black,rounded corners=6pt] (n3) at (14em,0) {};
+\node[inner sep=0pt,minimum size=5em,fill=white,rounded corners=8pt] (n4) at (14em,0) {};
+\node[inner sep=0pt,minimum size=4.4em,fill=black,rounded corners=6pt] (n5) at (14em,0) {};
+
+\node[point] (dot1) at (12.7em,-1em){};
+\node[point] (dot2) at (12.8em,1.2em){};
+\node[point] (dot3) at (14.8em,-0.9em){};
+\node[point] (dot4) at (15.2em,-0.2em){};
+\node[point] (dot5) at (13.5em,0.3em){};
+\node[point] (dot6) at (14em,0.8em){};
+\draw[point_line] ([yshift=-1em]dot1.south) -- ([yshift=0.1em]dot1.south);
+\draw[point_line] ([yshift=-1.1em,xshift=-1.6em]dot3.south) -- ([yshift=-0.5em,xshift=-1.6em]dot3.south)-- ([yshift=-0.5em]dot3.south) -- ([yshift=0.1em]dot3.south);
+\draw[point_line] ([xshift=-0.1em]dot4.east) -- ([xshift=1em]dot4.east);
+\draw[point_line] ([xshift=-1em,yshift=-1.3em]dot2.south) -- ([yshift=-1.3em]dot2.south) -- ([yshift=0.1em]dot2.south);
+\draw[point_line] ([xshift=-0.1em]dot5.east) -- ([xshift=0.9em]dot5.east) --([xshift=0.9em,yshift=0.8em]dot5.east) -- ([xshift=2.6em,yshift=0.8em]dot5.east);
+\draw[point_line] ([yshift=-0.1em]dot6.north) -- ([yshift=0.6em]dot6.north) -- ([yshift=0.6em,xshift=1em]dot6.north) -- ([yshift=1.2em,xshift=1em]dot6.north);
+
+\foreach \x in {1,2,3,4}{
+\node[fill=black,inner sep=0pt,minimum width=1.2em,minimum height=0.6em] at (17.8em,-3em+1.2em*\x){};
+\node[inner sep=0pt,circle,minimum size=0.6em,fill=black] at (18.4em,-3em+1.2em*\x){};}
+
+\foreach \x in {1,2,3,4}{
+\node[fill=black,inner sep=0pt,minimum width=1.2em,minimum height=0.6em] at (10.2em,-3em+1.2em*\x){};
+\node[inner sep=0pt,circle,minimum size=0.6em,fill=black] at (9.6em,-3em+1.2em*\x){};}
+
+\foreach \x in {1,2,3,4}{
+\node[fill=black,inner sep=0pt,minimum width=0.6em,minimum height=1.2em] at (10.9em+1.2em*\x,3.8em){};
+\node[inner sep=0pt,circle,minimum size=0.6em,fill=black] at (10.9em+1.2em*\x,4.2em){};}
+
+\foreach \x in {1,2,3,4}{
+\node[fill=black,inner sep=0pt,minimum width=0.6em,minimum height=1.2em] at (10.9em+1.2em*\x,-3.8em){};
+\node[inner sep=0pt,circle,minimum size=0.6em,fill=black] at (10.9em+1.2em*\x,-4.2em){};}
+
+%figure 3
+\node[circle,line width=2pt,minimum size=3.8em,draw,fill=white] (f1) at (26em,0){};
+\node[circle,line width=2pt,minimum size=1em,draw,fill=white] (c1) at (26em,0){};
+\node[circle,line width=2pt,minimum size=3.8em,draw,fill=white] (f2)at (30.4em,0){};
+\node[circle,line width=2pt,minimum size=1em,draw,fill=white] (c2) at (30.4em,0){};
+
+\draw[line width=2pt,out=90,in=90] ([xshift=0.1em]c1.180) to ([xshift=0.1em]f1.180);
+\draw[line width=2pt,out=45,in=45] ([xshift=0.1em,yshift=-0.1em]c1.135) to ([xshift=0.1em,yshift=-0.1em]f1.135);
+\draw[line width=2pt,out=0,in=0] ([yshift=-0.1em]c1.90) to ([yshift=-0.1em]f1.90);
+\draw[line width=2pt,out=-45,in=-45] ([xshift=-0.1em,yshift=-0.1em]c1.45) to ([xshift=-0.1em,yshift=-0.1em]f1.45);
+\draw[line width=2pt,out=-90,in=-90] ([xshift=-0.1em]c1.0) to ([xshift=-0.1em]f1.0);
+\draw[line width=2pt,out=-135,in=-135] ([xshift=-0.1em,yshift=0.1em]c1.-45) to ([xshift=-0.1em,yshift=0.1em]f1.-45);
+\draw[line width=2pt,out=-180,in=-180] ([yshift=0.1em]c1.-90) to ([yshift=0.1em]f1.-90);
+\draw[line width=2pt,out=-225,in=-225] ([xshift=0.1em,yshift=0.1em]c1.-135) to ([xshift=0.1em,yshift=0.1em]f1.-135);
+
+\draw[line width=2pt,out=90,in=90] ([xshift=0.1em]c2.180) to ([xshift=0.1em]f2.180);
+\draw[line width=2pt,out=45,in=45] ([xshift=0.1em,yshift=-0.1em]c2.135) to ([xshift=0.1em,yshift=-0.1em]f2.135);
+\draw[line width=2pt,out=0,in=0] ([yshift=-0.1em]c2.90) to ([yshift=-0.1em]f2.90);
+\draw[line width=2pt,out=-45,in=-45] ([xshift=-0.1em,yshift=-0.1em]c2.45) to ([xshift=-0.1em,yshift=-0.1em]f2.45);
+\draw[line width=2pt,out=-90,in=-90] ([xshift=-0.1em]c2.0) to ([xshift=-0.1em]f2.0);
+\draw[line width=2pt,out=-135,in=-135] ([xshift=-0.1em,yshift=0.1em]c2.-45) to ([xshift=-0.1em,yshift=0.1em]f2.-45);
+\draw[line width=2pt,out=-180,in=-180] ([yshift=0.1em]c2.-90) to ([yshift=0.1em]f2.-90);
+\draw[line width=2pt,out=-225,in=-225] ([xshift=0.1em,yshift=0.1em]c2.-135) to ([xshift=0.1em,yshift=0.1em]f2.-135);
+
+\draw[line width=2pt,-] (22.8em,2.6em) -- (23.4em,2.6em) -- (23.4em,-3em);
+\draw[line width=2pt,-] (23.4em,1.6em) -- (23em,1.6em) -- (23em,0.8em) -- (23.4em,0.8em);
+\draw[line width=2pt,-] (23.4em,0.4em) -- (23em,0.4em) -- (23em,-2em) -- (23.4em,-2em);
+\draw[line width=2pt,-] (23.4em,2em) -- (24.4em,2em) -- (25em,2.6em) -- (30.4em,2.6em);
+\draw[line width=2pt,-] (23.4em,-2.5em) -- (30.8em,-2.5em);
+\draw[line width=2pt,-] (25em,-2.5em) -- (25em,-3em) -- (26em,-3em) -- (26em,-2.5em);
+\draw[line width=2pt,-] (26.5em,-2.5em) -- (26.5em,-3em) -- (29.4em, -3em) -- (29.4em,-2.5em) ;
+\draw[line width=2pt,-,out=0,in=120] (30.4em,2.6em) to (32.8em,1em);
+\draw[line width=2pt,-,out=0,in=-70] (30.8em,-2.5em) to (33.2em,1em);
+\draw[line width=2pt,-] (32.8em,1em) to (33.2em,1em);
+\draw[line width=2pt,-] (27.4em,2.6em) -- (28.4em,1.6em) -- (29.6em,1.6em);
+
+\node[] at (0em,4.6em) {\huge\bfnew{Different Hardware}};
+\node[] at (6.4em,22em) {\huge\bfnew{SubTransformers(Weight-Sharing)}};
+\node[] at (-0.4em,37.6em) {\huge\bfnew{SuperTransformer}};
+\node[] at (0.4em,-6.2em) {\huge\bfnew{IOT}};
+\node[] at (14em,-6.2em) {\huge\bfnew{CPU}};
+\node[] at (28em,-6.2em) {\huge\bfnew{GPU}};
+\begin{pgfonlayer}{background}
+{
+\draw[line width=3pt,-] (23.4em,-1.6em) -- (28.2em,-1.6em) -- (28.6em,-2.5em);
+\node[background,minimum height=14em] (bg1) at (13.4em,-0.8em){};
+\node[background,minimum height=11.4em] (bg2) at (13.4em,17.8em){};
+\node[background,minimum height=10em] (bg3) at (13.4em,34em){};
+\draw[thick,out=45,in=-45,-latex] (bg1.north east) to (bg2.north east);
+}
+\end{pgfonlayer}
+
+\node[cell,fill=red!60] (c1_1) at (-1.5em,18.5em){};
+\node[cell,fill=red!60] (c1_2) at (1.5em,18.5em){};
+\node[cell,fill=ublue!60] (c1_3) at (-1.5em,15.5em){};
+\node[cell,fill=ublue!60] (c1_4) at (1.5em,15.5em){};
+
+\node[cell,fill=red!60] (c2_1) at (12.5em,20em){};
+\node[cell,fill=red!60] (c2_2) at (15.5em,20em){};
+\node[cell,fill=ublue!60] (c2_3) at (12.5em,17em){};
+\node[cell,fill=ublue!60] (c2_4) at (15.5em,17em){};
+\node[cell,fill=orange!60] (c2_5) at (9.5em,14em){};
+\node[cell,fill=orange!60] (c2_6) at (12.5em,14em){};
+\node[cell,fill=orange!60] (c2_7) at (15.5em,14em){};
+\node[cell,fill=orange!60] (c2_8) at (18.5em,14em){};
+
+\node[cell,fill=red!60] (c3_1) at (23.5em,19em){};
+\node[cell,fill=red!60] (c3_2) at (26.5em,19em){};
+\node[cell,fill=red!60] (c3_3) at (29.5em,19em){};
+\node[cell,fill=red!60] (c3_4) at (32.5em,19em){};
+\node[cell,fill=ublue!60] (c3_5) at (23.5em,15em){};
+\node[cell,fill=ublue!60] (c3_6) at (26.5em,15em){};
+\node[cell,fill=ublue!60] (c3_7) at (29.5em,15em){};
+\node[cell,fill=ublue!60] (c3_8) at (32.5em,15em){};
+
+
+\draw[-,thick] (c1_1.-90) -- (c1_3.90);
+\draw[-,thick] (c1_1.-45) -- (c1_4.135);
+\draw[-,thick] (c1_2.-90) -- (c1_4.90);
+\draw[-,thick] (c1_2.-135) -- (c1_3.45);
+
+\draw[-,thick] (c2_1.-90) -- (c2_3.90);
+\draw[-,thick] (c2_1.-45) -- (c2_4.135);
+\draw[-,thick] (c2_2.-90) -- (c2_4.90);
+\draw[-,thick] (c2_2.-135) -- (c2_3.45);
+
+\draw[-,thick] (c2_3.-90) -- (c2_6.90);
+\draw[-,thick] (c2_3.-135) -- (c2_5.45);
+\draw[-,thick] (c2_3.-45) -- (c2_7.135);
+\draw[-,thick] (c2_3.-30) -- (c2_8.150);
+\draw[-,thick] (c2_4.-150) -- (c2_5.30);
+\draw[-,thick] (c2_4.-135) -- (c2_6.45);
+\draw[-,thick] (c2_4.-90) -- (c2_7.90);
+\draw[-,thick] (c2_4.-45) -- (c2_8.135);
+
+\draw[-,thick] (c3_1.-90) -- (c3_5.90);
+\draw[-,thick] (c3_1.-45) -- (c3_6.135);
+\draw[-,thick] (c3_1.-30) -- (c3_7.150);
+\draw[-,thick] (c3_1.-15) -- (c3_8.165);
+\draw[-,thick] (c3_2.-90) -- (c3_6.90);
+\draw[-,thick] (c3_2.-135) -- (c3_5.45);
+\draw[-,thick] (c3_2.-45) -- (c3_7.135);
+\draw[-,thick] (c3_2.-30) -- (c3_8.150);
+\draw[-,thick] (c3_3.-150) -- (c3_5.30);
+\draw[-,thick] (c3_3.-135) -- (c3_6.45);
+\draw[-,thick] (c3_3.-90) -- (c3_7.90);
+\draw[-,thick] (c3_3.-45) -- (c3_8.135);
+\draw[-,thick] (c3_4.-165) -- (c3_5.15);
+\draw[-,thick] (c3_4.-150) -- (c3_6.30);
+\draw[-,thick] (c3_4.-135) -- (c3_7.45);
+\draw[-,thick] (c3_4.-90) -- (c3_8.90);
+
+
+\node[cell,fill=red!60] (c4_1) at (9.5em,37em){};
+\node[cell,fill=red!60] (c4_2) at (12.5em,37em){};
+\node[cell,fill=red!60] (c4_3) at (15.5em,37em){};
+\node[cell,fill=red!60] (c4_4) at (18.5em,37em){};
+\node[cell,fill=ublue!60] (c4_5) at (9.5em,34em){};
+\node[cell,fill=ublue!60] (c4_6) at (12.5em,34em){};
+\node[cell,fill=ublue!60] (c4_7) at (15.5em,34em){};
+\node[cell,fill=ublue!60] (c4_8) at (18.5em,34em){};
+\node[cell,fill=orange!60] (c4_9) at (9.5em,31em){};
+\node[cell,fill=orange!60] (c4_10) at (12.5em,31em){};
+\node[cell,fill=orange!60] (c4_11) at (15.5em,31em){};
+\node[cell,fill=orange!60] (c4_12) at (18.5em,31em){};
+
+
+\draw[-,thick] (c4_1.-90) -- (c4_5.90);
+\draw[-,thick] (c4_1.-45) -- (c4_6.135);
+\draw[-,thick] (c4_1.-30) -- (c4_7.150);
+\draw[-,thick] (c4_1.-15) -- (c4_8.165);
+\draw[-,thick] (c4_2.-90) -- (c4_6.90);
+\draw[-,thick] (c4_2.-135) -- (c4_5.45);
+\draw[-,thick] (c4_2.-45) -- (c4_7.135);
+\draw[-,thick] (c4_2.-30) -- (c4_8.150);
+\draw[-,thick] (c4_3.-150) -- (c4_5.30);
+\draw[-,thick] (c4_3.-135) -- (c4_6.45);
+\draw[-,thick] (c4_3.-90) -- (c4_7.90);
+\draw[-,thick] (c4_3.-45) -- (c4_8.135);
+\draw[-,thick] (c4_4.-165) -- (c4_5.15);
+\draw[-,thick] (c4_4.-150) -- (c4_6.30);
+\draw[-,thick] (c4_4.-135) -- (c4_7.45);
+\draw[-,thick] (c4_4.-90) -- (c4_8.90);
+
+\draw[-,thick] (c4_5.-90) -- (c4_9.90);
+\draw[-,thick] (c4_5.-45) -- (c4_10.135);
+\draw[-,thick] (c4_5.-30) -- (c4_11.150);
+\draw[-,thick] (c4_5.-15) -- (c4_12.165);
+\draw[-,thick] (c4_6.-90) -- (c4_10.90);
+\draw[-,thick] (c4_6.-135) -- (c4_9.45);
+\draw[-,thick] (c4_6.-45) -- (c4_11.135);
+\draw[-,thick] (c4_6.-30) -- (c4_12.150);
+\draw[-,thick] (c4_7.-150) -- (c4_9.30);
+\draw[-,thick] (c4_7.-135) -- (c4_10.45);
+\draw[-,thick] (c4_7.-90) -- (c4_11.90);
+\draw[-,thick] (c4_7.-45) -- (c4_12.135);
+\draw[-,thick] (c4_8.-165) -- (c4_9.15);
+\draw[-,thick] (c4_8.-150) -- (c4_10.30);
+\draw[-,thick] (c4_8.-135) -- (c4_11.45);
+\draw[-,thick] (c4_8.-90) -- (c4_12.90);
+
+\draw[line width=2pt,-latex] ([xshift=0.5em,yshift=-0.6em]bg3.south) -- ([xshift=0.5em,yshift=0.6em]bg2.north); 
+\draw[line width=2pt,-latex] ([xshift=-2.5em,yshift=-0.6em]bg3.south) -- ([xshift=-9.5em,yshift=0.6em]bg2.north);
+\draw[line width=2pt,-latex] ([xshift=3.5em,yshift=-0.6em]bg3.south) -- ([xshift=9.5em,yshift=0.6em]bg2.north);
+\draw[line width=2pt,-latex] (0em, 10.4em) -- (0em, 7em); 
+\draw[line width=2pt,-latex] (14em, 10.4em) -- (14em, 7em);
+\draw[line width=2pt,-latex] (28em, 10.4em) -- (28em, 7em);
+\node[align=center] at (0.5em,-9em){\Large\bfnew{TinyML}};
+\node[align=center] at (14em,-9em){\Large\bfnew{Deeper}};
+\node[align=center] at (28em,-9em){\Large\bfnew{Wider}};
+\node[font=\footnotesize,align=center] at (30em,26em){\Large\bfnew{Evolutionary Search with} \\ \Large\bfnew{Hardware Constraints}};
+\node[font=\footnotesize,align=center] at ([yshift=-1em]bg2.south){\Large\bfnew{Specialized Deployment is Efficient}};
+\node[font=\footnotesize,align=center] at ([xshift=3em,yshift=5em]bg1.east){\Large\bfnew{Hardware}\\\Large\bfnew{Latency}\\\Large\bfnew{Feedback}};
+
+\node[circle,inner sep=0pt,minimum size=3em,draw,fill=white] (clock) at ([xshift=3.4em,yshift=-2.4em]bg2.east){};
+\node[circle,inner sep=0pt,minimum size=0.4em,draw,fill=black] (clock2)at ([xshift=3.4em,yshift=-2.4em]bg2.east){};
+\draw[] ([xshift=-0.3em]clock.0) arc (1:180:1.2em);
+\draw[fill=black] (clock2.90) -- ([xshift=0.6em,yshift=-0.6em]clock.135) -- (clock2.180) -- (clock2.90);
+\end{tikzpicture}
+
+
+
+
--- a/Chapter15/Figures/figure-post-norm-vs-pre-norm.tex
+++ b/Chapter15/Figures/figure-post-norm-vs-pre-norm.tex
@@ -5,25 +5,25 @@

 \begin{scope}[minimum height = 20pt]

-\node [anchor=east] (x1) at (-0.5em, 0) {$x_l$};
-\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (F1) at ([xshift=2em]x1.east){\small{$\mathcal{F}$}};
+\node [anchor=east] (x1) at (-0.5em, 0) {$\mathbi{x}_l$};
+\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (F1) at ([xshift=2em]x1.east){\small{$F$}};
 \node [anchor=west,circle,draw,minimum size=1em] (n1) at ([xshift=2em]F1.east) {};
 \node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (ln1) at ([xshift=2em]n1.east){\small{\textrm{LN}}};
-\node [anchor=west] (x2) at ([xshift=2em]ln1.east) {$x_{l+1}$};
+\node [anchor=west] (x2) at ([xshift=2em]ln1.east) {$\mathbi{x}_{l+1}$};

-\node [anchor=north] (x3) at ([yshift=-5em]x1.south) {$x_l$};
+\node [anchor=north] (x3) at ([yshift=-5em]x1.south) {$\mathbi{x}_l$};
 \node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (F2) at ([xshift=2em]x3.east){\small{\textrm{LN}}};
-\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln2) at ([xshift=2em]F2.east){\small{$\mathcal{F}$}};
+\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln2) at ([xshift=2em]F2.east){\small{$F$}};
 \node [anchor=west,circle,draw,,minimum size=1em] (n2) at ([xshift=2em]ln2.east){};
-\node [anchor=west] (x4) at ([xshift=2em]n2.east) {$x_{l+1}$};
+\node [anchor=west] (x4) at ([xshift=2em]n2.east) {$\mathbi{x}_{l+1}$};

 \draw[->, line width=1pt] ([xshift=-0.1em]x1.east)--(F1.west);
 \draw[->, line width=1pt] ([xshift=-0.1em]F1.east)--(n1.west);
-\draw[->, line width=1pt] (n1.east)--node[above]{$y_l$}(ln1.west);
+\draw[->, line width=1pt] (n1.east)--node[above]{$\mathbi{y}_l$}(ln1.west);
 \draw[->, line width=1pt] ([xshift=-0.1em]ln1.east)--(x2.west);
 \draw[->, line width=1pt] ([xshift=-0.1em]x3.east)--(F2.west);
 \draw[->, line width=1pt] ([xshift=-0.1em]F2.east)--(ln2.west);
-\draw[->, line width=1pt] ([xshift=0.1em]ln2.east)--node[above]{$y_l$}(n2.west);
+\draw[->, line width=1pt] ([xshift=0.1em]ln2.east)--node[above]{$\mathbi{y}_l$}(n2.west);
 \draw[->, line width=1pt] (n2.east)--(x4.west);
 \draw[->,rounded corners,line width=1pt] ([yshift=-0.2em]x1.north) -- ([yshift=1em]x1.north) -- ([yshift=1.4em]n1.north) -- (n1.north);
 \draw[->,rounded corners,line width=1pt] ([yshift=-0.2em]x3.north) -- ([yshift=1em]x3.north) -- ([yshift=1.4em]n2.north) -- (n2.north);

--- a/Chapter15/Figures/figure-progressive-training.tex
+++ b/Chapter15/Figures/figure-progressive-training.tex
@@ -4,22 +4,22 @@
 \begin{tikzpicture}
 \begin{scope}

-\node [anchor=east,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s11) at (-0.5em, 0) {\footnotesize{$\times h$}};
+\node [anchor=east,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s11) at (-0.5em, 0) {\footnotesize{$\times l$}};
 \node [rectangle,anchor=west,fill=blue!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s12) at ([xshift=1.2em]s11.east) {};

-\node [anchor=north,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s21) at ([yshift=-1.2em]s11.south) {\footnotesize{$\times h$}};
-\node [anchor=west,fill=orange!20,draw=red,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em,dashed] (s22) at ([xshift=1.2em]s21.east) {\footnotesize{$\times h$}};
+\node [anchor=north,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s21) at ([yshift=-1.2em]s11.south) {\footnotesize{$\times l$}};
+\node [anchor=west,fill=orange!20,draw=red,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em,dashed] (s22) at ([xshift=1.2em]s21.east) {\footnotesize{$\times l$}};
 \node [anchor=west,fill=blue!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s23) at ([xshift=1.2em]s22.east) {};

-\node [anchor=north,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s31) at ([yshift=-1.2em]s21.south) {\footnotesize{$\times h$}};
-\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s32) at ([xshift=1.2em]s31.east) {\footnotesize{$\times h$}};
-\node [anchor=west,fill=orange!20,draw=red,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em,dashed] (s33) at ([xshift=1.2em]s32.east) {\footnotesize{$\times h$}};
+\node [anchor=north,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s31) at ([yshift=-1.2em]s21.south) {\footnotesize{$\times l$}};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s32) at ([xshift=1.2em]s31.east) {\footnotesize{$\times l$}};
+\node [anchor=west,fill=orange!20,draw=red,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em,dashed] (s33) at ([xshift=1.2em]s32.east) {\footnotesize{$\times l$}};
 \node [anchor=west,fill=blue!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s34) at ([xshift=1.2em]s33.east) {};

-\node [anchor=north,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s41) at ([yshift=-1.2em]s31.south) {\footnotesize{$\times h$}};
-\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s42) at ([xshift=1.2em]s41.east) {\footnotesize{$\times h$}};
-\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s43) at ([xshift=1.2em]s42.east) {\footnotesize{$\times h$}};
-\node [anchor=west,fill=orange!20,draw=red,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em,dashed] (s44) at ([xshift=1.2em]s43.east) {\footnotesize{$\times h$}};
+\node [anchor=north,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s41) at ([yshift=-1.2em]s31.south) {\footnotesize{$\times l$}};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s42) at ([xshift=1.2em]s41.east) {\footnotesize{$\times l$}};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s43) at ([xshift=1.2em]s42.east) {\footnotesize{$\times l$}};
+\node [anchor=west,fill=orange!20,draw=red,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em,dashed] (s44) at ([xshift=1.2em]s43.east) {\footnotesize{$\times l$}};
 \node [anchor=west,fill=blue!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s45) at ([xshift=1.2em]s44.east) {};

 \node [anchor=east] (p1) at ([xshift=-2em]s11.west) {\footnotesize{step 1}};

--- a/Chapter15/Figures/figure-relationship-between-structures-in-structural-space.jpg
+++ b/Chapter15/Figures/figure-relationship-between-structures-in-structural-space.jpg
--- a/Chapter15/Figures/figure-structure-search-based-on-evolutionary-algorithm.png
+++ b/Chapter15/Figures/figure-structure-search-based-on-evolutionary-algorithm.png
--- a/Chapter15/Figures/figure-structure-search-based-on-gradient-method.png
+++ b/Chapter15/Figures/figure-structure-search-based-on-gradient-method.png
--- a/Chapter15/Figures/figure-structure-search-based-on-reinforcement-learning.png
+++ b/Chapter15/Figures/figure-structure-search-based-on-reinforcement-learning.png
--- a/Chapter15/Figures/figure-sublayer-skip.tex
+++ b/Chapter15/Figures/figure-sublayer-skip.tex
@@ -5,29 +5,29 @@

 \begin{scope}[minimum height = 20pt]

-\node [anchor=east] (x1) at (-0.5em, 0) {$x_l$};
+\node [anchor=east] (x1) at (-0.5em, 0) {$\mathbi{x}_l$};
 \node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln1) at ([xshift=1em]x1.east){\small{\textrm{LN}}};
-\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f1) at ([xshift=0.6em]ln1.east){\small{$\mathcal{F}$}};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f1) at ([xshift=0.6em]ln1.east){\small{$F$}};
 \node [anchor=west,circle,draw,,minimum size=1em] (n1) at ([xshift=3em]f1.east){};
-\node [anchor=west] (x2) at ([xshift=1em]n1.east) {$x_{l+1}$};
+\node [anchor=west] (x2) at ([xshift=1em]n1.east) {$\mathbi{x}_{l+1}$};
 \node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln12) at ([xshift=1em]x2.east){\small{\textrm{LN}}};
-\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f12) at ([xshift=0.6em]ln12.east){\small{$\mathcal{F}$}};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f12) at ([xshift=0.6em]ln12.east){\small{$F$}};
 \node [anchor=west,circle,draw,,minimum size=1em] (n12) at ([xshift=3em]f12.east){};
-\node [anchor=west] (x22) at ([xshift=1em]n12.east) {$x_{l+2}$};
+\node [anchor=west] (x22) at ([xshift=1em]n12.east) {$\mathbi{x}_{l+2}$};

-\node [anchor=north] (x3) at ([yshift=-5em]x1.south) {$x_l$};
+\node [anchor=north] (x3) at ([yshift=-5em]x1.south) {$\mathbi{x}_l$};
 \node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln2) at ([xshift=1em]x3.east){\small{\textrm{LN}}};
-\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f2) at ([xshift=0.6em]ln2.east){\small{$\mathcal{F}$}};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f2) at ([xshift=0.6em]ln2.east){\small{$F$}};
 \node [anchor=west,minimum size=1em] (p1) at ([xshift=1em]f2.east){};
-\node [anchor=north] (m1) at ([yshift=0.6em]p1.south){\tiny{\red{$M=1$}}};
+\node [anchor=north] (m1) at ([yshift=0.6em]p1.south){\footnotesize{\red{Mask=1}}};
 \node [anchor=west,circle,draw,,minimum size=1em] (n2) at ([xshift=3em]f2.east){};
-\node [anchor=west] (x4) at ([xshift=1em]n2.east) {$x_{l+1}$};
+\node [anchor=west] (x4) at ([xshift=1em]n2.east) {$\mathbi{x}_{l+1}$};
 \node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln22) at ([xshift=1em]x4.east){\small{\textrm{LN}}};
-\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f22) at ([xshift=0.6em]ln22.east){\small{$\mathcal{F}$}};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f22) at ([xshift=0.6em]ln22.east){\small{$F$}};
 \node [anchor=west,minimum size=1em] (p2) at ([xshift=1em]f22.east){};
-\node [anchor=north] (m2) at ([yshift=0.6em]p2.south){\tiny{\red{$M=0$}}};
+\node [anchor=north] (m2) at ([yshift=0.6em]p2.south){\footnotesize{\red{Mask=0}}};
 \node [anchor=west,circle,draw,,minimum size=1em] (n22) at ([xshift=3em]f22.east){};
-\node [anchor=west] (x42) at ([xshift=1em]n22.east) {$x_{l+2}$};
+\node [anchor=west] (x42) at ([xshift=1em]n22.east) {$\mathbi{x}_{l+2}$};

 \draw[->, line width=1pt] ([xshift=-0.1em]x1.east)--(ln1.west);
 \draw[->, line width=1pt] ([xshift=-0.1em]ln1.east)--(f1.west);

--- a/Chapter15/Figures/figure-swish-function-image.png
+++ b/Chapter15/Figures/figure-swish-function-image.png
--- a/Chapter15/Figures/figure-transparent-attention-mechanism.png
+++ b/Chapter15/Figures/figure-transparent-attention-mechanism.png
--- a/Chapter15/Figures/figure-weight-visualization-of-convergence-DLCL-network.png
+++ b/Chapter15/Figures/figure-weight-visualization-of-convergence-DLCL-network.png
--- a/Chapter15/Figures/figure-whole-structure-and-internal-structure-in-rnn.png
+++ b/Chapter15/Figures/figure-whole-structure-and-internal-structure-in-rnn.png
--- a/Chapter15/chapter15.tex
+++ b/Chapter15/chapter15.tex
@@ -21,17 +21,26 @@
 %	CHAPTER 15
 %----------------------------------------------------------------------------------------

-\chapter{Network Architecture Design for Neural Machine Translation}
+\chapter{Architecture Optimization for Neural Machine Translation}

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
 %----------------------------------------------------------------------------------------

-\section{Deep Networks}
+\section{Improvements Based on the Self-Attention Mechanism}

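-\parinterval As {\chapterthirteen} points out, increasing the depth of a neural network helps represent sentences more fully and increases model capacity. However, simply stacking many Transformer layers does not improve performance; instead it aggravates the vanishing/exploding gradient problem. The reason is that, as the network deepens, gradients cannot propagate effectively from the output layer back to the bottom layers, so the parameters of the shallow part of the network are not sufficiently trained\upcite{Wang2019LearningDT,DBLP:conf/cvpr/YuYR18}. Researchers have begun to address these problems with good results, for example by designing network connections that better support information flow through deep models and by using suitable parameter initialization methods\upcite{Bapna2018TrainingDN,Wang2019LearningDT,DBLP:conf/emnlp/ZhangTS19}.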
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+
+\section{Optimizing Network Connections and Modeling Deep Networks}
+
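+\parinterval Beyond improving individual components of the Transformer model, it is also important to improve how its layers are connected. A common approach is to fuse the intermediate-layer representations of the encoder or decoder to obtain a richer encoder or decoder output\upcite{Wang2018MultilayerRF,Wang2019ExploitingSC,Dou2018ExploitingDR,Dou2019DynamicLA}. In addition, more elaborate inter-layer connections such as dense connections can strengthen or replace residual connections; such methods have worked well in tasks such as image recognition\upcite{DBLP:journals/corr/HeZRS15,DBLP:conf/cvpr/HuangLMW17} and machine translation\upcite{Bapna2018TrainingDN,Wang2018MultilayerRF,Dou2018ExploitingDR,WangLearning,Dou2019DynamicLA}.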
+
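+\parinterval Meanwhile, wide networks (such as Transformer-Big) perform very well on tasks such as machine translation and language modeling, but at the cost of a rapidly growing parameter count and higher training cost. Given the limits imposed by task complexity and hardware compute, making networks ever wider is not a particularly efficient route. As Chapter 13 of this book points out, increasing the depth of a neural network likewise helps represent sentences more fully and increases model capacity. However, simply stacking many Transformer layers does not improve performance; instead it aggravates the vanishing/exploding gradient problem. The reason is that, as the network deepens, gradients cannot propagate effectively from the output layer back to the bottom layers, so the parameters of the shallow part of the network are not sufficiently trained\upcite{Bapna2018TrainingDN,WangLearning,DBLP:journals/corr/abs-2002-04745,DBLP:conf/emnlp/LiuLGCH20}. Researchers have begun to address these problems with good results, for example by designing network connections that better support information flow through deep models\upcite{Bapna2018TrainingDN,WangLearning,Wei2020MultiscaleCD,DBLP:conf/acl/WuWXTGQLL19,li2020shallow,DBLP:journals/corr/abs-2007-06257} and by using suitable parameter initialization methods\upcite{huang2020improving,DBLP:conf/emnlp/ZhangTS19,DBLP:conf/acl/XuLGXZ20,DBLP:conf/emnlp/LiuLGCH20}.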

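-\parinterval Nevertheless, how to design a sufficiently ``deep'' machine translation model remains an actively studied problem. Moreover, as networks grow deeper, new problems arise, such as how to speed up the training of deep networks and how to mitigate their overfitting. These issues are discussed below.
+\parinterval Nevertheless, how to design a sufficiently ``deep'' machine translation model remains an actively studied problem. Moreover, as networks grow deeper, new problems arise, such as how to speed up the training of deep networks and how to mitigate their overfitting. These issues are discussed below. The discussion starts with a detailed look at the internal information flow of the Transformer model, then examines, from the perspectives of model structure and parameter initialization, why deep networks are hard to train, and introduces the corresponding remedies.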

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -39,7 +48,7 @@

 \subsection{Post-Norm vs Pre-Norm}

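-\parinterval To understand why deep Transformer models are hard to train directly, first briefly review the Transformer architecture. Taking the Transformer encoder as an example, the model applies residual connections and layer regularization to the multi-head self-attention and feed-forward sublayers to improve information flow. Transformer models fall roughly into the two structures shown in Figure \ref{fig:15-1}\ \dash \ residual units with post-normalization (Post-Norm) and residual units with pre-normalization (Pre-Norm).
+\parinterval To understand why deep Transformer models are hard to train directly, first briefly review the Transformer architecture (see {\chaptertwelve}). Taking the Transformer encoder as an example, the model applies residual connections\upcite{DBLP:journals/corr/HeZRS15} and layer normalization\upcite{Ba2016LayerN} to the multi-head self-attention and feed-forward sublayers to improve information flow. Transformer models fall roughly into the two structures shown in Figure \ref{fig:15-1}\ \dash \ residual units with post-normalization (Post-Norm) and residual units with pre-normalization (Pre-Norm).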

 %----------------------------------------------
 \begin{figure}[htp]
@@ -50,147 +59,387 @@
 \end{figure}
 %-------------------------------------------

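-\parinterval Let $x_l$ and $x_{l+1}$ denote the input and output of the $l$-th sublayer\footnote[1]{Following the Transformer definition, each layer contains several sublayers. For example, each layer of the Transformer encoder contains a self-attention sublayer and a feed-forward sublayer. Layer normalization and a residual connection are applied to every sublayer.}, and let $y_l$ denote the intermediate output; $\textrm{LN}(\cdot)$ denotes layer normalization\upcite{Ba2016LayerN}, which reduces the variance of the sublayer output distribution and thereby stabilizes training; {\red $\mathcal{F}(\cdot)$ (does this italic style of F have any special meaning?)} denotes the function computed by the sublayer, such as a feed-forward network or a self-attention network. Post-Norm and Pre-Norm are briefly described below.
+\parinterval Let $\mathbi{x}_l$ and $\mathbi{x}_{l+1}$ denote the input and output of the $l$-th sublayer\footnote[1]{Following the Transformer definition, each layer contains several sublayers. For example, each layer of the Transformer encoder contains a self-attention sublayer and a feed-forward sublayer. Layer normalization and a residual connection are applied to every sublayer.}, and let $\mathbi{y}_l$ denote the intermediate output; $\textrm{LN}(\cdot)$ denotes layer normalization\upcite{Ba2016LayerN}, which reduces the variance of the sublayer output distribution and thereby stabilizes training; $F(\cdot)$ denotes the function computed by the sublayer, such as a feed-forward network or a self-attention network. Post-Norm and Pre-Norm are briefly described below.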
 \begin{itemize}
 \vspace{0.5em}
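-\item Post-Norm: the original Transformer follows the Post-Norm structure\upcite{vaswani2017attention}. That is, layer regularization is applied to the residual sum of each sublayer's input and output, as shown in Figure \ref{fig:15-1}(a). It can be written as: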
-\begin{eqnarray}
-x_{l+1}=\textrm{LN}(x_l+\mathcal{F}(x_l;\theta_l))
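+\item Post-Norm: the original Transformer follows the Post-Norm structure\upcite{vaswani2017attention}. That is, layer normalization is applied to the residual sum of each sublayer's input and output, as shown in Figure \ref{fig:15-1}(a). It can be written as: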
+\begin{equation}
+\mathbi{x}_{l+1}=\textrm{LN}(\mathbi{x}_l+F(\mathbi{x}_l;{\bm  \theta_l}))
 \label{eq:15-1}
-\end{eqnarray}
-where $\theta_l$ denotes the parameters of sublayer $l$.
+\end{equation}
+
+\noindent where $\bm \theta_l$ denotes the parameters of sublayer $l$.
 \vspace{0.5em}
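-\item Pre-Norm: moving layer regularization so that it is applied to the input of each sublayer yields the Pre-Norm structure, as shown in Figure \ref{fig:15-1}(b). The idea is consistent with that of He et al.\upcite{DBLP:conf/eccv/HeZRS16} and is widely used in recent open-source Transformer systems\upcite{Vaswani2018Tensor2TensorFN,Ottfairseq,KleinOpenNMT}. It can be written as: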
-\begin{eqnarray}
-x_{l+1}=x_l+\mathcal{F}(\textrm{LN}(x_l);\theta_l)
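+\item Pre-Norm: moving layer normalization to the input of each sub-layer gives the Pre-Norm structure, as shown in Figure \ref{fig:15-1}(b). The idea follows that of He et al.\upcite{DBLP:conf/eccv/HeZRS16} and is widely used in recent open-source Transformer systems\upcite{Vaswani2018Tensor2TensorFN,Ottfairseq,KleinOpenNMT}. It can be written as: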
+\begin{equation}
+\mathbi{x}_{l+1}=\mathbi{x}_l+F(\textrm{LN}(\mathbi{x}_l);{\bm  \theta_l})
 \label{eq:15-2}
-\end{eqnarray}
+\end{equation}
+\vspace{0.5em}
 \end{itemize}

-\parinterval 从上述公式可以看到Pre-Norm结构可以通过残差路径将底层网络的输出直接暴露给上层网络；另一方面从反向传播的角度看，使用Pre-Norm结构，顶层的梯度可以更容易地反馈到底层网络。这里以一个含有$L$个子层的结构为例。令$Loss$表示整个神经网络输出上的损失，$x_L$为顶层的输出。对于Post-Norm结构，根据链式法则，损失$Loss$相对于$x_l$的梯度可以表示为：
-\begin{eqnarray}
-\frac{\partial Loss}{\partial x_l}=\frac{\partial Loss}{\partial x_L} \times \prod_{k=l}^{L-1}\frac{\partial \textrm{LN}(y_k)}{\partial y_k} \times \prod_{k=l}^{L-1}(1+\frac{\partial \mathcal{F}(x_k;\theta_k)}{\partial x_k})
+\parinterval 从上述公式可以看到Pre-Norm结构可以通过残差路径将底层网络的输出直接暴露给上层网络；另一方面从反向传播的角度看，使用Pre-Norm结构，顶层的梯度可以更容易地反馈到底层网络。这里以一个含有$N$个子层的结构为例。令$Loss$表示整个神经网络输出上的损失，$\mathbi{x}_N$为顶层的输出。对于Post-Norm结构，根据链式法则，损失$Loss$相对于$\mathbi{x}_l$的梯度可以表示为：
+\begin{equation}
+\frac{\partial Loss}{\partial \mathbi{x}_l}=\frac{\partial Loss}{\partial \mathbi{x}_N} \times \prod_{k=l}^{N-1}\frac{\partial \textrm{LN}(\mathbi{y}_k)}{\partial \mathbi{y}_k} \times \prod_{k=l}^{N-1}(1+\frac{\partial F(\mathbi{x}_k;{\bm \theta_k})}{\partial \mathbi{x}_k})
 \label{eq:15-3}
-\end{eqnarray}
-其中$\prod_{k=l}^{L-1}\frac{\partial \textrm{LN}(y_k)}{\partial y_k}$表示在反向传播过程中经过层正则化得到的复合函数导数，$\prod_{k=l}^{L-1}(1+\frac{\partial \mathcal{F}(x_k;\theta_k)}{\partial x_k})$代表每个子层间残差连接的导数。
+\end{equation}
+
+\parinterval 其中$\prod_{k=l}^{N-1}\frac{\partial \textrm{LN}(\mathbi{y}_k)}{\partial \mathbi{y}_k}$表示在反向传播过程中经过层标准化得到的复合函数导数，$\prod_{k=l}^{N-1}(1+\frac{\partial F(\mathbi{x}_k;{\bm \theta_k})}{\partial \mathbi{x}_k})$代表每个子层间残差连接的导数。

 \parinterval 类似的，也能得到Pre-Norm结构的梯度计算结果,如下式所示：
-\begin{eqnarray}
-\frac{\partial Loss}{\partial x_l}=\frac{\partial Loss}{\partial x_L} \times (1+\sum_{k=l}^{L-1}\frac{\partial \mathcal{F}(\textrm{LN}(x_k);\theta_k)}{\partial x_l})
+\begin{equation}
+\frac{\partial Loss}{\partial \mathbi{x}_l}=\frac{\partial Loss}{\partial \mathbi{x}_N} \times (1+\sum_{k=l}^{N-1}\frac{\partial F(\textrm{LN}(\mathbi{x}_k);{\bm \theta_k})}{\partial \mathbi{x}_l})
 \label{eq:15-4}
-\end{eqnarray}
+\end{equation}

-\parinterval 对比公式\eqref{eq:15-3}和公式\eqref{eq:15-4}可以明显发现Pre-Norm结构直接把顶层的梯度$\frac{\partial Loss}{\partial x_L}$传递给下层，也就是$\frac{\partial Loss}{\partial x_l}$中直接含有$\frac{\partial Loss}{\partial x_L}$的部分。这个性质弱化了梯度计算对模型深度$L$的依赖；而如公式\eqref{eq:15-3}右侧所示，Post-Norm结构会导致一个与$L$相关的多项导数的积，伴随着$L$的增大更容易发生梯度消失和梯度爆炸问题。因此，Pre-Norm结构更适于堆叠多层神经网络的情况。比如，使用Pre-Norm结构可以很轻松的训练一个30层（60个子层）的Transformer编码器网络，并带来可观的BLEU提升。这个结果相当于标准Transformer编码器深度的6倍\upcite{Wang2019LearningDT}。相对的，用Pre-Norm结构训练深网络的时候，训练结果很不稳定，甚至有时候无法完成有效训练。这里把使用Pre-Norm的深层Transformer称为Transformer-Deep。
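+\parinterval Comparing Equation \eqref{eq:15-3} with Equation \eqref{eq:15-4} shows that the Pre-Norm structure passes the top-layer gradient $\frac{\partial Loss}{\partial \mathbi{x}_N}$ directly to the lower layers; that is, $\frac{\partial Loss}{\partial \mathbi{x}_l}$ contains $\frac{\partial Loss}{\partial \mathbi{x}_N}$ as a standalone term. This property weakens the dependence of the gradient computation on the model depth $N$. In contrast, as the right-hand side of Equation \eqref{eq:15-3} shows, the Post-Norm structure produces a product of derivatives whose number of factors grows with $N$, so vanishing and exploding gradients become more likely as $N$ increases\footnote[2]{Similarly, recurrent neural networks are prone to vanishing and exploding gradients when the sequence is very long.}. The Pre-Norm structure is therefore better suited to stacking many layers. For example, with Pre-Norm it is easy to train a Transformer with a 30-layer encoder (60 sub-layers) and obtain a substantial BLEU improvement; this is 5 times the depth of the standard 6-layer Transformer encoder\upcite{WangLearning}. By contrast, training deep networks with the Post-Norm structure is unstable: once the encoder exceeds 12 layers, effective training becomes very difficult\upcite{WangLearning}, and the loss tends to diverge, especially on low-precision hardware. In what follows, a deep Transformer that uses Pre-Norm is called Transformer-Deep.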
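+
+\parinterval The step from Equation \eqref{eq:15-2} to Equation \eqref{eq:15-4} can be made explicit with a short sketch (it only unrolls the recursion and assumes nothing beyond the definitions above). Applying Equation \eqref{eq:15-2} repeatedly from sub-layer $l$ up to the top gives
+\begin{equation}
+\mathbi{x}_N=\mathbi{x}_l+\sum_{k=l}^{N-1}F(\textrm{LN}(\mathbi{x}_k);{\bm \theta_k})
+\nonumber
+\end{equation}
+
+\noindent Differentiating with respect to $\mathbi{x}_l$ yields an identity term plus a sum of sub-layer derivatives, and the chain rule $\frac{\partial Loss}{\partial \mathbi{x}_l}=\frac{\partial Loss}{\partial \mathbi{x}_N}\times\frac{\partial \mathbi{x}_N}{\partial \mathbi{x}_l}$ then gives Equation \eqref{eq:15-4}. The identity term is the part that carries $\frac{\partial Loss}{\partial \mathbi{x}_N}$ down unchanged, whatever the value of $N$.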

-\parinterval 另一个有趣的发现是，使用深层网络后,训练模型收敛的时间大大缩短。相比于Transformer-Big等宽网络，Transformer-Deep并不需要太大的隐藏层大小就可以取得相当甚至更优的翻译品质。也就是说，Transformer-Deep是一个更“窄”更“深”的网络。这种结构的参数量比Transformer-Big少，系统运行效率更高。表\ref{tab:15-1}对比了不同模型的参数量和训练/推断时间。
+\parinterval 另一个有趣的发现是，使用深层网络后，网络可以更有效地利用较大的学习率和batch size训练，大幅度缩短了模型达到收敛的时间。相比于Transformer-Big等宽网络，Transformer-Deep并不需要太大的隐藏层维度就可以取得相当甚至更优的翻译品质\upcite{WangLearning}。也就是说，Transformer-Deep是一个更“窄”更“深”的网络。这种结构的参数量比Transformer-Big少，系统运行效率更高。

-%----------------------------------------------
-\begin{table}[htp]
-\centering
-\caption{不同Transformer结构的训练/推断时间对比（WMT14英德任务）}
-\begin{tabular}{l | r r r}
-\rule{0pt}{15pt}     系统 & 参数量 & 训练时间 & 推断时间  \\
-\hline
-\rule{0pt}{15pt}     Base & 63M & 4.6h & 19.4s  \\
-\rule{0pt}{15pt}     Big & 210M & 36.1h & 29.3s  \\
-\rule{0pt}{15pt}     DLCL-30 & 137M & 9.8h & 24.6s  \\
-\end{tabular}
-\label{tab:15-1}
-\end{table}
-%----------------------------------------------
-
-\parinterval 还有一个有趣的发现是，当编码器端使用深层网络之后，解码器端使用更浅的网络依然能够维持相当的翻译品质。这是由于解码器端的计算仍然会有对源语言端信息的加工和抽象，当编码器变深之后，解码器对源语言端的加工不那么重要了，因此可以减少解码网络的深度。这样做的一个直接好处是：可以通过减少解码器的深度加速翻译系统。对于一些延时敏感的场景，这种架构是极具潜力的。
+\parinterval 此外研究人员发现当编码器端使用深层网络之后，解码器端使用更浅的网络依然能够维持相当的翻译品质。这是由于解码器端的计算仍然会有对源语言端信息的加工和抽象，当编码器变深之后，解码器对源语言端的加工不那么重要了，因此可以减少解码网络的深度。这样做的一个直接好处是：可以通过减少解码器的深度加速翻译系统。对于一些延时敏感的场景，这种架构是极具潜力的\upcite{DBLP:journals/corr/abs-2006-10369}。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------

-\subsection{层聚合}
+\subsection{高效信息传递}
+
+\parinterval 尽管使用Pre-Norm结构可以很容易地训练深层Transformer模型，但从信息传递的角度看，Transformer模型中第$l$层的输入仅仅依赖于前一层的输出。虽然残差连接可以跨层传递信息，但是对于很深的网络，整个模型的输入和输出之间仍需要很多次残差连接才能进行有效的传递。为了使上层的网络可以更加方便地访问下层网络的信息，一种方法是直接引入更多的跨层连接。最简单的一种方法是直接将所有层的输出都连接到最上层，达到聚合多层信息的目的\upcite{Bapna2018TrainingDN,Wang2018MultilayerRF,Dou2018ExploitingDR}。
+
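+\parinterval A more effective approach is to relate the current layer to the representations of the earlier layers during the forward computation, as in {\small\bfnew{Dynamic Linear Combination of Layers}}\index{Dynamic Linear Combination of Layers} (DLCL)\upcite{WangLearning} and dynamic layer aggregation\index{动态层聚合方法}\upcite{Dou2019DynamicLA}. Both methods take as the input of each layer not only the output of the previous layer but an aggregation of the intermediate results of all preceding layers (including the word embeddings). In essence, dense cross-layer connections improve the efficiency of information flow in the network, in both the forward computation and the backward gradient computation. DLCL uses a linear fusion of layers to keep the computation efficient, is mainly applied to training deep networks, and is theoretically equivalent to a higher-order method for solving ordinary differential equations\upcite{WangLearning}. In addition, to further strengthen the use of lower-layer representations by the upper layers, researchers have partitioned deep encoders into blocks from a multi-scale perspective and used a GRU to capture the relations between blocks, which yields higher-level representations. This method can be seen as an extension of DLCL. The following subsections discuss these improvements in turn.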
+
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsubsection{1. 使用更多的跨层连接}
+
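+\parinterval Figure \ref{fig:15-2} shows a structure with additional cross-layer connections. Suppose the encoder has $N$ layers. After the $N$ encoder layers have been computed one by one in the forward pass, the intermediate layer representations are fused, for example by linear averaging or weighted averaging, into a representation $\mathbi{g}$ that carries information from all layers. $\mathbi{g}$ is then used as the input to the encoder-decoder attention of the decoder, which has $M$ layers in total.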
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.4]{./Chapter15/Figures/figure-layer-fusion-method.png}
+\caption{层融合方法}
+\label{fig:15-2}
+\end{figure}
+%-------------------------------------------
+
+\parinterval 这里，令$\mathbi{h}_i$是第$i$层编码端的输出，$\mathbi{s}_j^k$是解码器解码第$j$个单词时第$k$层的输出。层融合机制可以大致划分为如下：

-\parinterval 尽管使用Pre-Norm结构可以很容易地训练深层Transformer模型，但从信息传递的角度看，Transformer模型中第$n$层的输入仅仅依赖于前一层的输出。虽然残差连接可以将信息跨层传递，但是对于很深的网络，整个模型的输入和输出之间仍需要很多次残差连接才能进行有效的传递。为了使上层的网络可以更加方便地访问下层网络的信息，一种方法是直接引入更多跨层的连接。最简单的一种方法是直接将所有层的输出都连接到最上层，达到聚合多层信息的目的\upcite{Bapna2018TrainingDN,Wang2018MultilayerRF}。另一种更加有效的方式是使用{\small\bfnew{动态线性层聚合方法}}\index{动态线性层聚合方法}（Dynamic Linear Combination of Layers，DLCL）\index{Dynamic Linear Combination of Layers，DLCL}。在每一层的输入中不仅考虑前一层的输出，而是将前面所有层的中间结果（包括词嵌入）进行线性聚合，理论上等价于常微分方程中的高阶求解方法\upcite{Wang2019LearningDT}。以Pre-Norm结构为例，具体做法如下：
 \begin{itemize}
 \vspace{0.5em}
-\item 对于每一层的输出$x_{l+1}$，对其进行层正则化，得到每一层的信息的表示
-\begin{eqnarray}
-z_{l}=\textrm{LN}(x_{l+1})
+\item 线性平均，即平均池化。通过对各层中间表示进行累加之后取平均值。
+\begin{equation}
+\mathbi{g}=\frac{1}{N}\sum_{l=1}^{N}{\mathbi{h}_l}
 \label{eq:15-5}
-\end{eqnarray}
-注意，$z_0$表示词嵌入层的输出，$z_l(l>0)$表示Transformer网络中最终的各层输出。
+\end{equation}
+
 \vspace{0.5em}
-\item 	定义一个维度为$(L+1)\times(L+1)$的权值矩阵$\vectorn{W}$，矩阵中每一行表示之前各层对当前层计算的贡献度，其中$L$是编码端（或解码端）的层数。令$\vectorn{W}_{l,i}$代表权值矩阵$\vectorn{W}$第$l$行第$i$列的权重，则层聚合的输出为$z_i$的线性加权和：
-\begin{eqnarray}
-g_l=\sum_{i=0}^{l}z_i\times \vectorn{W}_{l,i}
+\item 权重平均。在线性平均的基础上，赋予每一个中间层表示相应的权重。权重的值通常采用可学习的参数矩阵$\mathbi{W}$表示，通过反向传播来不断调整每一层的权重比例，通常会略优于线性平均方法。
+\begin{equation}
+\mathbi{g}=\sum_{l=1}^{N}{\mathbi{W}_l\mathbi{h}_l}
 \label{eq:15-6}
+\end{equation}
+
+\vspace{0.5em}
+\item 前馈神经网络。将之前中间层的表示进行级联，之后利用前馈神经网络得到融合的表示。
+\begin{equation}
+\mathbi{g}=\textrm{FNN}([\mathbi{h}_1,\mathbi{h}_2,\ldots,\mathbi{h}_N])
+\label{eq:15-7}
+\end{equation}
+
+\noindent 其中，$[\cdot]$表示级联操作。这种方式对比权重平均具有更复杂的线性运算，同时引入非线性变化增加网络的表示能力。
+\vspace{0.5em}
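+\item Multi-hop self-attention, whose structure is shown in Figure \ref{fig:15-3}. As with the feed-forward approach, the representations of the different layers are first concatenated into a two-dimensional, sentence-level matrix\footnote[3]{This has stronger representational power than a one-dimensional vector\upcite{DBLP:journals/corr/LinFSYXZB17}.}. Then, in the spirit of a feed-forward network, the matrix in $\mathbb{R}^{d\times N}$ is mapped to a score matrix $\mathbi{o}$ in $\mathbb{R}^{N\times n_{hop}}$, with one row $\mathbi{o}_l$ per layer: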
+\begin{equation}
+\mathbi{o}=\sigma ([\mathbi{h}_1,\mathbi{h}_2,\ldots,\mathbi{h}_N]^{T} \cdot \mathbi{W}_1)\mathbi{W}_2
+\label{eq:15-8}
+\end{equation}
+
+\noindent 其中$\mathbi{W}_1 \in \mathbb{R}^{d\times d_a}$，$\mathbi{W}_2 \in \mathbb{R}^{d_a\times n_{hop}}$。之后使用Softmax函数计算不同层在同一维度空间的归一化概率$\mathbi{u}_l$：
+\begin{equation}
+\mathbi{u}_l=\frac{\textrm{exp}(\mathbi{o}_l)}{\sum_{i=1}^{N}{\textrm{exp}(\mathbi{o}_i)}}
+\label{eq:15-9}
+\end{equation}
+
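+\noindent Finally, the layer representations are combined with these weights into a dense representation $\mathbi{v} \in \mathbb{R}^{d\times n_{hop}}$. Its $n$-th column $\mathbi{v}_n$ is the weighted sum of the layer outputs for the $n$-th hop, where $u_{l,n}$ denotes the $n$-th component of $\mathbi{u}_l$. A single-layer feed-forward network then produces the final fused representation:
+\begin{eqnarray}
+\mathbi{v}_n & = & \sum_{l=1}^{N} u_{l,n}\,\mathbi{h}_l \\
+\mathbi{g} & = & \textrm{FNN}([\mathbi{v}_1,\mathbi{v}_2,\ldots,\mathbi{v}_{n_{hop}}])
+\label{eq:15-11}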
 \end{eqnarray}
-$g_l$会作为输入的一部分送入第$l+1$层。其网络的结构如图\ref{fig:15-2}所示
+\vspace{0.5em}
+\end{itemize}
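+
+\parinterval As a sanity check on the fusion methods above, the shapes in Equation \eqref{eq:15-8} can be traced step by step. This is only a sketch that follows the dimensions stated for $\mathbi{W}_1$ and $\mathbi{W}_2$:
+\begin{equation}
+\underbrace{[\mathbi{h}_1,\ldots,\mathbi{h}_N]^{T}}_{N\times d}\cdot\underbrace{\mathbi{W}_1}_{d\times d_a}\cdot\underbrace{\mathbi{W}_2}_{d_a\times n_{hop}}\ \longrightarrow\ \underbrace{\mathbi{o}}_{N\times n_{hop}}
+\nonumber
+\end{equation}
+
+\noindent The Softmax normalizes over the $N$ layers and keeps this shape, and multiplying $[\mathbi{h}_1,\ldots,\mathbi{h}_N] \in \mathbb{R}^{d\times N}$ by the normalized weights gives a matrix in $\mathbb{R}^{d\times n_{hop}}$. Note also that weighted averaging in Equation \eqref{eq:15-6} reduces to linear averaging in Equation \eqref{eq:15-5} when every weight is fixed to $\frac{1}{N}$.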
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.7]{./Chapter15/Figures/figure-layer-fusion-method-2d.png}
+\caption{Layer fusion based on multi-hop self-attention}
+\label{fig:15-3}
+\end{figure}
+%-------------------------------------------
+
+\parinterval 上述工作更多应用于浅层的Transformer网络，这种仅在编码端顶部使用融合机制的方法并没有在深层Transformer上得到有效地验证。主要原因是融合机制仅作用于编码端或解码端的顶部，对网络中间层的计算并没有显著提升。因此当网络深度较深时，信息在前向计算和反向更新过程中的传播效率仍然有待提高，但这种“静态”的融合方式也为深层Transformer研究奠定了基础。例如研究人员提出了透明注意力网络\upcite{Bapna2018TrainingDN}，即在权重平均的基础上，引入了$(N+1)\times (M+1)$的权重矩阵，其中$N$和$M$分别代表编码端与解码端的层数。其核心思想是让解码端中每一层的编码-解码注意力网络接收到的编码端融合表示中对应不同编码层的权重是独立的，而不是共享相同的融合表示，如图\ref{fig:15-4}所示。此外二维的权重矩阵对比一维矩阵具有更强的学习能力，实验表明在英德数据集上获得了与宽网络（Transformer-Big）相当的翻译性能。
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.4]{./Chapter15/Figures/figure-transparent-attention-mechanism.png}
+\caption{透明注意力机制}
+\label{fig:15-4}
+\end{figure}
+%-------------------------------------------
+
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsubsection{2. 动态层融合}
+
+\parinterval 那如何进一步提高信息的传递效率？对比层融合方法，本节介绍的动态层融合可以更充分利用之前层的信息，其网络连接更加稠密，表示能力更强\upcite{Bapna2018TrainingDN,WangLearning,Wei2020MultiscaleCD,DBLP:conf/acl/WuWXTGQLL19,li2020shallow,DBLP:journals/corr/abs-2007-06257}。以基于Pre-Norm结构的DLCL网络中编码器为例，具体做法如下：
+
+\begin{itemize}
+\vspace{0.5em}
+\item 对于每一层的输出$\mathbi{x}_{l}$，对其进行层标准化，得到每一层的信息的表示
+\begin{equation}
+\mathbi{h}_l=\textrm{LN}(\mathbi{x}_{l})
+\label{eq:15-12}
+\end{equation}
+注意，$\mathbi{h}_0$表示词嵌入层的输出，$\mathbi{h}_l(l>0)$表示Transformer网络中最终的各层输出。
+\vspace{0.5em}
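+\item 	Define a weight matrix $\mathbi{W}$ of size $(N+1)\times (N+1)$, in which each row gives the contributions of the preceding layers to the computation of the current layer, where $N$ is the number of encoder (or decoder) layers. Let $\mathbi{W}_{l,i}$ be the weight in row $l$ and column $i$ of $\mathbi{W}$. The output of layer aggregation is the linearly weighted sum of the $\mathbi{h}_i$: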
+\begin{equation}
+\mathbi{g}_l=\sum_{i=0}^{l}\mathbi{h}_i\times \mathbi{W}_{l,i}
+\label{eq:15-13}
+\end{equation}
+$\mathbi{g}_l$ is fed into layer $l+1$ as part of its input. The network structure is shown in Figure \ref{fig:15-5}.
+\vspace{0.5em}
 \end{itemize}

 %---------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter15/Figures/figure-dynamic-linear-aggregation-network-structure}
-\caption{动态线性层聚合网络结构图}
-\label{fig:15-2}
+\caption{线性层聚合网络}
+\label{fig:15-5}
+\end{figure}
+%-------------------------------------------
+
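+\parinterval As can be seen, the weight matrix $\mathbi{W}$ is lower triangular. At the start, each row is initialized by averaging: in the initial matrix $\mathbi{W}_0$, each of the $l+1$ non-zero entries in row $l$ ($l=0,1,\ldots,N$) is set to $\frac{1}{l+1}$. During training, the network keeps updating the weights at the different positions of each row of $\mathbi{W}$ through back-propagation.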
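+
+\parinterval As a small illustration (a toy case with $N=2$ chosen here for readability, not taken from the experiments), the averaging initialization of the lower-triangular matrix gives
+\begin{equation}
+\mathbi{W}_0=\left(\begin{array}{ccc}
+1 & 0 & 0 \\
+\frac{1}{2} & \frac{1}{2} & 0 \\
+\frac{1}{3} & \frac{1}{3} & \frac{1}{3}
+\end{array}\right)
+\nonumber
+\end{equation}
+
+\noindent so that, by Equation \eqref{eq:15-13}, $\mathbi{g}_0=\mathbi{h}_0$, $\mathbi{g}_1=\frac{1}{2}(\mathbi{h}_0+\mathbi{h}_1)$ and $\mathbi{g}_2=\frac{1}{3}(\mathbi{h}_0+\mathbi{h}_1+\mathbi{h}_2)$. Every row sums to 1 at initialization, so each aggregated input starts as a plain average of the layers below it.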
+
+\parinterval 动态线性层聚合的一个好处是，系统可以自动学习不同层对当前层的贡献度。在实验中也发现，离当前层更近的部分贡献度（权重）会更大，如图\ref{fig:15-6}所示，在每一行中颜色越深代表对当前层的贡献度越大，这也是符合直觉的。
+
+\parinterval 对比上述介绍的动态层线性聚合方法，研究人员利用更为复杂的胶囊网络\upcite{Dou2019DynamicLA}，树状层次结构\upcite{Dou2018ExploitingDR}作为层间的融合方式，动态地计算每一层网络的输入。然而，研究人员发现进一步增加模型编码端的深度并不能取得更优的翻译性能。因此如何进一步突破神经网络深度的限制是值得关注的研究方向，类似的话题在图像处理领域也激起了广泛的讨论\upcite{DBLP:conf/nips/SrivastavaGS15,DBLP:conf/icml/BalduzziFLLMM17,DBLP:conf/icml/Allen-ZhuLS19,DBLP:conf/icml/DuLL0Z19}。
+
+%---------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.4]{./Chapter15/Figures/figure-weight-visualization-of-convergence-DLCL-network.png}
+\caption{对收敛的DLCL网络进行权重的可视化\upcite{WangLearning}}
+\label{fig:15-6}
 \end{figure}
 %-------------------------------------------

-\parinterval 可以看到，权值矩阵$\vectorn{W}$是一个下三角矩阵。开始时，对矩阵参数的每行进行平均初始化，即初始化矩阵$\vectorn{W}_0$的每一行各个位置的值为$1/M,M \in (1,2,3 \cdots L+1)$。 伴随着神经网络的训练，网络通过反向传播算法来不断更新$\vectorn{W}$中每一行不同位置权重的大小。
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsubsection{3. 多尺度协同网络}
+
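+\parinterval Building on dynamic linear layer aggregation (DLCL), researchers proposed the {\small\bfnew{Multiscale Collaborative Framework}}\index{多尺度协同框架}\index{Multiscale Collaborative Framework} to further strengthen the representational power of deep networks. The framework can be summarized in two parts: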

-\parinterval 动态线性层聚合的一个好处是，系统可以自动学习不同层对当前层的贡献度。在实验中也发现，离当前层更近的部分贡献度（权重）会更大，这也是符合直觉的。
+\begin{itemize}
+\vspace{0.5em}
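+\item Block-scale collaboration. Consecutive encoder or decoder layers are grouped into blocks so that the encoder and the decoder have the same number of blocks\footnote[4]{For example, a 36-layer deep network contains 6 encoder blocks and 6 decoder blocks: every 6 consecutive encoder layers form an encoder block, and each decoder layer is a decoder block on its own.}. The core idea is that the decoder uses the output of the corresponding encoder block as the Key and Value of its attention network for the cross-lingual mapping, and thus exploits information from different levels of the encoder. This also shortens the path of information flow to some extent, so the parameters of the lower encoder layers receive richer gradient information and converge to a better state.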
+\vspace{0.5em}
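+\item Contextual collaboration. Using block-scale collaboration alone to shorten the information path does not bring a significant improvement; in particular, when the network becomes extremely deep, stacking blocks cannot model long-term dependencies across layers effectively. To address this, a GRU can be used to extract contextual information across layers, which amounts to updating the GRU hidden state at each time step with the output of the corresponding encoder block. The extracted contextual information, denoted $\mathbi{g}$, is also fed into both the encoder and the decoder, where an attention network fuses the features. A gating mechanism then merges the resulting contextual representation with the corresponding encoder or decoder self-attention output.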
+\vspace{0.5em}
+\end{itemize}

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------

-\subsection{深层模型的训练加速}
+\subsection{面向深层模型的参数初始化策略}
+
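+\parinterval Training a network depends not only on a carefully designed architecture; a suitable initialization strategy is just as important. For example, the parameter matrices in the Transformer use Xavier initialization, which sets the bounds of a uniform distribution according to the input and output dimensions\upcite{pmlr-v9-glorot10a}. This initialization keeps the variance of the activations and of the back-propagated gradients consistent across layers; that is, it takes both the forward and the backward pass into account so that inputs and outputs have the same variance. Likewise, different activation functions call for suitable initialization schemes in order to make better use of complex networks. One example is Kaiming initialization\upcite{DBLP:conf/iccv/HeZRS15}, which is better suited to the ReLU activation function and is likewise based on keeping the variance consistent.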
+
+\parinterval 目前的Transformer模型初始化方式已经是精心设计过的，但该类初始化更适合于浅层网络，在深层Transformer网络训练时表现不佳\upcite{pmlr-v9-glorot10a}。近期，研究人员针对深层网络的参数初始化问题进行了广泛的探索。下面分别对比分析几个最近提出的初始化策略。
+
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsubsection{1. 基于深度缩放的初始化策略}
+
+\parinterval 为什么Transformer现有初始化策略无法满足训练深层网络？这是由于伴随着网络层数的加深，输入的特征要经过很多的线性及非线性变换，受网络中激活函数导数值域范围和连乘操作的影响，常常会带来梯度爆炸或梯度消失的问题。其根本理由是过深堆叠网络无法保证梯度在回传过程中的方差一致性，尽管在目前深层模型中所采用的很多标准化方式如层标准化、批次标准化等都是从方差一致性的维度来解决此问题，即通过将网络各层输出的取值范围控制在激活函数的梯度敏感区域，从而维持网络中梯度传递的稳定性。
+
+\parinterval 首先说明浅层Transformer模型中是如何维持梯度的方差一致性。Transformer中在初始化线性变换矩阵$\mathbi{W}$的参数时采用了Xavier初始化方法保障网络在不断优化过程中方差的问题，其方式为从一个均匀分布中进行随机采样：
+\begin{equation}
+\mathbi{W} \in \mathbb{R}^{d_i\times d_o} \sim u(-\gamma,\gamma),\gamma=\sqrt{\frac{6}{d_i+d_o}}
+\label{eq:15-16}
+\end{equation}
+
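+\noindent where $\mathbi{W}$ is a parameter matrix of the network, and $d_i$ and $d_o$ are the input and output dimensions of the linear transformation. This keeps the variances of the input and the output consistent in both the forward and the backward pass\upcite{DBLP:conf/iccv/HeZRS15}. The reason is that, in the matrix operation, the output of a neuron is $\mathbi{Z}=\sum_{i=1}^n{\mathbi{w}_i \mathbi{x}_i}$, where $n$ is the number of neurons in the previous layer. For two independent random variables, the variance of their product expands as: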
+\begin{equation}
+\textrm{Var}(\mathbi{w}_i \mathbi{x}_i) = E[\mathbi{w}_i]^2 \textrm{Var}(\mathbi{x}_i) + E[\mathbi{x}_i]^2 \textrm{Var}(\mathbi{w}_i) + \textrm{Var}(\mathbi{w}_i)\textrm{Var}(\mathbi{x}_i)
+\label{eq:15-17}
+\end{equation}
+
+\parinterval 在大多数情况下，基于各种标准化手段可以维持$E[\mathbi{w}_i]^2$和$E[\mathbi{x}_i]^2$等于或者近似为0，因此输出的方差：
+\begin{equation}
+\textrm{Var}(\mathbi{Z}) = \sum_{i=1}^n{\textrm{Var}(\mathbi{x}_i) \textrm{Var}(\mathbi{w}_i)} = n\textrm{Var}(\mathbi{W})\textrm{Var}(\mathbi{X})
+\label{eq:15-18}
+\end{equation}
+
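+\parinterval Therefore, when $\textrm{Var}(\mathbi{W})=\frac{1}{n}$, the distributions of the input and output spaces do not differ too much. This gives $\textrm{Var}(\mathbi{W})=\frac{1}{d_i}$ for the forward pass and $\textrm{Var}(\mathbi{W})=\frac{1}{d_o}$ for the backward pass. Replacing $n$ by the average of $d_i$ and $d_o$ sets the variance of $\mathbi{W}$ to $\frac{2}{d_i+d_o}$, which keeps the variances of the input and the output consistent in both the forward and the backward pass. Since a parameter drawn uniformly from $[a,b]$ has variance $\frac{(b-a)^2}{12}$, the bound is set to $\sqrt{\frac{6}{d_i+d_o}}$ at initialization in order to reach the target variance.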
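+
+\parinterval The last step can be checked with a short calculation. This is a sketch that only combines the uniform-variance formula with the target variance stated above; the dimensions $d_i=d_o=512$ are an illustrative assumption. For the symmetric interval $[-\gamma,\gamma]$,
+\begin{equation}
+\frac{(\gamma-(-\gamma))^2}{12}=\frac{\gamma^2}{3}=\frac{2}{d_i+d_o}\quad\Longrightarrow\quad\gamma=\sqrt{\frac{6}{d_i+d_o}}
+\nonumber
+\end{equation}
+
+\noindent With $d_i=d_o=512$, this gives $\gamma=\sqrt{6/1024}\approx 0.0765$ and a parameter variance of $2/1024\approx 0.00195$.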
+
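+\parinterval However, as the network gets deeper, researchers found that the parameters obtained from the above initialization constrain the variance of each layer's output less and less in a Post-Norm Transformer. When the network is stacked deep, the variance of the top-layer output is large, and the gradient norm in back-propagation is also larger at the top than at the bottom. A natural idea is therefore to initialize the parameter matrices of different layers differently according to depth, which tightens the constraint on the output variance of each layer: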
+\begin{equation}
+\mathbi{W} \in \mathbb{R}^{d_i\times d_o} \sim u(-\gamma \frac{\alpha}{\sqrt{l}},\gamma \frac{\alpha}{\sqrt{l}})
+\label{eq:15-19}
+\end{equation}
+
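+\noindent where $l$ is the depth of the corresponding layer and $\alpha$ is a preset hyper-parameter that controls the scaling ratio. Lowering the output variance of the layers in this way narrows the gap between the output and input distributions of the top layers and reduces the gradient norm of the top-layer parameters. This alleviates the vanishing-gradient problem caused by stacking too many layers and allows deep networks to be trained stably.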
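+
+\parinterval The effect of the depth-dependent bound on the parameter variance can be sketched with the same uniform-variance formula (the numbers below are illustrative assumptions, not settings from the cited work). For a layer at depth $l$,
+\begin{equation}
+\textrm{Var}(\mathbi{W})=\frac{1}{12}\Big(\frac{2\gamma\alpha}{\sqrt{l}}\Big)^2=\frac{\alpha^2}{l}\cdot\frac{\gamma^2}{3}
+\nonumber
+\end{equation}
+
+\noindent so the variance shrinks in proportion to $\frac{1}{l}$. For instance, with $\alpha=1$ the parameters of layer $l=4$ start with one quarter of the variance used for layer $l=1$.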
+
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsubsection{2. 基于Lipschitz的初始化策略}
+
+\parinterval 在前面已经介绍过，在Pre-Norm结构中每一个子层的输出$\mathbi{x}_{l+1}^{pre}=\mathbi{x}_l+\mathbi{y}_l$，其中$\mathbi{x}_l$为当前子层的输入， $\mathbi{y}_l$为经过自注意力层或前馈神经网络层计算后得到的子层输出。在Post-Norm结构中，在残差连接之后还要进行层标准化操作，具体的计算流程为：
+
+\begin{itemize}
+\vspace{0.5em}
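+\item Compute the mean of the input: ${\bm  \mu}=\textrm{mean}(\mathbi{x}_l+\mathbi{y}_l)$
+\vspace{0.5em}
+\item Compute the standard deviation of the input: ${\bm  \sigma}=\textrm{std}(\mathbi{x}_l+\mathbi{y}_l)$
+\vspace{0.5em}
+\item Rescale the input using the mean and the standard deviation, where $\mathbi{w}$ and $\mathbi{b}$ are learnable parameters that further adjust the mean and the variance to suitable values and improve the representational power of the network: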
+\begin{equation}
+\mathbi{x}_{l+1}^{post}=\frac{\mathbi{x}_l+\mathbi{y}_l-{\bm  \mu}}{\bm  \sigma} \cdot \mathbi{w}+\mathbi{b}
+\label{eq:15-20}
+\end{equation}
+\noindent 将其展开后可得：
+\begin{equation}
+\mathbi{x}_{l+1}^{post}=\frac{\mathbi{w}}{\bm  \sigma} \cdot \mathbi{x}_{l+1}^{pre}-\frac{\mathbi{w}}{\bm  \sigma} \cdot {\bm  \mu}+\mathbi{b}
+\label{eq:15-21}
+\end{equation}
+\vspace{0.5em}
+\end{itemize}
+
+\parinterval 可以看到相比于Pre-Norm的计算方式，基于Post-Norm的Transformer中子层的输出为Pre-Norm形式的$\frac{\mathbi{w}}{\bm  \sigma}$倍，当$\frac{\mathbi{w}}{\bm  \sigma}<1.0$时，使残差层的输出较小，输入与输出分布之间差异过大，导致深层Transformer系统难以收敛。因此基于Lipschitz 的初始化策略通过维持条件$\frac{\mathbi{w}}{\bm  \sigma}>1.0$，保证网络输入与输出范数一致，进而缓解梯度消失的问题\upcite{DBLP:conf/acl/XuLGXZ20}。一般情况下，$\mathbi{w}$可以被初始化为1，因此Lipschitz Initialization最终的约束条件则为：
+\begin{equation}
+0.0<{\bm  \sigma}=\textrm{std}(\mathbi{x}_l+\mathbi{y}_l) \leq 1.0
+\label{eq:15-22}
+\end{equation}
+
+\parinterval 为了实现这个目标，可以限制$a \leq \mathbi{x}_l+\mathbi{y}_l \leq a + \Delta a$，在经过推导后可以发现，只要$\Delta a \leq 1.0$即可满足此条件。
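+
+\parinterval One way to see why a bounded range is enough is the following sketch. It relies on Popoviciu's inequality for the variance of a bounded variable, which is brought in here as a supporting fact and is not necessarily the derivation used in the cited work. If every component of $\mathbi{x}_l+\mathbi{y}_l$ lies in $[a,a+\Delta a]$, then
+\begin{equation}
+{\bm \sigma}^2\leq\frac{(\Delta a)^2}{4}\quad\Longrightarrow\quad{\bm \sigma}\leq\frac{\Delta a}{2}
+\nonumber
+\end{equation}
+
+\noindent Hence $\Delta a\leq 1.0$ gives ${\bm \sigma}\leq 0.5$, which satisfies the upper bound in Equation \eqref{eq:15-22} with some margin. The lower bound ${\bm \sigma}>0$ only requires that the components are not all identical.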
+
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsubsection{3. T-Fixup初始化策略}
+
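+\parinterval Another initialization method starts from the network structure and the way the optimizer computes its updates. By visualizing the parameter norms during training, it was found that a Post-Norm network cannot estimate the second moment accurately in the early stage of training (the warmup stage), which makes training unstable\upcite{huang2020improving}. This finding agrees with that of other researchers\upcite{WangLearning}: layer normalization is the main obstacle to optimizing deep Transformers. It also highlights that the lower layers of a Post-Norm Transformer, in particular the word-embedding layer of the encoder, suffer severely from vanishing gradients. The cause is that, with the position of layer normalization unchanged, the Adam optimizer estimates the second moment with a moving average whose variance is unbounded. In the early stage the model has seen only a limited number of samples, so the second moment is hard to estimate well, and the parameter updates then have excessively large variance. Instead of replacing Post-Norm with Pre-Norm to train deep networks, one can remove the warmup strategy and layer normalization, and apply a dedicated scaling scheme to the different parameter matrices in the network to keep training stable\upcite{huang2020improving}. The scaling strategy is as follows: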
+
+\begin{itemize}
+\vspace{0.5em}
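+\item As in the standard Transformer, use Xavier initialization for all parameter matrices except the word embeddings. The word-embedding matrix follows the Gaussian distribution $\mathcal{N}(0,d^{-\frac{1}{2}})$, where $d$ is the embedding dimension.
+\vspace{0.5em}
+\item In the encoder, scale $\mathbi{W}^V$ and $\mathbi{W}^o$ of the self-attention network and all parameter matrices of the feed-forward network by a factor of $0.67N^{-\frac{1}{4}}$, where $N$ is the number of encoder layers and $\mathbi{W}^o$ is the output projection matrix of the attention operation (see {\chaptertwelve}).
+\vspace{0.5em}
+\item In the decoder, scale $\mathbi{W}^V$ and $\mathbi{W}^o$ of the attention networks and all parameter matrices of the feed-forward network by a factor of $(9M)^{-\frac{1}{4}}$, where $M$ is the number of decoder layers.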
+\vspace{0.5em}
+\end{itemize}
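+
+\parinterval To give a feel for the size of these scaling factors, the following worked values assume that the symbol inside each factor is the number of layers of the corresponding stack; the layer counts 6 and 24 are illustrative choices:
+\begin{eqnarray}
+\textrm{6 layers:} & & 0.67\times 6^{-\frac{1}{4}}\approx 0.43,\qquad (9\times 6)^{-\frac{1}{4}}\approx 0.37 \nonumber \\
+\textrm{24 layers:} & & 0.67\times 24^{-\frac{1}{4}}\approx 0.30,\qquad (9\times 24)^{-\frac{1}{4}}\approx 0.26 \nonumber
+\end{eqnarray}
+
+\noindent The factors decay only with the fourth root of the depth, so quadrupling the number of layers shrinks them by a factor of about $\sqrt{2}$.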
+
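+\parinterval Because this initialization method uses no warmup, the learning rate is annealed directly from its peak as a function of the number of parameter updates, which considerably lengthens the time to convergence. Its main contribution is to show that, when training cost is not a concern, a well-chosen initialization lets a shallow Transformer reach better translation performance. How to speed up convergence under this initialization remains an important open problem.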
+
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsubsection{4. ADMIN初始化策略}
+
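+\parinterval Earlier work concluded that deep networks cannot be trained well because gradients vanish in the lower layers; some researchers take a different view. They argue that a Post-Norm network relies too heavily on the residual branch during training, so the variance of the parameter gradients easily becomes too large early in training\upcite{DBLP:conf/emnlp/LiuLGCH20}. They also show, both theoretically and empirically, that vanishing gradients in the lower layers are an important cause of unstable training but not the only one. By pairing Post-Norm and Pre-Norm encoders with Post-Norm and Pre-Norm decoders in all combinations, the authors found that vanishing gradients arise mainly in the Post-Norm decoder. Moreover, even when the vanishing-gradient problem is fixed by changing the network structure, training remains unstable. Analysing the input and output of Pre-Norm networks further, the researchers found that the variance changes between input and output at a rate of $O(\log N)$. To keep a Post-Norm network from relying too heavily on the residual branch early in training, the authors proposed a two-stage initialization method that indirectly keeps the change in variance between input and output within $O(\log N)$. The residual connection between sub-layers is given by Equation \eqref{eq:15-23}: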
+\begin{equation}
+\mathbi{x}_{l+1}=\mathbi{x}_l \cdot {\bm  \omega_{l+1}} + F_{l+1}(\mathbi{x}_l)
+\label{eq:15-23}
+\end{equation}

-\parinterval 尽管训练这种窄而深的神经网络对比宽网络有更快的收敛速度，但伴随着训练数据的增加，以及模型进一步的加深，神经网络的训练代价成为不可忽视的问题。例如，在几千万甚至上亿的双语平行语料上训练一个48层的Transformer模型需要将近几周的时间能达到收敛\footnote[2]{训练时间的估算是在单台8卡Titan V GPU服务器上得到的}。因此，在保证模型精度不变的前提下如何高效地完成深层网络的训练也是至关重要的。在实践中能够发现，深层网络中相邻层之间具有一定的相似性。因此，一个想法是：能否通过不断复用浅层网络的参数来初始化更深层的网络，渐进式的训练深层网络，避免从头训练整个网络，进而达到加速深层网络训练的目的。
+\noindent 其两阶段的初始化方法如下所示：
+
+\begin{itemize}
+\vspace{0.5em}
+\item Profiling stage: set ${\bm  \omega_{l+1}} = 1$ and run only the forward pass, with no gradient computation. The variance of $F_{l+1}(\mathbi{x}_l)$ is computed on the training samples.
+\vspace{0.5em}
+\item 	Initialization阶段：通过Profiling阶段得到的$F_{l+1}(\mathbi{x}_l)$的方差来初始化$\bm  \omega_{l+1}$：
+\begin{equation}
+{\bm  \omega_{l+1}} = \sqrt{\sum_{j\leq l}\textrm{Var}[F_{j}(\mathbi{x}_{j-1})]}
+\label{eq:15-24}
+\end{equation}
+\vspace{0.5em}
+\end{itemize}
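+
+\parinterval A toy calculation illustrates the second stage. It assumes that the sum accumulates the profiled variances of the residual branches below the current sub-layer, and that each of these $k$ branches has the same variance $v$; both $k$ and $v$ are hypothetical symbols introduced here:
+\begin{equation}
+{\bm \omega}=\sqrt{\underbrace{v+v+\cdots+v}_{k}}=\sqrt{kv}
+\nonumber
+\end{equation}
+
+\noindent For example, $v=1$ and $k=4$ give ${\bm \omega}=2$. Under this assumption the weight on the shortcut grows with the square root of the depth, so higher sub-layers rely less on their own residual branch at the start of training.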
+
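+\parinterval In this way the researchers successfully trained deep networks. Moreover, this dynamic parameter initialization method is not tied to a particular model architecture and is more stable.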

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------

-\subsection{渐进式训练}
+\subsection{深层网络的训练加速}
+
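+\parinterval Although such narrow and deep networks converge faster than wide networks\upcite{WangLearning}, the training cost becomes a problem that cannot be ignored as the training data grow and models get even deeper. For example, training a 48-layer Transformer on tens or even hundreds of millions of bilingual sentence pairs takes several weeks to converge\footnote[5]{The training time was estimated on a single server with 8 Titan V GPUs.}. It is therefore crucial to train deep networks efficiently without losing model accuracy. In practice, adjacent layers in a deep network turn out to be similar to some extent. This suggests an idea: repeatedly reuse the parameters of a shallow network to initialize a deeper one and train the deep network progressively, rather than training the whole network from scratch, so as to speed up the training of deep networks.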
+
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsubsection{1. 渐进式训练}

 \parinterval 所谓渐进式训练是指从浅层网络开始，在训练过程中逐渐增加训练的深度。一种比较简单的方法是将网络分为浅层部分和深层部分，之后分别进行训练，最终达到提高模型翻译性能的目的\upcite{DBLP:conf/acl/WuWXTGQLL19}。

-\parinterval 另一种方式是动态构建深层网络，并尽可能复用浅层网络的训练结果。假设开始的时候模型包含$h$层网络，然后训练这个模型至收敛。之后，直接拷贝这$h$层网络（包括参数），并堆叠出一个$2h$层的模型。之后继续训练，重复这个过程。进行$n$次之后就得到了$n\times h$层的模型。图\ref{fig:15-3}给出了在编码端使用渐进式训练的示意图。
+\parinterval 另一种方式是动态构建深层网络，并尽可能复用浅层网络的训练结果。假设开始的时候模型包含$l$层网络，然后训练这个模型至收敛。之后，直接拷贝这$l$层网络（包括参数），并堆叠出一个$2l$层的模型。之后继续训练，重复这个过程。进行$n$次之后就得到了$n\times l$层的模型。图\ref{fig:15-9}给出了在编码端使用渐进式训练的示意图。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter15/Figures/figure-progressive-training}
 \caption{渐进式深层网络训练过程}
-\label{fig:15-3}
+\label{fig:15-9}
 \end{figure}
 %-------------------------------------------

 \parinterval 渐进式训练的好处在于深层模型并不是从头开始训练。每一次堆叠，都相当于利用“浅”模型给“深”模型提供了一个很好的初始点，这样深层模型的训练会更加容易。

 %----------------------------------------------------------------------------------------
-%    NEW SUB-SECTION
+%    NEW SUBSUB-SECTION
 %----------------------------------------------------------------------------------------

-\subsection{分组稠密连接}
+\subsubsection{2. 分组稠密连接}

-\parinterval 很多研究者已经发现深层网络不同层之间的稠密连接能够很明显地提高信息传递的效率\upcite{Wang2019LearningDT,DBLP:conf/cvpr/HuangLMW17,Dou2018ExploitingDR,DBLP:conf/acl/WuWXTGQLL19}。与此同时，对之前层信息的不断复用有助于得到更好的表示，但随之而来的是网络计算代价过大的问题。由于动态线性层聚合方法（DLCL）在每一次聚合时都需要重新计算之前每一层表示对当前层网络输入的贡献度，因此伴随着编码端整体深度的不断增加，这部分的计算代价变得不可忽略。例如，一个基于动态层聚合的48层Transformer模型的训练时间比不使用动态层聚合慢近1.9倍。同时，缓存中间结果也增加了显存的使用量，尽管使用了FP16计算，每张12G显存的GPU上计算的词也不能超过2048个，这导致训练开销急剧增大。
+\parinterval 很多研究者已经发现深层网络不同层之间的稠密连接能够很明显地提高信息传递的效率\upcite{WangLearning,DBLP:conf/cvpr/HuangLMW17,Dou2018ExploitingDR,DBLP:conf/acl/WuWXTGQLL19}。与此同时，对之前层信息的不断复用有助于得到更好的表示，但随之而来的是网络计算代价过大的问题。由于动态线性层聚合方法（DLCL）在每一次聚合时都需要重新计算之前每一层表示对当前层网络输入的贡献度，因此伴随着编码端整体深度的不断增加，这部分的计算代价变得不可忽略。例如，一个基于动态层聚合的48层Transformer模型的训练时间比不使用动态层聚合慢近1.9倍。同时，缓存中间结果也增加了显存的使用量，尽管使用了FP16计算，每张12G显存的GPU上计算的词也不能超过2048个，这导致训练开销急剧增大。

-\parinterval 缓解这个问题的一种方法是使用更稀疏的层间连接方式。其核心思想与动态线性层聚合是类似的，不同点在于可以通过调整层之间连接的稠密程度来降低训练代价。比如，可以将每$p$层分为一组，之后动态线性层聚合只在不同组之间进行。这样，通过调节$p$值的大小可以控制网络中连接的稠密程度，作为一种训练代价与翻译性能之间的权衡。显然，标准的Transformer模型\upcite{vaswani2017attention}和DLCL模型\upcite{Wang2019LearningDT}都可以看作是该方法的一种特例。如图\ref{fig:15-4}所示：当$p=1$时，每一个单独的块被看作一个独立的组，这等价于基于动态层聚合的DLCL模型；当$p=\infty$时，这等价于正常的Transformer模型。值得注意的是，如果配合渐进式训练。在分组稠密连接中可以设置$p=h$。
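+\parinterval One way to ease this problem is to use sparser cross-layer connections. The core idea is similar to dynamic linear layer aggregation; the difference is that the training cost can be lowered by adjusting how dense the connections between layers are. For example, every $p$ layers can be put into one group, and dynamic linear layer aggregation is then applied only between groups. The value of $p$ thus controls the density of connections in the network and serves as a trade-off between training cost and translation performance. Clearly, both the standard Transformer\upcite{vaswani2017attention} and the DLCL model\upcite{WangLearning} are special cases of this method. As shown in Figure \ref{fig:15-10}, when $p=1$ each block is treated as a separate group, which is equivalent to the DLCL model based on dynamic layer aggregation; when $p=\infty$ the model is equivalent to the ordinary Transformer. Note that, when combined with progressive training, one can set $p=l$ in grouped dense connections, where $l$ is the number of layers copied at each stacking step.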
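+
+\parinterval The saving can be illustrated with a rough count. This sketch assumes that aggregation between groups keeps the lower-triangular weight layout of Equation \eqref{eq:15-13}, with one row per group plus one for the word embeddings; the group count $G$ is a symbol introduced here. For an encoder with 48 layers and $G=48/p$ groups, the number of aggregation weights is
+\begin{equation}
+\frac{(G+1)(G+2)}{2}=\left\{\begin{array}{ll}
+1225, & p=1\ (G=48) \\
+45, & p=6\ (G=8)
+\end{array}\right.
+\nonumber
+\end{equation}
+
+\noindent Under this assumption, grouping every 6 layers reduces the number of aggregation weights, and the number of cached intermediate representations that enter each aggregation, by more than an order of magnitude.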

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter15/Figures/figure-sparse-connections-between-different-groups}
 \caption{不同组之间的稀疏连接}
-\label{fig:15-4}
+\label{fig:15-10}
 \end{figure}
 %-------------------------------------------

 %----------------------------------------------------------------------------------------
-%    NEW SUB-SECTION
+%    NEW SUBSUB-SECTION
 %----------------------------------------------------------------------------------------

-\subsection{学习率重置策略}
+\subsubsection{3. 学习率重置}

 \parinterval 尽管渐进式训练策略与分组稠密连接结构可以加速深层网络的训练，但使用传统的学习率衰减策略会导致堆叠深层模型时的学习率较小，因此模型无法快速地达到收敛状态，同时也影响最终的模型性能。

@@ -201,7 +450,7 @@ $g_l$会作为输入的一部分送入第$l+1$层。其网络的结构如图\ref
 \centering
 \input{./Chapter15/Figures/figure-learning-rate}
 \caption{学习率重置vs从头训练的学习率曲线}
-\label{fig:15-5}
+\label{fig:15-11}
 \end{figure}
 %-------------------------------------------

@@ -209,18 +458,19 @@ $g_l$会作为输入的一部分送入第$l+1$层。其网络的结构如图\ref
 \begin{itemize}
 \vspace{0.5em}
 \item 在训练的初期，模型先经历一个学习率预热的过程：
-\begin{eqnarray}
+\begin{equation}
 lr=d_{model}^{-0.5}\cdot step\_num \cdot warmup\_steps^{-1.5}
-\label{eq:15-7}
-\end{eqnarray}
-这里，$step\_num$表示参数更新的次数，$warmup\_step$表示预热的更次次数，$d_{model}^{-0.5}$表示Transformer模型隐层大小，$lr$是学习率。
+\label{eq:15-25}
+\end{equation}
+\noindent Here $step\_num$ is the number of parameter updates, $warmup\_steps$ is the number of warmup updates, $d_{model}$ is the hidden size of the Transformer model, and $lr$ is the learning rate.
 \vspace{0.5em}
 \item 	在之后的迭代训练过程中，每当进行新的迭代，学习率都会重置到峰值，之后进行相应的衰减：
-\begin{eqnarray}
+\begin{equation}
 lr=d_{model}^{-0.5}\cdot step\_num^{-0.5}
-\label{eq:15-8}
-\end{eqnarray}
-这里$step\_num$代表学习率重置后更新的步数。
+\label{eq:15-26}
+\end{equation}
+\noindent 这里$step\_num$代表学习率重置后更新的步数。
+\vspace{0.5em}
 \end{itemize}

 \parinterval 综合使用渐进式训练、分组稠密连接、学习率重置策略可以在保证翻译品质不变的前提下，缩减近40\%的训练时间（40层编码器）。同时，加速比伴随着模型的加深与数据集的增大会进一步地扩大。
@@ -229,65 +479,342 @@ lr=d_{model}^{-0.5}\cdot step\_num^{-0.5}
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------

-\subsection{深层模型的鲁棒性训练}
+\subsection{深层网络的鲁棒性训练}
+

 \parinterval 伴随着网络的加深，还会面临另外一个比较严峻的问题\ \dash \ 过拟合。由于参数量的增大，深层网络的输入与输出分布之间的差异也会越来越大，然而不同子层之间的{\small\bfnew{相互适应}}\index{相互适应}（Co-adaptation）\index{Co-adaptation}也会更加的明显，导致任意子层网络对其他子层的依赖过大。这对于训练阶段是有帮助的，因为不同子层可以协同工作从而更好地拟合训练数据。然而这种方式也降低了模型的泛化能力，即深层网络更容易陷入过拟合问题。

-\parinterval 通常，可以使用Dropout手段用来缓解过拟合问题（见{\chapterthirteen}）。不幸的是,尽管目前Transformer模型使用了多种Dropout手段（如Residual Dropout、Attention Dropout、 ReLU Dropout等），过拟合问题在深层网络中仍然存在。从图\ref{fig:15-6}中可以看到，深层网络对比浅层网络在训练集和校验集的困惑度上都有显著的优势，然而网络在训练一段时间后出现校验集困惑度上涨的现象，说明模型已经过拟合于训练数据。
+\parinterval 通常，可以使用Dropout手段用来缓解过拟合问题（见{\chapterthirteen}）。不幸的是,尽管目前Transformer模型使用了多种Dropout手段（如Residual Dropout、Attention Dropout、 ReLU Dropout等），过拟合问题在深层网络中仍然存在。从图\ref{fig:15-12}中可以看到，深层网络对比浅层网络在训练集和校验集的困惑度上都有显著的优势，然而网络在训练一段时间后出现校验集困惑度上涨的现象，说明模型已经过拟合于训练数据。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter15/Figures/figure-wmt16}
 \caption{浅层网络(左)与深层网络（右）在WMT16英德的校验集与训练集的困惑度}
-\label{fig:15-6}
+\label{fig:15-12}
 \end{figure}
 %-------------------------------------------

 \parinterval 在{\chapterthirteen}提到的Layer Dropout方法可以有效地缓解这个问题。以编码端为例， Layer Dropout的过程可以被描述为：在训练过程中，对自注意力子层或前馈神经网络子层进行随机丢弃，以减少不同子层之间的相互适应。这里选择Pre-Norm结构作为基础架构，它可以被描述为：
-\begin{eqnarray}
-x_{l+1}=\mathcal{F}(\textrm{LN}(x_l))+x_l
-\label{eq:15-9}
-\end{eqnarray}
-其中$\textrm{LN}( \cdot )$表示层正则化函数， $\mathcal{F}( \cdot )$表示自注意力机制或者前馈神经网络，$x_l$表示第$l$个子层的输出。之后，使用一个掩码$M$（值为0或1）来控制每一子层是正常计算还是丢弃。于是，该子层的计算公式可以被重写为：
-\begin{eqnarray}
-x_{l+1}=M \cdot \mathcal{F}(\textrm{LN}(x_l))+x_l
-\label{eq:15-10}
-\end{eqnarray}
-$M=0$代表该子层被丢弃，而$M=1$代表正常进行当前子层的计算。图\ref{fig:15-7}展示了这个方法与标准Pre-Norm结构之间的区别。
+\begin{equation}
+\mathbi{x}_{l+1}=F(\textrm{LN}(\mathbi{x}_l)) + \mathbi{x}_l
+\label{eq:15-27}
+\end{equation}
+
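+\noindent where $\textrm{LN}( \cdot )$ is the layer normalization function, $F( \cdot )$ is the self-attention mechanism or the feed-forward network, and $\mathbi{x}_l$ is the input of the $l$-th sub-layer. A mask $\textrm{Mask}$ (with value 0 or 1) then controls whether each sub-layer is computed normally or dropped. The computation of the sub-layer can be rewritten as: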
+\begin{equation}
+\mathbi{x}_{l+1}=\textrm{Mask} \cdot F(\textrm{LN}(\mathbi{x}_l))+\mathbi{x}_l
+\label{eq:15-28}
+\end{equation}
+
+\noindent $\textrm{Mask}=0$代表该子层被丢弃，而$\textrm{Mask}=1$代表正常进行当前子层的计算。图\ref{fig:15-13}展示了这个方法与标准Pre-Norm结构之间的区别。
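+
+\parinterval The average behaviour of a randomly dropped sub-layer can be written down in one line. This is a sketch that assumes the mask is sampled independently of the input, with a keep probability $p_l=P(\textrm{Mask}=1)$; the symbol $p_l$ is introduced here for illustration only:
+\begin{equation}
+E[\mathbi{x}_{l+1}\,|\,\mathbi{x}_l]=p_l\cdot F(\textrm{LN}(\mathbi{x}_l))+\mathbi{x}_l
+\nonumber
+\end{equation}
+
+\noindent With $p_l=1$ this is the standard Pre-Norm sub-layer, and with $p_l=0$ the sub-layer becomes an identity mapping through the residual connection, so the network stays well defined however many sub-layers are dropped.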

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter15/Figures/figure-sublayer-skip}
 \caption{标准的Pre-Norm结构与基于随机跳跃子层的Pre-Norm结构}
-\label{fig:15-7}
+\label{fig:15-13}
 \end{figure}
 %-------------------------------------------

-\parinterval 除此之外，有研究者已经发现残差网络中底层的子网络通过对输入进行抽象得到的表示对最终的输出有很大的影响，上层网络通过对底层网络得到的表示不断修正来拟合训练目标\upcite{DBLP:journals/corr/GreffSS16}。该结论同样适用于Transformer模型，比如，在训练中，残差支路以及底层的梯度范数通常比较大，这也间接表明底层网络在整个优化的过程中需要更大的更新。考虑到这个因素，在设计每一个子层被丢弃的概率时可以采用自底向上线性增大的策略，保证底层的网络相比于顶层更容易保留下来。这里用$L$来代表编码端块的个数，$l$代表当前的子层的编号，那么$M$可以通过以下的方式得到：
+\parinterval 除此之外，有研究者已经发现残差网络中底层的子网络通过对输入进行抽象得到的表示对最终的输出有很大的影响，上层网络通过对底层网络得到的表示不断修正来拟合训练目标\upcite{DBLP:journals/corr/GreffSS16}。该结论同样适用于Transformer模型，比如，在训练中，残差支路以及底层的梯度范数通常比较大，这也间接表明底层网络在整个优化的过程中需要更大的更新。考虑到这个因素，在设计每一个子层被丢弃的概率时可以采用自底向上线性增大的策略，保证底层的网络相比于顶层更容易保留下来。
+
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+
+\section{基于结构搜索的翻译模型优化}
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsection{神经网络结构搜索}
+
+\parinterval 目前为止，对模型的很多改良都来自于研究人员自身的经验及灵感。从某种意义上说，很多时候，模型结构的优化依赖于研究人员对任务的理解以及自身的想象力，同时所设计出的模型结构还需要在对应任务上进行实验。优秀的模型往往需要很长时间的探索与验证。因此，人们希望在无需过多外部干预的情况下，让计算机自动地找到最适用于当前任务的神经网络模型结构，这种方法被称作{\small\bfnew{神经架构搜索}}\index{神经架构搜索}（Neural Architecture Search）\index{Neural Architecture Search}，在神经网络模型中有时也被称作{\small\bfnew{神经网络结构搜索}}\index{神经网络结构搜索}或{\small\bfnew{网络结构搜索}}\index{网络结构搜索}\upcite{DBLP:conf/iclr/ZophL17,DBLP:conf/cvpr/ZophVSL18,Real2019AgingEF}。
+
+\parinterval 网络结构搜索属于{\small\bfnew{自动机器学习}}\index{自动机器学习}（Automated Machine Learning）\index{Automated Machine Learning}的范畴，其目的在于根据对应任务上的数据找到最合适的模型结构。在这个过程中，模型结构就像传统神经网络中的模型参数一样自动地被学习出来。以机器翻译任务为例，通过网络结构搜索的方法能够在Transformer模型的基础上对神经网络结构进行自动优化，找到更适用于机器翻译任务的模型结构。图\ref{fig:15-14}(a) 给出传统人工设计的Transformer模型编码器中若干层的结构，图\ref{fig:15-14}(b) 给出该结构经过进化算法优化后的编码器中相应的结构\upcite{DBLP:conf/icml/SoLL19}。可以看到，网络结构搜索系统得到的模型中，出现了与传统人工设计的Transformer结构不同的跨层连接，同时还搜索到了全新的多分支网络结构，而这种结构是人工不易设计出来的。
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.1]{./Chapter15/Figures/figure-encoder-structure-of-transformer-model-optimized-by-nas.jpg}
+\caption{传统Transformer以及通过网络结构搜索方法优化后的Transformer模型编码器结构}
+\label{fig:15-14}
+\end{figure}
+%-------------------------------------------
+
+\parinterval 那么网络结构搜索究竟是一种什么样的技术呢？实际上，人类一直希望能够更加自动化地、更加快速地解决自己所遇到的问题。而机器学习也是为了达到这个目的所产生的技术。如图\ref{fig:15-15}所示，机器学习方法可以看做是一个黑盒模型，这个模型能够根据人类所提供的输入自动给出所期望的输出，这里的输入和输出既可以是图像信息，也可以是自然语言领域中的文字。在传统机器学习方法中，研究者需要设计大量的特征来描述待解决的问题，即“特征工程”。在深度学习时代，神经网络模型可以进行特征的抽取和学习，但是随之而来的是需要人工设计神经网络结构，这项工作仍然十分繁重。因此一些科研人员开始思考，能否将模型结构设计的工作也交由机器自动完成？深度学习方法中模型参数能够通过梯度下降等方式进行自动优化，那么模型结构是否也可以看做是一种特殊的模型参数，使用搜索算法根据不同任务的数据自动找到最适用于当前任务的模型结构？
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.6]{./Chapter15/Figures/figure-evolution-and-change-of-ml-methods.jpg}
+\caption{机器学习方法的演化与变迁}
+\label{fig:15-15}
+\end{figure}
+%-------------------------------------------
+
+\parinterval 于是，就有了网络结构搜索这个研究方向。早在上世纪八十年代，研究人员就已经在论文中使用进化算法对神经网络结构进行了设计\upcite{DBLP:conf/icga/MillerTH89}，之后也有很多研究人员沿着该思路继续对基于进化算法的结构搜索进行了探索\upcite{mandischer1993representation,koza1991genetic,DBLP:conf/ijcnn/Dodd90,DBLP:conf/nips/HarpSG89,DBLP:journals/compsys/Kitano90,DBLP:conf/icec/SantosD94}。近些年，随着深度学习技术的发展，神经网络结构搜索这个方向也重新走入更多人的视线中，受到了来自计算机视觉、自然语言处理多个领域的关注与应用。
+
+\parinterval 虽然目前网络结构搜索的技术尚且处于相对初级的阶段，不过在近些年该方向已经在很多任务中崭露头角。例如，在WMT19国际机器翻译比赛中有参赛单位使用了基于梯度的结构搜索方法对翻译模型进行改进\upcite{DBLP:conf/nips/LuoTQCL18}。此外，在语言建模等任务中也有结构搜索的大量应用，取得了很好的结果\upcite{DBLP:conf/icml/PhamGZLD18,DBLP:conf/iclr/LiuSY19,DBLP:conf/acl/LiHZXJXZLL20,DBLP:conf/emnlp/JiangHXZZ19}。下面将对结构搜索的基本方法和其在机器翻译中的应用进行介绍。
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsection{结构搜索的基本方法}
+
+\parinterval 对于网络结构搜索而言，其目标在于通过数据驱动的方式自动地找到最适用于指定任务的模型结构。以机器翻译这类有监督任务为例，对于给定的具有$n$个训练样本的训练集合$\{(\mathbi{x}_{1},\mathbi{y}_{1}),\ldots,(\mathbi{x}_{n},\mathbi{y}_{n})\}$（其中$\mathbi{x}_{i}$表示的是第$i$个样本的输入数据，$\mathbi{y}_{i}$表示该样本的目标标签值），网络结构搜索的过程可以被建模为根据数据找到最佳模型结构$\hat{a}$的过程，如下所示：
+\begin{equation}
+\hat{a} = \arg\max_{a}\sum_{i=1}^{n}{\funp{P}(\mathbi{y}_{i}|\mathbi{x}_{i};a)}
+\label{eq:15-29}
+\end{equation}
+
+\noindent 公式中$\funp{P}(\mathbi{y}_{i}|\mathbi{x}_{i};a)$为模型$a$观察到数据$\mathbi{x}_{i}$后预测标签为$\mathbi{y}_{i}$的概率，而模型结构$a$本身可以看作是输入$\mathbi{x}$到输出$\mathbi{y}$的映射函数。因此可以简单地把模型$a$看作根据$\mathbi{x}$预测$\mathbi{y}$的一个函数，记为：
+\begin{equation}
+a = \funp{P}(\cdot|\mathbi{x};a)
+\label{eq:15-30}
+\end{equation}
+
+\noindent 其中，$\funp{P}(\cdot|\mathbi{x};a)$表示在所有译文句子上的一个分布。图\ref{fig:15-16}展示了神经网络结构搜索方法的主要流程，主要包括三个部分：
+
+\begin{itemize}
+\vspace{0.5em}
+\item 设计搜索空间：理论上来说网络结构搜索应在所有潜在的模型结构所组成的空间中进行搜索（图\ref{fig:15-16}）。在这种情况下如果不对候选模型结构进行限制的话，搜索空间会十分巨大。因此，在实际的结构搜索过程中往往会针对特定任务设计一个搜索空间，这个搜索空间是全体结构空间的一个子集，之后的搜索过程将在这个子空间中进行。如图\ref{fig:15-16}例子中的搜索空间所示，该空间由循环神经网络构成，其中候选的模型包括人工设计的LSTM、GRU等模型结构，也包括其他潜在的循环神经网络结构。
+\vspace{0.5em}
+\item 	选择搜索策略：在设计好搜索空间之后，结构搜索的过程将选择一种合适的策略对搜索空间进行探索，找到最适用于当前任务的模型结构。不同于模型参数的学习，模型结构之间本身不存在直接可计算的关联，所以很难通过传统的最优化算法对其进行学习。因此，搜索策略往往选择采用遗传算法或强化学习等方法间接对模型结构进行设计或优化\upcite{DBLP:conf/icml/SoLL19,DBLP:conf/aaai/RealAHL19,DBLP:conf/icml/RealMSSSTLK17,DBLP:conf/iclr/ElskenMH19,DBLP:conf/iclr/ZophL17,DBLP:conf/cvpr/ZophVSL18,DBLP:conf/icml/PhamGZLD18,DBLP:conf/iclr/BakerGNR17,DBLP:conf/cvpr/TanCPVSHL19,DBLP:conf/iclr/LiuSVFK18}。不过近些年来也有研究人员开始尝试将模型结构建模为超网络中的参数，这样即可使用基于梯度的方式直接对最优结构进行搜索\upcite{DBLP:conf/nips/LuoTQCL18,DBLP:conf/iclr/LiuSY19,DBLP:conf/iclr/CaiZH19,DBLP:conf/cvpr/LiuCSAHY019,DBLP:conf/cvpr/WuDZWSWTVJK19,DBLP:conf/iclr/XieZLL19,DBLP:conf/uai/LiT19,DBLP:conf/cvpr/DongY19,DBLP:conf/iclr/XuX0CQ0X20,DBLP:conf/iclr/ZelaESMBH20,DBLP:conf/iclr/MeiLLJYYY20}。
+\vspace{0.5em}
+\item 	进行性能评估：在搜索到模型结构之后需要对这种模型结构的性能进行验证，确定当前时刻找到的模型结构性能优劣。但是对于结构搜索任务来说，在搜索的过程中将产生大量中间模型结构，如果直接对所有可能的结构进行评价，其时间代价是难以接受的。因此在结构搜索任务中也有很多研究人员尝试如何快速获取模型性能（绝对性能或相对性能）\upcite{DBLP:conf/nips/LuoTQCL18,DBLP:journals/jmlr/LiJDRT17,DBLP:conf/eccv/LiuZNSHLFYHM18}。
+\vspace{0.5em}
+\end{itemize}
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.22]{./Chapter15/Figures/figure-main-flow-of-neural-network-structure-search.png}
+\caption{神经网络结构搜索的主要流程}
+\label{fig:15-16}
+\end{figure}
+%-------------------------------------------
+
+\parinterval 下面将对网络结构搜索任务中搜索空间、搜索策略以及性能评估几个方向进行简单介绍。
+
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsubsection{1. 搜索空间}
+
+\parinterval 对搜索空间建模是结构搜索任务中非常基础的一部分。如图\ref{fig:15-17}所示，结构空间中包含着所有潜在的模型结构，虽然可以假设这些模型结构都是等概率出现的，不过当众多模型结构面对同一个任务的时候，模型本身就具有了不同的性能表现。例如，卷积神经网络非常适合处理图像数据，而类似Transformer这类的模型结构在自然语言处理领域中可能更具优势。此外，由于不同网络结构之间往往存在局部结构上的复用，因此在结构空间中不同结构之间存在着距离上的远近，如图\ref{fig:15-17}中基于自注意力机制的模型结构往往聚在一起，而基于循环神经网络和基于卷积神经网络的各类模型之间的距离也相对较近。因此，在设计搜索空间的时候，很重要的一点在于，根据经验或者实验确定对当前任务而言更容易产出高性能模型结构的区域，将这个区域作为结构搜索任务中的搜索空间则更有可能找到最优结构。以自然语言处理任务为例，最初的网络结构搜索工作主要对由循环神经网络构成的搜索空间进行探索\upcite{DBLP:conf/iclr/ZophL17,DBLP:conf/icml/PhamGZLD18,DBLP:conf/iclr/LiuSY19}，而近些年针对Transformer模型结构的研究工作也越来越多地引起研究人员的关注\upcite{DBLP:conf/icml/SoLL19,DBLP:journals/taslp/FanTXQLL20,DBLP:conf/ijcai/ChenLQWLDDHLZ20,DBLP:conf/acl/WangWLCZGH20}。
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.6]{./Chapter15/Figures/figure-relationship-between-structures-in-structural-space.jpg}
+\caption{结构空间内结构之间的关系}
+\label{fig:15-17}
+\end{figure}
+%-------------------------------------------
+
+\parinterval 在设计搜索空间的时候很重要的一个问题在于如何表示一个网络结构。在目前的结构搜索方法中，通常将模型结构分为整体框架和内部结构（元结构）两部分。整个模型结构由整体框架将若干内部结构的输出按照特定的方式组织起来，得到最终的模型输出。如图\ref{fig:15-18}所示，以循环神经网络模型结构为例，黄色部分即为内部结构，红色部分为整体框架，负责对循环单元进行组织。
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.3]{./Chapter15/Figures/figure-whole-structure-and-internal-structure-in-rnn.png}
+\caption{循环神经网络模型中的整体框架和内部结构}
+\label{fig:15-18}
+\end{figure}
+%-------------------------------------------
+
+\parinterval 具体来说，内部结构负责的是如何计算出循环神经网络计算过程中$t$时刻的循环单元输出$\mathbi{h}_t$，如下所示：
+\begin{equation}
+\mathbi{h}_t = \pi(\hat{\mathbi{h}}_{t-1},\hat{\mathbi{x}}_t)
+\label{eq:15-31}
+\end{equation}
+
+\parinterval 其中函数$\pi(\cdot)$即为结构表示中的内部结构，而对于循环单元之间的组织方式（即整体框架）则决定了循环单元的输入信息，也就是上式中的循环单元表示$\hat{\mathbi{h}}_{t-1}$和输入表示$\hat{\mathbi{x}}_{t}$。理论上二者均能获得对应时刻之前所有可以获得的表示信息，因此可表示为：
 \begin{eqnarray}
-M = \left\{\begin{array}{ll}
-0&P \leqslant p_l\\
-1&P > p_l
-\end{array}\right.
-\label{eq:15-11}
+\hat{\mathbi{h}}_{t-1} &=& f(\mathbi{h}_{[0,t-1]};\mathbi{x}_{[1,t-1]}) \\ 
+\hat{\mathbi{x}}_t &=& g(\mathbi{x}_{[1,t]};\mathbi{h}_{[0,t-1]})
+\label{eq:15-33}
 \end{eqnarray}
-其中，$P$是服从伯努利分布的随机变量，$p_l$指每一个子层被丢弃的概率，具体计算方式如下：
+
+\noindent 其中$\mathbi{h}_{[0,t-1]} = \{\mathbi{h}_0,\ldots,\mathbi{h}_{t-1}\}$，$\mathbi{x}_{[1,t-1]} = \{\mathbi{x}_1,\ldots,\mathbi{x}_{t-1}\}$，$\mathbi{x}_{[1,t]}$的定义与之类似，函数$f(\cdot)$和$g(\cdot)$即为循环神经网络模型中的整体框架。
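+
+\parinterval 作为一个简单的示意（这里假设整体框架为最基本的循环神经网络，该例子仅用于说明上式的含义，并非某个具体搜索系统的设定），函数$f(\cdot)$和$g(\cdot)$可以只保留最近一个时刻的信息，即：
+\begin{eqnarray}
+\hat{\mathbi{h}}_{t-1} &=& \mathbi{h}_{t-1} \nonumber \\
+\hat{\mathbi{x}}_t &=& \mathbi{x}_t \nonumber
+\end{eqnarray}
+
+\noindent 此时内部结构的计算退化为$\mathbi{h}_t = \pi(\mathbi{h}_{t-1},\mathbi{x}_t)$，这正是标准循环单元的形式。如果允许$f(\cdot)$同时使用更早时刻的状态（如$\mathbi{h}_{t-2}$），就相当于在整体框架中引入了跨时刻的连接。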
+
+\parinterval 可以看到，整体框架和内部结构共同组成了神经网络的模型结构。换句话说，确定了搜索过程中整体框架以及内部结构所包含的候选结构，也就确定了搜索空间。
+
+\begin{itemize}
+\vspace{0.5em}
+\item 整体框架：如图\ref{fig:15-17}所示，不同任务下不同结构往往会表现出不同的建模能力，而类似的结构在结构空间中又相对集中，因此在搜索空间的设计中，整体框架部分一般根据不同任务特点选择已经得到验证的经验性结构，通过这种方式能够快速定位到更有潜力的搜索空间。如对于图像任务来说，一般会将卷积神经网络设计为候选搜索空间\upcite{DBLP:conf/iclr/ElskenMH19,DBLP:conf/icml/PhamGZLD18,DBLP:conf/iclr/LiuSY19,DBLP:conf/eccv/LiuZNSHLFYHM18,DBLP:conf/icml/CaiYZHY18}，而对于包括机器翻译在内的自然语言处理任务而言，则会更倾向于使用循环神经网络或基于自注意力机制的Transformer模型附近的结构空间作为搜索空间\upcite{DBLP:conf/icml/SoLL19,DBLP:conf/iclr/ZophL17,DBLP:conf/icml/PhamGZLD18,DBLP:conf/iclr/LiuSY19,DBLP:journals/taslp/FanTXQLL20,DBLP:conf/ijcai/ChenLQWLDDHLZ20,DBLP:conf/acl/WangWLCZGH20}。此外，也可以拓展搜索空间以覆盖更多网络结构\upcite{DBLP:conf/acl/LiHZXJXZLL20}。
+\vspace{0.5em}
+\item 	内部结构：由于算力限制，网络结构搜索的任务通常使用经验性的架构作为模型的整体框架，之后通过对搜索到的内部结构进行堆叠得到完整的模型结构。而对于内部结构的设计需要考虑到搜索过程中的最小搜索单元以及搜索单元之间的连接方式，最小搜索单元指的是在结构搜索过程中可被选择的最小独立计算单元（或被称为搜索算子、操作），在不同搜索空间的设计中，最小搜索单元的颗粒度各有不同，相对较小的搜索粒度主要包括诸如矩阵乘法、张量缩放等基本数学运算\upcite{DBLP:journals/corr/abs-2003-03384}，中等粒度的搜索单元包括例如常见的激活函数，如ReLU、Tanh等\upcite{DBLP:conf/iclr/LiuSY19,DBLP:conf/acl/LiHZXJXZLL20,Chollet2017XceptionDL}，同时在搜索空间的设计上也有研究人员倾向于选择较大颗粒度的局部结构作为搜索单元，如注意力机制、层标准化等人工设计的经验性结构\upcite{DBLP:conf/icml/SoLL19,DBLP:conf/nips/LuoTQCL18,DBLP:journals/taslp/FanTXQLL20}。不过，对于搜索颗粒度的问题，目前还缺乏有效的方法针对不同任务进行自动优化。
+\vspace{0.5em}
+\end{itemize}
+
+\parinterval 实际上，使用“整体+局部”的两层搜索空间表示模型的原因在于：问题过于复杂，无法有效地遍历原始的搜索空间。如果存在足够高效的搜索策略，搜索空间的表示也可能会发生变化，比如，直接对任意的网络结构使用统一的表示方式。理论上讲，这样的搜索空间可以涵盖更多的潜在模型结构。
+
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsubsection{2. 搜索策略}
+
+\parinterval 在定义好搜索空间之后，如何进行网络结构的搜索也同样重要。该过程被称为搜索策略的设计，其主要目的为根据已找到的模型结构计算出下一个最有潜力的模型结构。为保证模型有效性，在一些方法中也会引入外部知识（如经验性的模型结构或张量运算规则）对搜索过程中的结构进行剪枝。目前常见的搜索策略包括基于进化算法、强化学习、贝叶斯优化以及梯度的方法，不同的搜索策略一般也和搜索空间中结构的建模方式相关。
+
+\begin{itemize}
+\vspace{0.5em}
+\item 进化算法{\red 检查这些词是不是第一次提到}：最初主要通过进化算法对神经网络中的模型结构以及权重参数进行优化\upcite{DBLP:conf/icga/MillerTH89,DBLP:journals/tnn/AngelineSP94,stanley2002evolving,DBLP:journals/alife/StanleyDG09}。而随着最优化算法的发展，近年来对于网络参数的学习更多地采用梯度下降法的方式，不过使用进化算法对模型结构进行优化却依旧被沿用至今\upcite{DBLP:conf/aaai/RealAHL19,DBLP:conf/icml/RealMSSSTLK17,DBLP:conf/iclr/ElskenMH19,DBLP:conf/ijcai/SuganumaSN18,Real2019AgingEF,DBLP:conf/iclr/LiuSVFK18,DBLP:conf/iccv/XieY17}。目前主流的方式主要是将模型结构看做是遗传算法中种群的个体，通过使用轮盘赌或锦标赛等抽取方式对种群中的结构进行取样作为亲本，之后通过亲本模型的突变产生新的模型结构，最终对这些新的模型结构进行适应度评估{\red （见XXX节）}，根据模型结构在校验集上性能表现确定是否能够将其加入种群，整个过程如图\ref{fig:15-19}所示。对于进化算法中结构的突变主要指的是对模型中局部结构的改变，如增加跨层连接、替换局部操作等。
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.25]{./Chapter15/Figures/figure-structure-search-based-on-evolutionary-algorithm.png}
+\caption{基于进化算法的结构搜索}
+\label{fig:15-19}
+\end{figure}
+%-------------------------------------------
+
+\vspace{0.5em}
+\item 	强化学习：除了使用进化算法对网络结构进行学习之外，近些年随着强化学习在各领域中的广泛应用，研究人员也逐渐开始将这种方法引入到神经网络的结构学习中来。例如，研究人员将网络结构的设计看做是序列生成任务，使用字符序列对网络结构进行表述，通过由强化学习指导训练出的循环神经网络对该模型结构序列（目标任务网络结构）进行预测，从而为目标任务生成提供高效的网络结构\upcite{DBLP:conf/iclr/ZophL17}。基于强化学习的结构搜索方法过程如图\ref{fig:15-20}所示，其中主体可以看做是一个模型结构的生成模型，用于产生当前状态下从主体角度上看最适用于当前任务的模型结构，强化学习中的动作在这里指的是由主体{\red （强化学习里经常用Agent，是这里的主体吗？？？Agent一般被翻译为智能体）}产生一个模型结构，而环境对应着模型将要应用在的任务，当环境得到了模型结构后，环境将输出当前任务下该模型的输出以及对输出结果的评价，二者分别对应强化学习中的状态和奖励，这两个信息将反馈给主体，让结构生成器对该状态下生成的模型结果有一个清晰的了解，从而对自身结构生成的模式进行调整，然后继续生成更优的模型结构。对于基于强化学习的结构搜索策略来说，不同研究工作的差异主要集中在如何表示生成网络的过程以及如何对结构生成器进行优化\upcite{DBLP:conf/iclr/ZophL17,DBLP:conf/cvpr/ZophVSL18,DBLP:conf/iclr/BakerGNR17,DBLP:conf/cvpr/ZhongYWSL18}。
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.25]{./Chapter15/Figures/figure-structure-search-based-on-reinforcement-learning.png}
+\caption{基于强化学习的结构搜索}
+\label{fig:15-20}
+\end{figure}
+%-------------------------------------------
+
+\vspace{0.5em}
+\item 	梯度方法：在基于进化算法和强化学习的结构搜索策略中，往往将模型结构看作是离散空间中的若干点，搜索策略通过诸如“进化”或“激励”等方式促进种群或者结构生成器产生更适用于对应任务的模型结构，这种方式相对而言并未直接对模型结构进行优化。不同于这些方法，有研究人员尝试在连续空间中对模型结构进行表示\upcite{DBLP:conf/iclr/LiuSY19}，这种方式将模型结构建模为超网络中的权重，通过使用基于梯度的最优化方法对权重进行优化最终达到搜索结构的目的，如图\ref{fig:15-21}所示。这种方式相对进化算法以及强化学习方法而言更加直接，在搜索过程中的算力消耗以及时间消耗上也更有优势，因此也吸引了很多研究人员对基于梯度的结构搜索策略进行不断探索\upcite{DBLP:conf/cvpr/WuDZWSWTVJK19,DBLP:conf/iclr/XuX0CQ0X20,DBLP:conf/acl/LiHZXJXZLL20,DBLP:conf/emnlp/JiangHXZZ19}。
+\vspace{0.5em}
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.25]{./Chapter15/Figures/figure-structure-search-based-on-gradient-method.png}
+\caption{基于梯度方法的结构搜索}
+\label{fig:15-21}
+\end{figure}
+%-------------------------------------------
+
+\end{itemize}
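+
+\parinterval 以基于梯度的方法为例，下面给出连续化表示的一个简化示意（其中候选操作集合$\mathcal{O}$、结构权重$\alpha_o$等记号是为说明而引入的假设，具体形式因方法而异）。设两个结点之间有若干候选操作$o \in \mathcal{O}$，可以用Softmax对它们进行加权混合：
+\begin{eqnarray}
+\bar{o}(\mathbi{x}) &=& \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o)}{\sum_{o' \in \mathcal{O}}\exp(\alpha_{o'})} \cdot o(\mathbi{x}) \nonumber
+\end{eqnarray}
+
+\noindent 这样，结构权重$\alpha_o$与普通的模型参数一样可以通过梯度下降进行更新。搜索结束后，通常在每个位置上保留权重最大的操作，即$\hat{o} = \arg\max_{o \in \mathcal{O}} \alpha_o$，从而得到离散的模型结构。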
+
+\parinterval 除了上述提到的基于进化算法、强化学习以及梯度的方法之外，结构搜索策略还有很多其他的方式，例如基于贝叶斯优化的方法、基于随机搜索的方法。贝叶斯优化的方法在搜索结构超参数的任务中表现优异，能够在给定模型结构基础上找到最适用于当前任务的超参数\upcite{DBLP:conf/icml/BergstraYC13,DBLP:conf/ijcai/DomhanSH15,DBLP:conf/icml/MendozaKFSH16,DBLP:journals/corr/abs-1807-06906}，而随机搜索的方法也在一些结构搜索任务中引起众多研究人员的注意\upcite{DBLP:conf/uai/LiT19,li2020automated,DBLP:conf/cvpr/BenderLCCCKL20}。
+
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsubsection{3. 性能评估}
+
+\parinterval 在结构搜索的过程中会产生大量中间结构，因此如何快速获得这些结构的性能优劣也是网络结构搜索任务中的关键一环。通常有三类方法进行模型性能的快速评估。
+
+\begin{itemize}
+\vspace{0.5em}
+\item 数据以及超参数的调整：一种常见的方法是从数据和超参数的角度简化模型训练难度。具体来说，可采取的策略包括使用更少量、更容易建模的数据作为训练集合\upcite{DBLP:conf/aistats/KleinFBHH17,DBLP:journals/corr/ChrabaszczLH17}，如图像任务中使用低分辨率数据对模型进行训练及评估，在自然语言处理任务中也可以选择更容易建模的口语领域数据进行实验。在超参数的方面也可以通过减少模型训练轮数、减少模型的层数、减少每层中神经元数量等方式来简化模型参数，达到加速训练、评估的目的\upcite{DBLP:conf/cvpr/ZophVSL18,Real2019AgingEF,DBLP:journals/corr/abs-1807-06906}。上述两种方法虽然并不能准确地对模型绝对性能进行评价，但是其结果已经能够对搜索起到一定指示作用，并帮助结构搜索策略对结构生成的过程进行调优。
+\vspace{0.5em}
+\item 	现有参数的继承及复用：另一类方法希望从训练过程的角度出发，让中间过程产生的模型结构能够在现有的模型参数基础上进行继续优化，从而快速达到收敛状态进行性能评估\upcite{DBLP:conf/icml/RealMSSSTLK17,DBLP:conf/iclr/ElskenMH19,DBLP:conf/icml/CaiYZHY18,DBLP:conf/aaai/CaiCZYW18,DBLP:conf/iclr/ElskenMH18}。这种方式无需从头训练中间结构，通过“热启动”的方式对模型参数进行优化，大幅减少性能评估过程中的时间消耗。此外对于前文提到的基于梯度的结构搜索方法，由于将众多候选模型结构建模在同一个超网络中，因此在完成对超网络的参数优化的时候，其子模型的模型参数也得到优化，通过这种共享参数的方式也能够快速对网络结构性能进行评估\upcite{DBLP:conf/icml/PhamGZLD18,DBLP:conf/iclr/XieZLL19,DBLP:conf/iclr/LiuSY19,DBLP:conf/iclr/CaiZH19,DBLP:conf/nips/SaxenaV16,DBLP:conf/icml/BenderKZVL18}。
+\vspace{0.5em}
+\item 	模型性能的预测：模型性能预测也是一个具有潜力的加速性能评估过程的方法，这种方式旨在通过少量训练过程中的性能变化曲线来预估模型是否具有潜力，从而快速终止低性能模型的训练过程，节约更多训练时间\upcite{DBLP:conf/ijcai/DomhanSH15,DBLP:conf/iclr/KleinFSH17,DBLP:conf/iclr/BakerGRN18}。除了根据训练过程中的曲线进行模型性能的预测之外，也有研究人员根据局部结构的性能来对整体结构性能表现进行预测，这种方式也能更快速地了解结构搜索过程中中间结构的表示能力\upcite{DBLP:conf/eccv/LiuZNSHLFYHM18}。
+\vspace{0.5em}
+\end{itemize}
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsection{机器翻译任务下的结构搜索}
+
+\parinterval 目前来说，网络结构搜索的方法在包括图像、自然语言处理等领域中方兴未艾，不过对于自然语言处理的任务来说，更多的是在语言建模、命名实体识别等简单任务上进行的尝试\upcite{DBLP:conf/acl/LiHZXJXZLL20,DBLP:conf/emnlp/JiangHXZZ19}。同时，大多工作更多是在基于循环神经网络的模型结构上进行探索，相较目前在机器翻译领域中广泛使用的Transformer模型结构来说，它们在性能表现上并没有体现出绝对优势。此外，由于机器翻译任务的复杂性，对基于Transformer的机器翻译模型的结构搜索方法会更少一些。不过这部分工作依旧在机器翻译任务上得到了很好的表现。例如，在WMT19机器翻译比赛中，神经网络结构优化方法在多个任务上取得了很好的成绩\upcite{DBLP:conf/nips/LuoTQCL18,DBLP:conf/wmt/XiaTTGHCFGLLWWZ19}。对于结构搜索在机器翻译领域的应用目前主要包括两个方面，分别是对模型性能的改进以及模型效率的优化：
+
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsubsection{1. 模型性能改进}
+
+\parinterval 结构搜索任务中一个非常重要的目标在于找到更加适用于当前任务的模型结构。目前来看，有两种思路：1）对模型中局部结构的搜索；2）对局部结构组合方式的优化。
+
+\begin{itemize}
+\vspace{0.5em}
+\item 搜索模型中的局部结构：在机器翻译任务中，一种典型的局部模型结构搜索方法是面向激活函数的搜索\upcite{DBLP:conf/iclr/RamachandranZL18}。该方法将激活函数看作是简单函数的若干次复合使用，图\ref{fig:15-22}是激活函数结构的一个示例，其中核心单元由两个输入、两个一元函数（如绝对值、幂方运算等）和一个二元函数（如乘法、取最大值运算等）组成。
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.5]{./Chapter15/Figures/figure-activation-function-swish-structure-diagram.png}
+\caption{激活函数结构图}
+\label{fig:15-22}
+\end{figure}
+%-------------------------------------------
+
+\noindent 还有方法将为深层神经网络找到更适合的激活函数作为搜索的目标\upcite{DBLP:conf/iclr/RamachandranZL18}，通过基于强化学习的搜索策略对激活函数空间进行探索，找到了若干新的激活函数，之后通过对这些激活函数在包括图像分类、机器翻译等任务上进行实验，确定了Swish激活函数在深层神经网络上的有效性，函数公式如下式所示，函数曲线如图\ref{fig:15-23}所示。{\red 下面公式中x的形式}
 \begin{eqnarray}
-p_l=\frac{l}{2L}\cdot \varphi
-\label{eq:15-12}
+f(x) &=& x \cdot \delta(\beta x) \\
+\delta(z) &=& {(1 + \exp{(-z)})}^{-1}
+\label{eq:15-35}
 \end{eqnarray}
-这里，$1 \leqslant l \leqslant 2L$ ，且$\varphi$是预先设定的超参数。

-\parinterval 在Layer Dropout中，一个由$2L$个子层构成的残差网络，其顶层的输出相当于是$2^{2L}$个子网络的聚合结果。通过随机丢弃$n$个子层，则会屏蔽掉$2^n$个子网络的输出，将子网络的总体数量降低至$2^{2L-n}$。如图\ref{fig:15-8}所示的残差网络展开图，当有3个子层时，从输入到输出共存在8条路径，当删除子层sublayer2后，从输入到输出路径的路径则会减少到4条。
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.5]{./Chapter15/Figures/figure-swish-function-image.png}
+\caption{Swish函数图像}
+\label{fig:15-23}
+\end{figure}
+%-------------------------------------------
+
+\noindent Swish函数相比传统人工设计的ReLU而言在多个机器翻译的测试集上具有更优的性能表现，同时不要求对原本模型结构进行更多修改，非常容易实现。
+
+\vspace{0.5em}
+\item 	搜索模型中局部结构的组合：在基于Transformer模型的网络结构搜索任务中，对于局部结构的组合方式的学习也受到了很多关注，其中包括基于进化算法的方法以及基于梯度对现有Transformer模型结构进行改良的方式\upcite{DBLP:conf/icml/SoLL19,DBLP:journals/taslp/FanTXQLL20}。这类方法不同于前文所述的对局部结构的改良，更多地是希望利用现有经验性的局部结构进行组合，找到最佳的整体结构。在模型结构的表示方法上，这些方法会根据先验知识为搜索单元设定一个部分框架，如每当信息传递过来之后先进行层标准化，之后再对候选位置上的操作使用对应的搜索策略进行搜索。另外这类方法也会在Transformer结构中引入多分支的支持，一个搜索单元的输出可以被多个后续单元所使用，通过这种方式有效扩大了结构搜索过程中的搜索空间，能够在现有Transformer结构基础上找到更优的模型结构。对模型中结构组合方式的学习如图\ref{fig:15-24}所示。
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.25]{./Chapter15/Figures/figure-learning-of-local-structure-combination.png}
+\caption{局部结构组合方式的学习}
+\label{fig:15-24}
+\end{figure}
+%-------------------------------------------
+
+\vspace{0.5em}
+\end{itemize}
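+
+\parinterval 对于前面提到的Swish函数，可以通过几个特殊情况来理解超参数$\beta$的作用（以下推导仅依据上文给出的函数定义，数值为近似值）：
+\begin{eqnarray}
+\beta = 0 &:& f(x) = x \cdot \delta(0) = \frac{x}{2} \nonumber \\
+\beta = 1 &:& f(1) = \delta(1) \approx 0.731,\ \ f(-1) = -\delta(-1) \approx -0.269 \nonumber \\
+\beta \to \infty &:& f(x) \to \max(0,x) \nonumber
+\end{eqnarray}
+
+\noindent 也就是说，$\beta$越大，Swish函数越接近ReLU；当$\beta$为有限的正数时，函数在负半轴上仍保留非零的输出。此外，由$\delta'(z) = \delta(z)(1-\delta(z))$可得其导数为$f'(x) = \delta(\beta x) + \beta x \cdot \delta(\beta x)(1-\delta(\beta x))$，因此该函数处处可导。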
+
+\parinterval 此外对于模型结构中超参数的自动搜索同样能够有效提升模型的性能表现\upcite{DBLP:journals/corr/abs-2009-02070}。Transformer模型虽然已经在机器翻译任务中获得了非常优异的性能，不过如果希望稳定训练该模型还需要很多超参数上经验的累积，与此同时，不同翻译任务在使用Transformer模型的时候，也需要对超参数以及局部结构进行相应的调整（如层标准化的位置，注意力操作的过程中是否进行缩放，注意力操作的头数，激活函数等），因此面向超参数的搜索同样能够帮助研究人员快速找到最适用的模型结构。
+
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsubsection{2. 模型效率优化}
+
+\parinterval 网络结构搜索的方法除了能够应用于对机器翻译模型性能进行改进之外，也能够用来对模型执行效率进行优化。从实用的角度出发，一些研究人员尝试将设备的计算能力提供给结构搜索方法进行参考，希望能够找到适合于对应设备算力的模型结构。同时也有一些研究人员专注于对大规模的模型进行压缩，提升其在推断过程中的效率，这方面的工作不仅限于在机器翻译模型上，也有部分工作对基于注意力机制的预训练模型进行压缩。
+
+\begin{itemize}
+\vspace{0.5em}
+\item 面向特定设备的模型结构优化：随着终端设备算力的日益增强，在小设备上直接进行机器翻译的需求也日益增大，简单地对Transformer模型的超参数进行削减能够让翻译模型在这些低算力的设备上运行，不过研究人员仍然希望得到更适用于当前设备的翻译模型。因此一些研究人员开始尝试使用结构搜索的方式对现有Transformer模型进行改良，在结构优化的过程中将设备上的算力条件作为一个约束，为不同硬件设备（如CPU、GPU等设备）发现更加有效的结构\upcite{DBLP:conf/acl/WangWLCZGH20}。例如可以将搜索空间中各种基于Transformer结构的变体建模在同一个超网络中，通过权重共享的方式进行训练。使用硬件算力约束训练得到的子模型，通过进化算法对子模型进行搜索，搜索到适用于目标硬件的模型结构，整个过程如图\ref{fig:15-25}所示。通过该方法搜索到的模型能够在保证机器翻译模型性能的前提下获得较大的效率提升。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
-\input{./Chapter15/Figures/figure-expanded-residual-network}
-\caption{Layer Dropout中残差网络的展开图}
-\label{fig:15-8}
+\input{./Chapter15/Figures/figure-model-structure-optimization-framework-for-specific-equipment.tex}
+\caption{面向特定设备的模型结构优化框架}
+\label{fig:15-25}
 \end{figure}
 %-------------------------------------------

+\vspace{0.5em}
+\item 	模型压缩：此外，在不考虑设备自身算力条件的情况下，也有一些研究人员通过结构搜索的方式对基于Transformer的预训练模型进行压缩。例如，将Transformer模型拆分为若干小组件，然后通过基于采样的结构搜索方法对压缩后的模型结构进行搜索，尝试找到最优且高效的推断模型\upcite{DBLP:journals/corr/abs-2008-06808}。相类似，也有研究者在基于BERT的预训练模型上通过结构搜索的方式进行模型压缩，通过基于梯度的结构搜索方法，针对不同的下游任务所需知识将预训练的BERT模型压缩为相应的小模型\upcite{DBLP:conf/ijcai/ChenLQWLDDHLZ20}。
+\vspace{0.5em}
+\end{itemize}
+
+\parinterval 虽然受限于算力等条件，目前很多网络结构搜索的方法并没有直接在机器翻译任务中进行实验，但是这些方法并没有被限制在特定任务上。例如，可微分结构搜索方法被成功地用于学习更好的循环单元结构，这类方法完全可以应用在机器翻译任务上，不过大部分工作并没有在这个任务上进行尝试。此外，受到自然语言处理领域预训练模型的启发，一些研究人员也表示网络结构预搜索可能是一个有潜力的方向，也有研究人员尝试在大规模语言模型任务上进行结构搜索\upcite{DBLP:conf/acl/LiHZXJXZLL20}，然后将搜索到的模型结构应用到更多其他自然语言处理的任务中，这种方式有效提升了模型结构的可复用性，同时从大规模单语数据中获取到的信息可能比从特定任务下受限的数据集合中所得到的信息更加充分，能够更好地指导模型结构的设计。对于机器翻译任务而言，结构的预搜索同样是一个值得关注的研究方向。
+
+
+
--- a/Chapter8/chapter8.tex
+++ b/Chapter8/chapter8.tex
@@ -653,13 +653,13 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q

 \subsection{基于句法的翻译模型分类}

-\parinterval 可以说基于句法的翻译模型贯穿了现代统计机器翻译的发展历程。从概念上讲，不管是层次短语模型，还是语言学句法模型都是基于句法的模型。基于句法的机器翻译模型种类繁多，这里先对相关概念进行简要介绍，以避免后续论述中产生歧义。表\ref{tab:4-2}给出了基于句法的机器翻译中涉及的一些概念。
+\parinterval 可以说基于句法的翻译模型贯穿了现代统计机器翻译的发展历程。从概念上讲，不管是层次短语模型，还是语言学句法模型都是基于句法的模型。基于句法的机器翻译模型种类繁多，这里先对相关概念进行简要介绍，以避免后续论述中产生歧义。表\ref{tab:8-2}给出了基于句法的机器翻译中涉及的一些概念。

 %----------------------------------------------
 \begin{table}[htp]{
 \begin{center}
 \caption{基于句法的机器翻译中常用概念}
-\label{tab:4-2}
+\label{tab:8-2}
 {
 \begin{tabular}{p{6.5em} | l}
 术语 & 说明 \\

--- a/ChapterAppend/chapterappend.tex
+++ b/ChapterAppend/chapterappend.tex
@@ -26,6 +26,110 @@
 \begin{appendices}
 \chapter{附录A}
 \label{appendix-A}
+\parinterval  从实践的角度，机器翻译的发展主要可以归功于两方面的推动作用：开源系统和评测。开源系统通过代码共享的方式使得最新的研究成果可以快速传播，同时实验结果可以复现。而评测比赛，使得各个研究组织的成果可以进行科学的对比，共同推动机器翻译的发展与进步。此外，开源项目也促进了不同团队之间的协作，让研究人员在同一个平台上集中力量攻关。
+
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+
+\section{统计机器翻译开源系统}
+
+\begin{itemize}
+\vspace{0.5em}
+\item NiuTrans.SMT。NiuTrans\upcite{Tong2012NiuTrans}是由东北大学自然语言处理实验室自主研发的统计机器翻译系统，该系统可支持基于短语的模型、基于层次短语的模型以及基于句法的模型。由于使用C++ 语言开发，所以该系统运行速度快，所占存储空间少。系统中内嵌有$n$-gram语言模型，故无需使用其他的系统即可完成语言建模。网址：\url{http://opensource.niutrans.com/smt/index.html}
+\vspace{0.5em}
+\item Moses。Moses\upcite{Koehn2007Moses}是统计机器翻译时代最著名的系统之一，（主要）由爱丁堡大学的机器翻译团队开发。最新的Moses系统支持很多的功能，例如，它既支持基于短语的模型，也支持基于句法的模型。Moses 提供因子化翻译模型（Factored Translation Model），因此该模型可以很容易地对不同层次的信息进行建模。此外，它允许将混淆网络和字格作为输入，可缓解系统的1-best输出中的错误。Moses还提供了很多有用的脚本和工具，被机器翻译研究者广泛使用。网址：\url{http://www.statmt.org/moses/}
+\vspace{0.5em}
+\item Joshua。Joshua\upcite{Li2010Joshua}是由约翰霍普金斯大学的语言和语音处理中心开发的层次短语翻译系统。由于Joshua是由Java语言开发，所以它在不同的平台上运行或开发时具有良好的可扩展性和可移植性。Joshua也是使用非常广泛的开源机器翻译系统之一。网址：\url{https://cwiki.apache.org/confluence/display/JOSHUA/}
+\vspace{0.5em}
+\item SilkRoad。SilkRoad是由五个国内机构（中科院计算所、中科院软件所、中科院自动化所、厦门大学和哈尔滨工业大学）联合开发的基于短语的统计机器翻译系统。该系统是中国乃至亚洲地区第一个开源的统计机器翻译系统。SilkRoad支持多种解码器和规则提取模块，这样可以组合成不同的系统，提供多样的选择。网址：\url{http://www.nlp.org.cn/project/project.php?projid=14}
+\vspace{0.5em}
+\item SAMT。SAMT\upcite{zollmann2007the}是由卡内基梅隆大学机器翻译团队开发的语法增强的统计机器翻译系统。SAMT在解码的时候使用目标树来生成翻译规则，而不严格遵守目标语言的语法。SAMT 的一个亮点是它提供了简单但高效的方式在机器翻译中使用句法信息。由于SAMT在hadoop中实现，它可受益于大数据集的分布式处理。网址：\url{http://www.cs.cmu.edu/zollmann/samt/}
+\vspace{0.5em}
+\item HiFST。HiFST\upcite{iglesias2009hierarchical}是剑桥大学开发的统计机器翻译系统。该系统完全基于有限状态自动机实现，因此非常适合对搜索空间进行有效的表示。网址：\url{http://ucam-smt.github.io/}
+\vspace{0.5em}
+\item cdec。cdec\upcite{dyer2010cdec}是一个强大的解码器，是由Chris Dyer 和他的合作者们一起开发。cdec的主要功能是它使用了翻译模型的一个统一的内部表示，并为结构预测问题的各种模型和算法提供了实现框架。所以，cdec也可以被用来做一个对齐系统或者一个更通用的学习框架。此外，由于使用C++语言编写，cdec的运行速度较快。网址：\url{http://cdec-decoder.org/index.php?title=MainPage}
+\vspace{0.5em}
+\item Phrasal。Phrasal\upcite{Cer2010Phrasal}是由斯坦福大学自然语言处理小组开发的系统。除了传统的基于短语的模型，Phrasal还支持基于非层次短语的模型，这种模型将基于短语的翻译延伸到非连续的短语翻译，增加了模型的泛化能力。网址：\url{http://nlp.stanford.edu/phrasal/}
+\vspace{0.5em}
+\item Jane。Jane\upcite{vilar2012jane}是一个基于短语和基于层次短语的机器翻译系统，由亚琛工业大学的人类语言技术与模式识别小组开发。Jane提供了系统融合模块，因此可以非常方便地对多个系统进行融合。网址：\url{https://www-i6.informatik.rwth-aachen.de/jane/}
+\vspace{0.5em}
+\item GIZA++。GIZA++\upcite{och2003systematic}是Franz Och研发的用于训练IBM模型1-5和HMM单词对齐模型的工具包。在早期，GIZA++是所有统计机器翻译系统中词对齐的标配工具。网址：\url{https://github.com/moses-smt/giza-pp}
+\vspace{0.5em}
+\item FastAlign。FastAlign\upcite{DBLP:conf/naacl/DyerCS13}是一个快速，无监督的词对齐工具，由卡内基梅隆大学开发。网址：\url{https://github.com/clab/fast\_align}
+\vspace{0.5em}
+\end{itemize}
+
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+\section{神经机器翻译开源系统}
+
+\begin{itemize}
+\vspace{0.5em}
+\item GroundHog。GroundHog\upcite{bahdanau2014neural}是基于Theano\upcite{al2016theano}框架、由蒙特利尔大学LISA 实验室使用Python语言编写的一个框架，旨在提供灵活而高效的方式来实现复杂的循环神经网络模型。它提供了包括LSTM在内的多种模型。Bahdanau等人在此框架上又编写了GroundHog神经机器翻译系统。该系统也被很多论文用作基线系统。网址：\url{https://github.com/lisa-groundhog/GroundHog}
+\vspace{0.5em}
+\item Nematus。Nematus\upcite{DBLP:journals/corr/SennrichFCBHHJL17}是英国爱丁堡大学开发的，基于Theano框架的神经机器翻译系统。该系统使用GRU作为隐层单元，支持多层网络。Nematus 编码端有正向和反向的编码方式，可以同时提取源语句子中的上下文信息。该系统的一个优点是，它可以支持输入端有多个特征的输入（例如词的词性等）。网址：\url{https://github.com/EdinburghNLP/nematus}
+\vspace{0.5em}
+\item ZophRNN。ZophRNN\upcite{zoph2016simple}是由南加州大学的Barret Zoph 等人使用C++语言开发的系统。ZophRNN既可以训练序列表示模型（如语言模型），也可以训练序列到序列的模型（如神经机器翻译模型）。当训练神经机器翻译系统时，ZophRNN也支持多源输入。网址：\url{https://github.com/isi-nlp/Zoph\_RNN}
+\vspace{0.5em}
+\item Fairseq。Fairseq\upcite{Ottfairseq}是由Facebook开发的，基于PyTorch框架的用以解决序列到序列问题的工具包，其中包括基于卷积神经网络、基于循环神经网络、基于Transformer的模型等。Fairseq是当今使用最广泛的神经机器翻译开源系统之一。网址：\url{https://github.com/facebookresearch/fairseq}
+\vspace{0.5em}
+\item Tensor2Tensor。Tensor2Tensor\upcite{Vaswani2018Tensor2TensorFN}是由谷歌推出的，基于TensorFlow框架的开源系统。该系统基于Transformer模型，因此可以支持大多数序列到序列任务。得益于Transformer 的网络结构，系统的训练速度较快。现在，Tensor2Tensor也是机器翻译领域广泛使用的开源系统之一。网址：\url{https://github.com/tensorflow/tensor2tensor}
+\vspace{0.5em}
+\item OpenNMT。OpenNMT\upcite{KleinOpenNMT}系统是由哈佛大学自然语言处理研究组开源的，基于Torch框架的神经机器翻译系统。OpenNMT系统的早期版本使用Lua 语言编写，现在也扩展到了TensorFlow和PyTorch，设计简单易用，易于扩展，同时保持效率和翻译精度。网址：\url{https://github.com/OpenNMT/OpenNMT}
+\vspace{0.5em}
+\item 斯坦福神经机器翻译开源代码库。斯坦福大学自然语言处理组（Stanford NLP）发布了一篇教程，介绍了该研究组在神经机器翻译上的研究信息，同时实现了多种翻译模型\upcite{luong2016acl_hybrid}。 网址：\url{https://nlp.stanford.edu/projects/nmt/}
+\vspace{0.5em}
+\item THUMT。清华大学NLP团队实现的神经机器翻译系统，支持Transformer等模型\upcite{ZhangTHUMT}。该系统主要基于TensorFlow和Theano实现，其中Theano版本包含了RNNsearch模型，训练方式包括MLE （Maximum Likelihood Estimate）, MRT（Minimum Risk Training）, SST（Semi-Supervised Training）。TensorFlow 版本实现了Seq2Seq, RNNsearch, Transformer三种基本模型。网址：\url{https://github.com/THUNLP-MT/THUMT}
+\vspace{0.5em}
+\item NiuTrans.NMT。由小牛翻译团队基于NiuTensor实现的神经机器翻译系统。支持循环神经网络、Transformer等结构，并支持语言建模、序列标注、机器翻译等任务。支持机器翻译GPU与CPU 训练及解码。其小巧易用，为开发人员提供快速二次开发基础。此外，NiuTrans.NMT已经得到了大规模应用，形成了支持304种语言翻译的小牛翻译系统。网址：\url{http://opensource.niutrans.com/niutensor/index.html}
+\vspace{0.5em}
+\item MARIAN。主要由微软翻译团队搭建\upcite{JunczysMarian}，是使用C++实现的用于GPU/CPU训练和解码的引擎，支持多GPU训练和批量解码，最小限度依赖第三方库，静态编译一次之后，复制其二进制文件就能在其他平台使用。网址：\url{https://marian-nmt.github.io/}
+\vspace{0.5em}
+\item Sockeye。由Awslabs开发的神经机器翻译框架\upcite{hieber2017sockeye}。其中支持RNNsearch、Transformer、CNN等翻译模型，同时提供了从图片翻译到文字的模块以及WMT 德英新闻翻译、领域适应任务、多语言零资源翻译任务的教程。网址：\url{https://awslabs.github.io/sockeye/}
+\vspace{0.5em}
+\item CytonMT。由NICT开发的一种用C++实现的神经机器翻译开源工具包\upcite{WangCytonMT}。主要支持Transformer模型，并支持一些常用的训练方法以及解码方法。网址：\url{https://github.com/arthurxlw/cytonMt}
+\vspace{0.5em}
+\item OpenSeq2Seq。由NVIDIA团队开发的\upcite{DBLP:journals/corr/abs-1805-10387}基于TensorFlow的模块化架构，用于序列到序列的模型，允许从可用组件中组装新模型，支持混合精度训练，利用NVIDIA Volta/Turing GPU中的Tensor核心，基于Horovod的快速分布式训练，支持多GPU，多节点多模式。网址：\url{https://nvidia.github.io/OpenSeq2Seq/html/index.html}
+\vspace{0.5em}
+\item NMTPyTorch。由勒芒大学语言实验室发布的基于序列到序列框架的神经网络翻译系统\upcite{nmtpy2017}，NMTPyTorch的核心部分依赖于Numpy，PyTorch和tqdm。其允许训练各种端到端神经体系结构，包括但不限于神经机器翻译、图像字幕和自动语音识别系统。网址：\url{https://github.com/lium-lst/nmtpytorch}
+\vspace{0.5em}
+\end{itemize}
+
+
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+\section{公开评测任务}
+\parinterval 机器翻译相关评测主要有两种组织形式，一种是由政府及国家相关机构组织，权威性强。如由美国国家标准技术研究所组织的NIST评测、日本国家科学咨询系统中心主办的NACSIS Test Collections for IR（NTCIR）PatentMT、日本科学振兴机构（Japan Science and Technology Agency，简称JST）等组织联合举办的Workshop on Asian Translation（WAT）以及国内由中文信息学会主办的全国机器翻译大会（China Conference on Machine Translation，简称CCMT）；另一种是由相关学术机构组织，具有领域针对性的特点，如倾向新闻领域的Conference on Machine Translation（WMT）以及面向口语的International Workshop on Spoken Language Translation（IWSLT）。下面将针对上述评测进行简要介绍。
+
+\begin{itemize}
+\vspace{0.5em}
+\item CCMT（全国机器翻译大会），前身为CWMT（全国机器翻译研讨会），是国内机器翻译领域的旗舰会议，自2005年起已经组织多次机器翻译评测，对国内机器翻译相关技术的发展产生了深远影响。该评测主要针对汉语、英语以及国内的少数民族语言（蒙古语、藏语、维吾尔语等）进行评测，领域包括新闻、口语、政府文件等，不同语言方向对应的领域也有所不同。评价方式不同届略有不同，主要采用自动评价的方式，自CWMT\ 2013起则针对某些领域增设人工评价。自动评价的指标一般包括BLEU-SBP、BLEU-NIST、TER、METEOR、NIST、GTM、mWER、mPER 以及ICT 等，其中以BLEU-SBP 为主，汉语为目标语的翻译采用基于字符的评价方式，面向英语的翻译采用基于词的评价方式。每年该评测吸引国内外数十家企业及科研机构参赛，业内认可度极高。关于CCMT的更多信息可参考中文信息学会机器翻译专业委员会相关页面：\url{http://sc.cipsc.org.cn/mt/index.php/CWMT.html}。
+\vspace{0.5em}
+\item WMT由Special Interest Group for Machine Translation（SIGMT）主办，会议自2006年起每年召开一次，是一个涉及机器翻译多种任务的综合性会议，包括多领域翻译评测任务、质量评价任务以及其他与机器翻译相关的任务（如文档对齐评测等）。现在WMT已经成为机器翻译领域的旗舰评测会议，很多研究工作都以WMT评测结果作为基准。WMT评测涉及的语言范围较广，包括英语、德语、芬兰语、捷克语、罗马尼亚语等十多种语言，翻译方向一般以英语为核心，探索英语与其他语言之间的翻译性能，领域包括新闻、信息技术、生物医学。最近，也增加了无指导机器翻译等热门问题。WMT在评价方面类似于CCMT，也采用人工评价与自动评价相结合的方式，自动评价的指标一般为BLEU、TER 等。此外，WMT公开了所有评测数据，因此也经常被机器翻译相关人员所使用。更多WMT的机器翻译评测相关信息可参考SIGMT官网：\url{http://www.sigmt.org/}。
+\vspace{0.5em}
+\item NIST机器翻译评测开始于2001年，是早期机器翻译公开评测中颇具代表性的任务，现在WMT和CCMT很多任务的设置也大量参考了当年NIST评测的内容。NIST评测由美国国家标准技术研究所主办，作为美国国防高级计划署（DARPA）中TIDES计划的重要组成部分。早期，NIST评测主要评价阿拉伯语和汉语等语言到英语的翻译效果，评价方法一般采用人工评价与自动评价相结合的方式。人工评价采用5分制评价。自动评价使用多种方式，包括BLEU，METEOR，TER以及HyTER。此外NIST从2016 年起开始对稀缺语言资源技术进行评估，其中机器翻译作为其重要组成部分共同参与评测，评测指标主要为BLEU。除对机器翻译系统进行评测之外，NIST在2008 和2010年对于机器翻译的自动评价方法（MetricsMaTr）也进行了评估，以鼓励更多研究人员对现有评价方法进行改进或提出更加贴合人工评价的方法。同时NIST评测所提供的数据集由于数据质量较高受到众多科研人员喜爱，如MT04，MT06等（汉英）平行语料经常被科研人员在实验中使用。不过，近几年NIST评测已经停止。更多NIST的机器翻译评测相关信息可参考官网：\url{https://www.nist.gov/programs-projects/machine-translation}。
+\vspace{0.5em}
+\item 从2004年开始举办的IWSLT也是颇具特色的机器翻译评测，它主要关注口语相关的机器翻译任务，测试数据包括TED talks的多语言字幕以及QED 教育讲座影片字幕等，语言涉及英语、法语、德语、捷克语、汉语、阿拉伯语等众多语言。此外在IWSLT 2016 中还加入了对于日常对话的翻译评测，尝试将微软Skype中一种语言的对话翻译成其他语言。评价方式采用自动评价的模式，评价标准和WMT类似，一般为BLEU 等指标。另外，IWSLT除了对文本到文本的翻译评测外，还有自动语音识别以及语音转另一种语言的文本的评测。更多IWSLT的机器翻译评测相关信息可参考IWSLT\ 2019官网：\url{https://workshop2019.iwslt.org/}。
+\vspace{0.5em}
+\item 日本举办的机器翻译评测WAT是亚洲范围内的重要评测之一，由日本科学振兴机构（JST）、情报通信研究机构（NICT）等多家机构共同组织，旨在为亚洲各国之间的交流融合提供便利。语言方向主要包括亚洲主流语言（汉语、韩语、印地语等）以及英语对日语的翻译，领域丰富多样，包括学术论文、专利、新闻、食谱等。评价方式包括自动评价（BLEU、RIBES以及AMFM 等）以及人工评价，其特点在于对于测试语料以段落为单位进行评价，考察其上下文关联的翻译效果。更多WAT的机器翻译评测相关信息可参考官网：\url{http://lotus.kuee.kyoto-u.ac.jp/WAT/}。
+\vspace{0.5em}
+\item NTCIR计划是由日本国家科学咨询系统中心策划主办的，旨在建立一个用在自然语言处理以及信息检索相关任务上的日文标准测试集。在NTCIR-9和NTCIR-10中开设的Patent Machine Translation（PatentMT）任务主要针对专利领域进行翻译测试，其目的在于促进机器翻译在专利领域的发展和应用。在NTCIR-9中，评测方式采取人工评价与自动评价相结合，以人工评价为主导。人工评价主要根据准确度和流畅度进行评估，自动评价采用BLEU、NIST等方式进行。NTCIR-10评价方式在此基础上增加了专利审查评估、时间评估以及多语种评估，分别考察机器翻译系统在专利领域翻译的实用性、耗时情况以及不同语种的翻译效果等。更多NTCIR评测相关信息可参考官网：\url{http://research.nii.ac.jp/ntcir/index-en.html}。
+\vspace{0.5em}
+\end{itemize}
+
+\parinterval 以上评测数据大多可以从评测网站上下载，此外部分数据也可以从LDC（Linguistic Data Consortium）上申请，网址为\url{https://www.ldc.upenn.edu/}。ELRA（European Language Resources Association）上也有一些免费的语料库供研究使用，其官网为\url{http://www.elra.info/}。从机器翻译发展的角度看，这些评测任务给相关研究提供了基准数据集，使得不同的系统都可以在同一个环境下进行比较和分析，进而建立了机器翻译研究所需的实验基础。此外，公开评测也使得研究者可以第一时间了解机器翻译研究的最新成果，比如，有多篇ACL会议最佳论文的灵感就来自当年参加机器翻译评测任务的系统。
+
+\end{appendices}
+%----------------------------------------------------------------------------------------
+%	CHAPTER  APPENDIX B
+%----------------------------------------------------------------------------------------
+
+\begin{appendices}
+\chapter{附录B}
+\label{appendix-B}
 \parinterval 在构建机器翻译系统的过程中，数据是必不可少的，尤其是现在主流的神经机器翻译系统，系统的性能往往受限于语料库规模和质量。所幸的是，随着语料库语言学的发展，一些主流语种的相关语料资源已经十分丰富。

 \parinterval 为了方便读者进行相关研究，我们汇总了几个常用的基准数据集，这些数据集已经在机器翻译领域中被广泛使用，有很多之前的相关工作可以进行复现和对比。同时，我们收集了一下常用的平行语料，方便读者进行一些探索。
@@ -161,12 +265,12 @@
 \end{appendices}

 %----------------------------------------------------------------------------------------
-%	CHAPTER  APPENDIX B
+%	CHAPTER  APPENDIX C
 %----------------------------------------------------------------------------------------

 \begin{appendices}
-\chapter{附录B}
-\label{appendix-B}
+\chapter{附录C}
+\label{appendix-C}

 %----------------------------------------------------------------------------------------
 %    NEW SECTION

--- a/bibliography.bib
+++ b/bibliography.bib
@@ -3854,16 +3854,11 @@ year = {2012}
 %%%%% chapter 9------------------------------------------------------
 @article{brown1992class,
  title={Class-based n-gram models of natural language},
-  author={Brown and
-              Peter F and
-              Desouza and
-              Peter V and
-              Mercer amd
-              Robert L
-              and Pietra and
-              Vincent J Della
-              and Lai and
-              Jenifer C},
+  author={Peter F. Brown and
+               Vincent J. Della Pietra and
+               Peter V. De Souza and
+               Jennifer C. Lai and
+               Robert L. Mercer},
  journal={Computational linguistics},
  volume={18},
  number={4},
@@ -3873,10 +3868,8 @@ year = {2012}

 @inproceedings{mikolov2012context,
  title={Context dependent recurrent neural network language model},
-  author={Mikolov and
-            Tomas and
-            Zweig and
-            Geoffrey},
+  author={Tomas Mikolov and
+               Geoffrey Zweig},
  publisher={IEEE Spoken Language Technology Workshop},
  pages={234--239},
  year={2012}
@@ -3884,38 +3877,28 @@ year = {2012}

 @article{zaremba2014recurrent,
  title={Recurrent Neural Network Regularization},
-  author={Zaremba and
-             Wojciech and
-             Sutskever and
-             Ilya and
-             Vinyals and
-             Oriol},
+  author={Wojciech Zaremba and
+               Ilya Sutskever and
+               Oriol Vinyals},
  journal={arXiv: Neural and Evolutionary Computing},
  year={2014}
 }

 @article{zilly2016recurrent,
  title={Recurrent Highway Networks},
-  author={Zilly and
-            Julian and
-            Srivastava and
-            Rupesh Kumar and
-            Koutnik and
-            Jan and
-            Schmidhuber and
-            Jurgen},
+  author={Julian G. Zilly and
+               Rupesh Kumar Srivastava and
+               Jan Koutn{\'{\i}}k and
+               J{\"{u}}rgen Schmidhuber},
  journal={International Conference on Machine Learning},
  year={2016}
 }

 @article{merity2017regularizing,
  title={Regularizing and optimizing LSTM language models},
-  author={Merity and
-             tephen and
-             Keskar and
-             Nitish Shirish and
-             Socher and
-             Richard},
+  author={Stephen Merity and
+               Nitish Shirish Keskar and
+               Richard Socher},
  journal={International Conference on Learning Representations},
  year={2017}
 }
@@ -3993,7 +3976,7 @@ year = {2012}
 @article{Ba2016LayerN,
  author    = {Lei Jimmy Ba and
               Jamie Ryan Kiros and
-               Geoffrey E. Hinton},
+               Geoffrey Hinton},
  title     = {Layer Normalization},
  journal   = {CoRR},
  volume    = {abs/1607.06450},
@@ -4018,7 +4001,7 @@ year = {2012}
               Satoshi Nakamura},
  title     = {Incorporating Discrete Translation Lexicons into Neural Machine Translation},
  pages     = {1557--1567},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2016}
 }

@@ -4062,7 +4045,7 @@ year = {2012}
  year={2011}
 }
 @inproceedings{mccann2017learned,
-  author    = {Bryan McCann and
+  author    = {Bryan Mccann and
               James Bradbury and
               Caiming Xiong and
               Richard Socher},
@@ -4081,15 +4064,15 @@ year = {2012}
 		  Matt Gardner and 
 		  Christopher Clark and 
 		  Kenton Lee and 
-		  L. Zettlemoyer},
-  publisher={arXiv preprint arXiv:1802.05365},
+		  Luke Zettlemoyer},
+  publisher={Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
  year={2018}
 }


 @article{Graves2013HybridSR,
  title={Hybrid speech recognition with Deep Bidirectional LSTM},
-  author={A. Graves and 
+  author={Alex Graves and 
          Navdeep Jaitly and 
 		  Abdel-rahman Mohamed},
  publisher={IEEE Workshop on Automatic Speech Recognition and Understanding},
@@ -4101,7 +4084,7 @@ year = {2012}
  title={Character-Word LSTM Language Models},
  author={Lyan Verwimp and 
          Joris Pelemans and 
-		  H. V. Hamme and 
+		  Hugo Van Hamme and 
 		  Patrick Wambacq},
  publisher={European Association of Computational Linguistics},
  year={2017}
@@ -4112,7 +4095,7 @@ year = {2012}
               Kyunghyun Cho},
  title     = {Gated Word-Character Recurrent Language Model},
  pages     = {1992--1997},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2016}
 }
 @inproceedings{Hwang2017CharacterlevelLM,
@@ -4146,7 +4129,7 @@ year = {2012}
  title={Larger-Context Language Modelling},
  author={Tian Wang and 
          Kyunghyun Cho},
-  journal={arXiv preprint arXiv:1511.03729},
+  journal={Annual Meeting of the Association for Computational Linguistics},
  year={2015}
 }
 @article{Adel2015SyntacticAS,
@@ -4174,7 +4157,7 @@ year = {2012}
 }
 @inproceedings{Pham2016ConvolutionalNN,
  title={Convolutional Neural Network Language Models},
-  author={Ngoc-Quan Pham and 
+  author={Ngoc-quan Pham and 
          German Kruszewski and 
 		  Gemma Boleda},
  publisher={Conference on Empirical Methods in Natural Language Processing},
@@ -4268,9 +4251,9 @@ year = {2012}
 @inproceedings{Bastings2017GraphCE,
  title={Graph Convolutional Encoders for Syntax-aware Neural Machine Translation},
  author={Jasmijn Bastings and 
-          Ivan Titov and W. Aziz and 
+          Ivan Titov and Wilker Aziz and 
 		  Diego Marcheggiani and 
-		  K. Sima'an},
+		  Khalil Sima'an},
  publisher={Conference on Empirical Methods in Natural Language Processing},
  year={2017}
 }
@@ -4727,8 +4710,8 @@ author    = {Yoshua Bengio and
               Quoc V. Le and
               Ruslan Salakhutdinov},
  title     = {Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context},
-  journal   = {CoRR},
-  volume    = {abs/1901.02860},
+  journal   = {Annual Meeting of the Association for Computational Linguistics},
+  pages     = {2978--2988},
  year      = {2019}
 }
 @inproceedings{li-etal-2019-word,
@@ -4810,7 +4793,7 @@ author    = {Yoshua Bengio and
  year      = {2017}
 }
 @article{Hinton2015Distilling,
-  author    = {Geoffrey E. Hinton and
+  author    = {Geoffrey Hinton and
               Oriol Vinyals and
               Jeffrey Dean},
  title     = {Distilling the Knowledge in a Neural Network},
@@ -4821,7 +4804,7 @@ author    = {Yoshua Bengio and

 @inproceedings{Ott2018ScalingNM,
  title={Scaling Neural Machine Translation},
-  author={Myle Ott and Sergey Edunov and David Grangier and M. Auli},
+  author={Myle Ott and Sergey Edunov and David Grangier and Michael Auli},
  publisher={Annual Meeting of the Association for Computational Linguistics},
  year={2018}
 }
@@ -4842,7 +4825,7 @@ author    = {Yoshua Bengio and
               Alexander M. Rush},
  title     = {Sequence-Level Knowledge Distillation},
  pages     = {1317--1327},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2016}
 }
 @article{Akaike1969autoregressive,
@@ -4878,7 +4861,7 @@ author    = {Yoshua Bengio and
 }
 @inproceedings{He2018LayerWiseCB,
  title={Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation},
-  author={Tianyu He and X. Tan and Yingce Xia and D. He and T. Qin and Zhibo Chen and T. Liu},
+  author={Tianyu He and Xu Tan and Yingce Xia and Di He and Tao Qin and Zhibo Chen and Tie-Yan Liu},
  publisher={Conference on Neural Information Processing Systems},
  year={2018}
 }
@@ -4956,7 +4939,7 @@ author    = {Yoshua Bengio and
               Deyi Xiong},
  title     = {Encoding Gated Translation Memory into Neural Machine Translation},
  pages     = {3042--3047},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2018}
 }
 @inproceedings{yang-etal-2016-hierarchical,
@@ -4968,7 +4951,7 @@ author    = {Yoshua Bengio and
               Eduard H. Hovy},
  title     = {Hierarchical Attention Networks for Document Classification},
  pages     = {1480--1489},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
  year      = {2016}
 }
 %%%%% chapter 10------------------------------------------------------
@@ -4983,7 +4966,7 @@ author    = {Yoshua Bengio and
               Jian Sun},
  title     = {Faster {R-CNN:} Towards Real-Time Object Detection with Region Proposal
               Networks},
-  journal   = {Institute of Electrical and Electronics Engineers},
+  journal   = {{IEEE} Transactions on Pattern Analysis and Machine Intelligence},
  volume    = {39},
  number    = {6},
  pages     = {1137--1149},
@@ -5002,7 +4985,6 @@ author    = {Yoshua Bengio and
  publisher = {European Conference on Computer Vision},
  volume    = {9905},
  pages     = {21--37},
-  publisher = {Springer},
  year      = {2016}
 }

@@ -5028,7 +5010,7 @@ author    = {Yoshua Bengio and
               Qun Liu},
  title     = {genCNN: {A} Convolutional Architecture for Word Sequence Prediction},
  pages     = {1567--1576},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2015}
 }

@@ -5038,7 +5020,7 @@ author    = {Yoshua Bengio and
               Navdeep Jaitly},
  title     = {Very deep convolutional networks for end-to-end speech recognition},
  pages     = {4845--4849},
-  publisher = {Institute of Electrical and Electronics Engineers},
+  publisher = {International Conference on Acoustics, Speech and Signal Processing},
  year      = {2017}
 }

@@ -5049,7 +5031,7 @@ author    = {Yoshua Bengio and
  title     = {A deep convolutional neural network using heterogeneous pooling for
               trading acoustic invariance with phonetic confusion},
  pages     = {6669--6673},
-  publisher = {Institute of Electrical and Electronics Engineers},
+  publisher = {International Conference on Acoustics, Speech and Signal Processing},
  year      = {2013}
 }

@@ -5058,8 +5040,7 @@ author    = {Yoshua Bengio and
               Hieu Pham and
               Christopher D. Manning},
  title     = {Effective Approaches to Attention-based Neural Machine Translation},
-  publisher = {Conference on Empirical Methods in Natural
-               Language Processing},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  pages     = {1412--1421},
  year      = {2015}
 }
@@ -5083,7 +5064,7 @@ author    = {Yoshua Bengio and
  title     = {Leveraging Linguistic Structures for Named Entity Recognition with
               Bidirectional Recursive Neural Networks},
  pages     = {2664--2669},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2017}
 }

@@ -5099,10 +5080,10 @@ author    = {Yoshua Bengio and
  author    = {Emma Strubell and
               Patrick Verga and
               David Belanger and
-               Andrew McCallum},
+               Andrew Mccallum},
  title     = {Fast and Accurate Entity Recognition with Iterated Dilated Convolutions},
  pages     = {2670--2680},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2017}
 }

@@ -5168,7 +5149,7 @@ author    = {Yoshua Bengio and
               Tommi S. Jaakkola},
  title     = {Molding CNNs for text: non-linear, non-consecutive convolutions},
  pages     = {1565--1575},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2015}
 }

@@ -5178,7 +5159,7 @@ author    = {Yoshua Bengio and
  title     = {Effective Use of Word Order for Text Categorization with Convolutional
               Neural Networks},
  pages     = {103--112},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
  year      = {2015}
 }

@@ -5187,7 +5168,7 @@ author    = {Yoshua Bengio and
               Ralph Grishman},
  title     = {Relation Extraction: Perspective from Convolutional Neural Networks},
  pages     = {39--48},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
  year      = {2015}
 }

@@ -5205,7 +5186,7 @@ author    = {Yoshua Bengio and
               Barry Haddow and
               Alexandra Birch},
  title     = {Improving Neural Machine Translation Models with Monolingual Data},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2016}
 }

@@ -5220,7 +5201,7 @@ author    = {Yoshua Bengio and

 @article{Waibel1989PhonemeRU,
  title={Phoneme recognition using time-delay neural networks},
-  author={Alexander Waibel and Toshiyuki Hanazawa and Geoffrey Everest Hinton and Kiyohiro Shikano and K.J. Lang},
+  author={Alexander Waibel and Toshiyuki Hanazawa and Geoffrey Hinton and Kiyohiro Shikano and Kevin J. Lang},
  journal={IEEE Transactions on Acoustics, Speech, and Signal Processing},
  year={1989},
  volume={37},
@@ -5229,7 +5210,7 @@ author    = {Yoshua Bengio and

 @article{LeCun1989BackpropagationAT,
  title={Backpropagation Applied to Handwritten Zip Code Recognition},
-  author={Yann LeCun and Bernhard Boser and John Denker and Don Henderson and R.E.Howard and W.E. Hubbard and Larry Jackel},
+  author={Yann LeCun and Bernhard Boser and John Denker and Don Henderson and Richard E. Howard and Wayne E. Hubbard and Larry Jackel},
  journal={Neural Computation},
  year={1989},
  volume={1},
@@ -5243,7 +5224,7 @@ author    = {Yoshua Bengio and
  year={1998},
  volume={86},
  number={11},
-  pages={2278-2324},
+  pages={2278--2324}
 }

 @inproceedings{DBLP:journals/corr/HeZRS15,
@@ -5254,7 +5235,7 @@ author    = {Yoshua Bengio and
  title     = {Deep Residual Learning for Image Recognition},
  publisher = {{IEEE} Conference on Computer Vision and Pattern Recognition},
  pages     = {770--778},
-  year      = {2016},
+  year      = {2016}
 }

 @inproceedings{DBLP:conf/cvpr/HuangLMW17,
@@ -5279,10 +5260,9 @@ author    = {Yoshua Bengio and
 @article{He2020MaskR,
  title={Mask R-CNN},
  author={Kaiming He and Georgia Gkioxari and Piotr Doll{\'a}r and Ross B. Girshick},
-  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
-  year={2020},
-  volume={42},
-  pages={386-397}
+  journal={International Conference on Computer Vision},
+  pages={2961--2969},
+  year={2017}
 }

 @inproceedings{Kalchbrenner2014ACN,
@@ -5317,7 +5297,7 @@ author    = {Yoshua Bengio and
  author    = {C{\'{\i}}cero Nogueira dos Santos and
               Maira Gatti},
  pages     = {69--78},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {International Conference on Computational Linguistics},
  year={2014}
 }

@@ -5374,7 +5354,7 @@ author    = {Yoshua Bengio and
 		 Michael Auli},
 title = {Pay Less Attention with Lightweight and Dynamic Convolutions},
 publisher = {International Conference on Learning Representations},
- year = {2019},
+ year = {2019}
 }

 @inproceedings{kalchbrenner-blunsom-2013-recurrent,
@@ -5382,8 +5362,8 @@ author    = {Yoshua Bengio and
               Phil Blunsom},
  title     = {Recurrent Continuous Translation Models},
  pages     = {1700--1709},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
-  year      = {2013},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2013}
 }

 @article{Wu2016GooglesNM,
@@ -5459,7 +5439,7 @@ author    = {Yoshua Bengio and
  author    = {Ilya Sutskever and
               James Martens and
               George E. Dahl and
-               Geoffrey Everest Hinton},
+               Geoffrey Hinton},
  publisher = {International Conference on Machine Learning},
  pages     = {1139--1147},
  year={2013}
@@ -5474,7 +5454,7 @@ author    = {Yoshua Bengio and
 }

 @article{JMLR:v15:srivastava14a,
-  author  = {Nitish Srivastava and Geoffrey Everest Hinton and Alex Krizhevsky and Ilya Sutskever and Ruslan Salakhutdinov},
+  author  = {Nitish Srivastava and Geoffrey Hinton and Alex Krizhevsky and Ilya Sutskever and Ruslan Salakhutdinov},
  title   = {Dropout: A Simple Way to Prevent Neural Networks from Overfitting},
  journal = {Journal of Machine Learning Research},
  year    = {2014},
@@ -5508,7 +5488,7 @@ author    = {Yoshua Bengio and
  title={Rigid-motion scattering for image classification},
  author={Sifre, Laurent and Mallat, St{\'e}phane},
  year={2014},
-  publisher={Citeseer}
+  journal={Citeseer}
 }

 @article{Taigman2014DeepFaceCT,
@@ -5567,7 +5547,7 @@ author    = {Yoshua Bengio and
               Tong Zhang},
  title     = {Deep Pyramid Convolutional Neural Networks for Text Categorization},
  pages     = {562--570},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }

@@ -5596,7 +5576,7 @@ author    = {Yoshua Bengio and
  title     = {Speech-Transformer: {A} No-Recurrence Sequence-to-Sequence Model for
               Speech Recognition},
  pages     = {5884--5888},
-  publisher = {Institute of Electrical and Electronics Engineers},
+  publisher = {International Conference on Acoustics, Speech and Signal Processing},
  year      = {2018}
 }

@@ -5774,14 +5754,14 @@ author    = {Yoshua Bengio and
 }
 @article{Liu2020LearningTE,
 	title={Learning to Encode Position for Transformer with Continuous Dynamical Model},
-	author={Xuanqing Liu and Hsiang-Fu Yu and I. Dhillon and Cho-Jui Hsieh},
+	author={Xuanqing Liu and Hsiang-Fu Yu and Inderjit Dhillon and Cho-Jui Hsieh},
 	journal={ArXiv},
 	year={2020},
 	volume={abs/2003.09229}
 }
 @inproceedings{Jawahar2019WhatDB,
 	title={What Does BERT Learn about the Structure of Language?},
-	author={Ganesh Jawahar and B. Sagot and Djam{\'e} Seddah},
+	author={Ganesh Jawahar and Beno{\^{\i}}t Sagot and Djam{\'e} Seddah},
 	publisher={Annual Meeting of the Association for Computational Linguistics},
 	year={2019}
 }
@@ -7760,148 +7740,1185 @@ author    = {Zhuang Liu and
  year      = {2017}
 }

-
-%%%%% chapter 15------------------------------------------------------
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%% chapter 16------------------------------------------------------
-@inproceedings{DBLP:conf/wmt/CurreyBH17,
-  author    = {Anna Currey and
-               Antonio Valerio Miceli Barone and
-               Kenneth Heafield},
-  title     = {Copied Monolingual Data Improves Low-Resource Neural Machine Translation},
-  pages     = {148--156},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
-  year      = {2017}
-}
-
-@inproceedings{DBLP:conf/emnlp/EdunovOAG18,
-  author    = {Sergey Edunov and
-               Myle Ott and
-               Michael Auli and
-               David Grangier},
-  title     = {Understanding Back-Translation at Scale},
-  pages     = {489--500},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
-  year      = {2018}
-}
-@inproceedings{DBLP:conf/emnlp/FadaeeM18,
-  author    = {Marzieh Fadaee and
-               Christof Monz},
-  title     = {Back-Translation Sampling by Targeting Difficult Words in Neural Machine
-               Translation},
-  pages     = {436--446},
+@inproceedings{Bapna2018TrainingDN,
+  author    = {Ankur Bapna and
+               Mia Xu Chen and
+               Orhan Firat and
+               Yuan Cao and
+               Yonghui Wu},
+  title     = {Training Deeper Neural Machine Translation Models with Transparent
+               Attention},
+  pages     = {3028--3033},
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2018}
 }
-@inproceedings{DBLP:conf/nlpcc/XuLXLLXZ19,
-  author    = {Nuo Xu and
-               Yinqiao Li and
-               Chen Xu and
-               Yanyang Li and
+
+@inproceedings{WangLearning,
+  author    = {Qiang Wang and
               Bei Li and
               Tong Xiao and
-               Jingbo Zhu},
-  title     = {Analysis of Back-Translation Methods for Low-Resource Neural Machine
-               Translation},
-  volume    = {11839},
-  pages     = {466--475},
-  publisher = {Springer},
-  year      = {2019}
-}
-@inproceedings{DBLP:conf/wmt/CaswellCG19,
-  author    = {Isaac Caswell and
-               Ciprian Chelba and
-               David Grangier},
-  title     = {Tagged Back-Translation},
-  pages     = {53--63},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
-  year      = {2019}
-}
-@inproceedings{DBLP:conf/emnlp/WangLWLS19,
-  author    = {Shuo Wang and
-               Yang Liu and
-               Chao Wang and
-               Huanbo Luan and
-               Maosong Sun},
-  title     = {Improving Back-Translation with Uncertainty-based Confidence Estimation},
-  pages     = {791--802},
+               Jingbo Zhu and
+               Changliang Li and
+               Derek F. Wong and
+               Lidia S. Chao},
+  title     = {Learning Deep Transformer Models for Machine Translation},
+  pages     = {1810--1822},
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
-@article{DBLP:journals/corr/abs200111327,
-  author    = {Idris Abdulmumin and
-               Bashir Shehu Galadanci and
-               Abubakar Isa},
-  title     = {Iterative Batch Back-Translation for Neural Machine Translation: {A}
-               Conceptual Model},
+
+@article{DBLP:journals/corr/abs-2002-04745,
+  author    = {Ruibin Xiong and
+               Yunchang Yang and
+               Di He and
+               Kai Zheng and
+               Shuxin Zheng and
+               Chen Xing and
+               Huishuai Zhang and
+               Yanyan Lan and
+               Liwei Wang and
+               Tie{-}Yan Liu},
+  title     = {On Layer Normalization in the Transformer Architecture},
  journal   = {CoRR},
+  volume    = {abs/2002.04745},
  year      = {2020}
 }
-@article{DBLP:journals/corr/abs200403672,
-  author    = {Zi-Yi Dou and
-               Antonios Anastasopoulos and
-               Graham Neubig},
-  title     = {Dynamic Data Selection and Weighting for Iterative Back-Translation},
-  journal   = {CoRR},
+
+@inproceedings{DBLP:conf/emnlp/LiuLGCH20,
+  author    = {Liyuan Liu and
+               Xiaodong Liu and
+               Jianfeng Gao and
+               Weizhu Chen and
+               Jiawei Han},
+  title     = {Understanding the Difficulty of Training Transformers},
+  pages     = {5747--5763},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2020}
 }
-@inproceedings{DBLP:conf/emnlp/WuZHGQLL19,
-  author    = {Lijun Wu and
-               Jinhua Zhu and
-               Di He and
-               Fei Gao and
-               Tao Qin and
-               Jianhuang Lai and
-               Tie-Yan Liu},
-  title     = {Machine Translation With Weakly Paired Documents},
-  pages     = {4374--4383},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
-  year      = {2019}
+
+@inproceedings{DBLP:journals/corr/HeZRS15,
+  author    = {Kaiming He and
+               Xiangyu Zhang and
+               Shaoqing Ren and
+               Jian Sun},
+  title     = {Deep Residual Learning for Image Recognition},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  pages     = {770--778},
+  year      = {2016}
 }
-@article{DBLP:journals/corr/abs-1901-09069,
-  author    = {Felipe Almeida and
-               Geraldo Xex{\'{e}}o},
-  title     = {Word Embeddings: {A} Survey},
+
+@article{Ba2016LayerN,
+  author    = {Lei Jimmy Ba and
+               Jamie Ryan Kiros and
+               Geoffrey Hinton},
+  title     = {Layer Normalization},
  journal   = {CoRR},
-  year      = {2019}
+  volume    = {abs/1607.06450},
+  year      = {2016}
 }
-@article{DBLP:journals/corr/abs-2002-06823,
-  author    = {Jinhua Zhu and
-               Yingce Xia and
-               Lijun Wu and
-               Di He and
-               Tao Qin and
-               Wengang Zhou and
-               Houqiang Li and
-               Tie-Yan Liu},
-  title     = {Incorporating {BERT} into Neural Machine Translation},
-  journal   = {CoRR},
-  year      = {2020}
+
+@inproceedings{Vaswani2018Tensor2TensorFN,
+   author    = {Ashish Vaswani and
+               Samy Bengio and
+               Eugene Brevdo and
+               Fran{\c{c}}ois Chollet and
+               Aidan N. Gomez and
+               Stephan Gouws and
+               Llion Jones and
+               Lukasz Kaiser and
+               Nal Kalchbrenner and
+               Niki Parmar and
+               Ryan Sepassi and
+               Noam Shazeer and
+               Jakob Uszkoreit},
+  title     = {Tensor2Tensor for Neural Machine Translation},
+  pages     = {193--199},
+  publisher = {Association for Machine Translation in the Americas},
+  year      = {2018}
 }
-@inproceedings{song2019mass,
-  author    = {Kaitao Song and
-               Xu Tan and
-               Tao Qin and
-               Jianfeng Lu and
-               Tie-Yan Liu},
-  title     = {{MASS:} Masked Sequence to Sequence Pre-training for Language Generation},
-  volume    = {97},
-  pages     = {5926--5936},
-  publisher = {{PMLR}},
+
+@inproceedings{Dou2019DynamicLA,
+  author    = {Zi-Yi Dou and
+               Zhaopeng Tu and
+               Xing Wang and
+               Longyue Wang and
+               Shuming Shi and
+               Tong Zhang},
+  title     = {Dynamic Layer Aggregation for Neural Machine Translation with Routing-by-Agreement},
+  pages     = {86--93},
+  publisher = {AAAI Conference on Artificial Intelligence},
  year      = {2019}
 }
-@article{DBLP:journals/corr/Ruder17a,
-  author    = {Sebastian Ruder},
-  title     = {An Overview of Multi-Task Learning in Deep Neural Networks},
-  journal   = {CoRR},
-  volume    = {abs/1706.05098},
-  year      = {2017}
+
+@article{Wang2018MultilayerRF,
+  title={Multi-layer Representation Fusion for Neural Machine Translation},
+  author={Qiang Wang and Fuxue Li and Tong Xiao and Yanyang Li and Yinqiao Li and Jingbo Zhu},
+  journal={ArXiv},
+  year={2018},
+  volume={abs/2002.06714}
 }
-@inproceedings{DBLP:conf/emnlp/DomhanH17,
-  author    = {Tobias Domhan and
-               Felix Hieber},
+
+@inproceedings{Dou2018ExploitingDR,
+   author    = {Zi-Yi Dou and
+               Zhaopeng Tu and
+               Xing Wang and
+               Shuming Shi and
+               Tong Zhang},
+  title     = {Exploiting Deep Representations for Neural Machine Translation},
+  pages     = {4253--4262},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:journals/corr/LinFSYXZB17,
+  author    = {Zhouhan Lin and
+               Minwei Feng and
+               C{\'{\i}}cero Nogueira dos Santos and
+               Mo Yu and
+               Bing Xiang and
+               Bowen Zhou and
+               Yoshua Bengio},
+  title     = {A Structured Self-Attentive Sentence Embedding},
+  publisher = {International Conference on Learning Representations},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/nips/SrivastavaGS15,
+  author    = {Rupesh Kumar Srivastava and
+               Klaus Greff and
+               J{\"{u}}rgen Schmidhuber},
+  title     = {Training Very Deep Networks},
+  publisher = {Conference on Neural Information Processing Systems},
+  pages     = {2377--2385},
+  year      = {2015}
+}
+
+@inproceedings{DBLP:conf/icml/BalduzziFLLMM17,
+  author    = {David Balduzzi and
+               Marcus Frean and
+               Lennox Leary and
+               J. P. Lewis and
+               Kurt Wan{-}Duo Ma and
+               Brian McWilliams},
+  title     = {The Shattered Gradients Problem: If resnets are the answer, then what
+               is the question?},
+  publisher = {International Conference on Machine Learning},
+  volume    = {70},
+  pages     = {342--350},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/icml/Allen-ZhuLS19,
+  author    = {Zeyuan Allen{-}Zhu and
+               Yuanzhi Li and
+               Zhao Song},
+  title     = {A Convergence Theory for Deep Learning via Over-Parameterization},
+  publisher = {International Conference on Machine Learning},
+  volume    = {97},
+  pages     = {242--252},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/icml/DuLL0Z19,
+  author    = {Simon S. Du and
+               Jason D. Lee and
+               Haochuan Li and
+               Liwei Wang and
+               Xiyu Zhai},
+  title     = {Gradient Descent Finds Global Minima of Deep Neural Networks},
+  publisher = {International Conference on Machine Learning},
+  volume    = {97},
+  pages     = {1675--1685},
+  year      = {2019}
+}
+
+@inproceedings{pmlr-v9-glorot10a,
+  author    = {Xavier Glorot and
+               Yoshua Bengio},
+  title     = {Understanding the difficulty of training deep feedforward neural networks},
+  publisher = {International Conference on Artificial Intelligence and Statistics},
+  volume    = {9},
+  pages     = {249--256},
+  year      = {2010}
+}
+
+@inproceedings{DBLP:conf/iccv/HeZRS15,
+  author    = {Kaiming He and
+               Xiangyu Zhang and
+               Shaoqing Ren and
+               Jian Sun},
+  title     = {Delving Deep into Rectifiers: Surpassing Human-Level Performance on
+               ImageNet Classification},
+  pages     = {1026--1034},
+  publisher = {IEEE International Conference on Computer Vision},
+  year      = {2015}
+}
+
+@inproceedings{huang2020improving,
+	title={Improving Transformer Optimization Through Better Initialization},
+	author={Xiao Shi Huang and Juan Perez and Jimmy Ba and Maksims Volkovs},
+  publisher = {International Conference on Machine Learning},
+	year={2020}
+}
+
+@inproceedings{DBLP:conf/iclr/ZophL17,
+  author    = {Barret Zoph and
+               Quoc V. Le},
+  title     = {Neural Architecture Search with Reinforcement Learning},
+  publisher = {International Conference on Learning Representations},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/cvpr/ZophVSL18,
+  author    = {Barret Zoph and
+               Vijay Vasudevan and
+               Jonathon Shlens and
+               Quoc V. Le},
+  title     = {Learning Transferable Architectures for Scalable Image Recognition},
+  pages     = {8697--8710},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2018}
+}
+
+@inproceedings{Real2019AgingEF,
+  title={Aging Evolution for Image Classifier Architecture Search},
+  author={Esteban Real and Alok Aggarwal and Yanping Huang and Quoc V. Le},
+  publisher={AAAI Conference on Artificial Intelligence},
+  year={2019}
+}
+
+@inproceedings{DBLP:conf/icml/SoLL19,
+  author    = {David R. So and
+               Quoc V. Le and
+               Chen Liang},
+  title     = {The Evolved Transformer},
+  volume    = {97},
+  pages     = {5877--5886},
+  publisher = {International Conference on Machine Learning},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/icga/MillerTH89,
+  author    = {Geoffrey F. Miller and
+               Peter M. Todd and
+               Shailesh U. Hegde},
+  title     = {Designing Neural Networks using Genetic Algorithms},
+  pages     = {379--384},
+  publisher = {International Conference on Genetic Algorithms},
+  year      = {1989}
+}
+
+@inproceedings{mandischer1993representation,
+  title={Representation and evolution of neural networks},
+  author={Mandischer, Martin},
+  publisher={Artificial Neural Nets and Genetic Algorithms},
+  pages={643--649},
+  year={1993}
+}
+
+@inproceedings{koza1991genetic,
+  title={Genetic generation of both the weights and architecture for a neural network},
+  author={Koza, John R and Rice, James P},
+  publisher={International Joint Conference on Neural Networks},
+  volume={2},
+  pages={397--404},
+  year={1991}
+}
+
+@inproceedings{DBLP:conf/ijcnn/Dodd90,
+  author    = {N. Dodd},
+  title     = {Optimisation of network structure using genetic techniques},
+  publisher = {International Joint Conference on Neural Networks, San
+               Diego, CA, USA, June 17-21, 1990},
+  pages     = {965--970},
+  year      = {1990}
+}
+
+@inproceedings{DBLP:conf/nips/HarpSG89,
+  author    = {Steven A. Harp and
+               Tariq Samad and
+               Aloke Guha},
+  title     = {Designing Application-Specific Neural Networks Using the Genetic Algorithm},
+  publisher = {Advances in Neural Information Processing Systems},
+  pages     = {447--454},
+  year      = {1989}
+}
+
+@article{DBLP:journals/compsys/Kitano90,
+  author    = {Hiroaki Kitano},
+  title     = {Designing Neural Networks Using Genetic Algorithms with Graph Generation
+               System},
+  journal   = {Complex Systems},
+  volume    = {4},
+  number    = {4},
+  year      = {1990}
+}
+
+@inproceedings{DBLP:conf/icec/SantosD94,
+  author    = {Jos{\'{e}} Santos Reyes and
+               Richard J. Duro},
+  title     = {Evolutionary Generation and Training of Recurrent Artificial Neural
+               Networks},
+  pages     = {759--763},
+  publisher = {IEEE Conference on Evolutionary Computation},
+  year      = {1994}
+}
+
+@inproceedings{DBLP:conf/nips/LuoTQCL18,
+  author    = {Renqian Luo and
+               Fei Tian and
+               Tao Qin and
+               Enhong Chen and
+               Tie{-}Yan Liu},
+  title     = {Neural Architecture Optimization},
+  publisher = {Advances in Neural Information Processing Systems},
+  pages     = {7827--7838},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/icml/PhamGZLD18,
+  author    = {Hieu Pham and
+               Melody Y. Guan and
+               Barret Zoph and
+               Quoc V. Le and
+               Jeff Dean},
+  title     = {Efficient Neural Architecture Search via Parameter Sharing},
+  volume    = {80},
+  pages     = {4092--4101},
+  publisher = {International Conference on Machine Learning},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/iclr/LiuSY19,
+  author    = {Hanxiao Liu and
+               Karen Simonyan and
+               Yiming Yang},
+  title     = {{DARTS:} Differentiable Architecture Search},
+  publisher = {International Conference on Learning Representations},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/acl/LiHZXJXZLL20,
+  author    = {Yinqiao Li and
+               Chi Hu and
+               Yuhao Zhang and
+               Nuo Xu and
+               Yufan Jiang and
+               Tong Xiao and
+               Jingbo Zhu and
+               Tongran Liu and
+               Changliang Li},
+  title     = {Learning Architectures from an Extended Search Space for Language
+               Modeling},
+  pages     = {6629--6639},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/emnlp/JiangHXZZ19,
+  author    = {Yufan Jiang and
+               Chi Hu and
+               Tong Xiao and
+               Chunliang Zhang and
+               Jingbo Zhu},
+  title     = {Improved Differentiable Architecture Search for Language Modeling
+               and Named Entity Recognition},
+  pages     = {3583--3588},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/aaai/RealAHL19,
+  author    = {Esteban Real and
+               Alok Aggarwal and
+               Yanping Huang and
+               Quoc V. Le},
+  title     = {Regularized Evolution for Image Classifier Architecture Search},
+  pages     = {4780--4789},
+  publisher = {AAAI Conference on Artificial Intelligence},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/icml/RealMSSSTLK17,
+  author    = {Esteban Real and
+               Sherry Moore and
+               Andrew Selle and
+               Saurabh Saxena and
+               Yutaka Leon Suematsu and
+               Jie Tan and
+               Quoc V. Le and
+               Alexey Kurakin},
+  title     = {Large-Scale Evolution of Image Classifiers},
+  volume    = {70},
+  pages     = {2902--2911},
+  publisher = {International Conference on Machine Learning},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/iclr/ElskenMH19,
+  author    = {Thomas Elsken and
+               Jan Hendrik Metzen and
+               Frank Hutter},
+  title     = {Efficient Multi-Objective Neural Architecture Search via Lamarckian
+               Evolution},
+  publisher = {International Conference on Learning Representations},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/iclr/BakerGNR17,
+  author    = {Bowen Baker and
+               Otkrist Gupta and
+               Nikhil Naik and
+               Ramesh Raskar},
+  title     = {Designing Neural Network Architectures using Reinforcement Learning},
+  publisher = {International Conference on Learning Representations},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/cvpr/TanCPVSHL19,
+  author    = {Mingxing Tan and
+               Bo Chen and
+               Ruoming Pang and
+               Vijay Vasudevan and
+               Mark Sandler and
+               Andrew Howard and
+               Quoc V. Le},
+  title     = {MnasNet: Platform-Aware Neural Architecture Search for Mobile},
+  pages     = {2820--2828},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/iclr/LiuSVFK18,
+  author    = {Hanxiao Liu and
+               Karen Simonyan and
+               Oriol Vinyals and
+               Chrisantha Fernando and
+               Koray Kavukcuoglu},
+  title     = {Hierarchical Representations for Efficient Architecture Search},
+  publisher = {International Conference on Learning Representations},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/iclr/CaiZH19,
+  author    = {Han Cai and
+               Ligeng Zhu and
+               Song Han},
+  title     = {ProxylessNAS: Direct Neural Architecture Search on Target Task and
+               Hardware},
+  publisher = {International Conference on Learning Representations},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/cvpr/LiuCSAHY019,
+  author    = {Chenxi Liu and
+               Liang{-}Chieh Chen and
+               Florian Schroff and
+               Hartwig Adam and
+               Wei Hua and
+               Alan L. Yuille and
+               Li Fei{-}Fei},
+  title     = {Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic
+               Image Segmentation},
+  pages     = {82--92},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/cvpr/WuDZWSWTVJK19,
+  author    = {Bichen Wu and
+               Xiaoliang Dai and
+               Peizhao Zhang and
+               Yanghan Wang and
+               Fei Sun and
+               Yiming Wu and
+               Yuandong Tian and
+               Peter Vajda and
+               Yangqing Jia and
+               Kurt Keutzer},
+  title     = {FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable
+               Neural Architecture Search},
+  pages     = {10734--10742},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/iclr/XieZLL19,
+  author    = {Sirui Xie and
+               Hehui Zheng and
+               Chunxiao Liu and
+               Liang Lin},
+  title     = {{SNAS:} stochastic neural architecture search},
+  publisher = {International Conference on Learning Representations},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/uai/LiT19,
+  author    = {Liam Li and
+               Ameet Talwalkar},
+  title     = {Random Search and Reproducibility for Neural Architecture Search},
+  pages     = {129},
+  publisher = {Conference on Uncertainty in Artificial Intelligence},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/cvpr/DongY19,
+  author    = {Xuanyi Dong and
+               Yi Yang},
+  title     = {Searching for a Robust Neural Architecture in Four {GPU} Hours},
+  pages     = {1761--1770},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/iclr/XuX0CQ0X20,
+  author    = {Yuhui Xu and
+               Lingxi Xie and
+               Xiaopeng Zhang and
+               Xin Chen and
+               Guo{-}Jun Qi and
+               Qi Tian and
+               Hongkai Xiong},
+  title     = {{PC-DARTS:} Partial Channel Connections for Memory-Efficient Architecture
+               Search},
+  publisher = {International Conference on Learning Representations},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/iclr/ZelaESMBH20,
+  author    = {Arber Zela and
+               Thomas Elsken and
+               Tonmoy Saikia and
+               Yassine Marrakchi and
+               Thomas Brox and
+               Frank Hutter},
+  title     = {Understanding and Robustifying Differentiable Architecture Search},
+  publisher = {International Conference on Learning Representations},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/iclr/MeiLLJYYY20,
+  author    = {Jieru Mei and
+               Yingwei Li and
+               Xiaochen Lian and
+               Xiaojie Jin and
+               Linjie Yang and
+               Alan L. Yuille and
+               Jianchao Yang},
+  title     = {AtomNAS: Fine-Grained End-to-End Neural Architecture Search},
+  publisher = {International Conference on Learning Representations},
+  year      = {2020}
+}
+
+@article{DBLP:journals/jmlr/LiJDRT17,
+  author    = {Lisha Li and
+               Kevin G. Jamieson and
+               Giulia DeSalvo and
+               Afshin Rostamizadeh and
+               Ameet Talwalkar},
+  title     = {Hyperband: {A} Novel Bandit-Based Approach to Hyperparameter Optimization},
+  journal   = {Journal of Machine Learning Research},
+  volume    = {18},
+  pages     = {185:1--185:52},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/eccv/LiuZNSHLFYHM18,
+  author    = {Chenxi Liu and
+               Barret Zoph and
+               Maxim Neumann and
+               Jonathon Shlens and
+               Wei Hua and
+               Li{-}Jia Li and
+               Li Fei{-}Fei and
+               Alan L. Yuille and
+               Jonathan Huang and
+               Kevin Murphy},
+  title     = {Progressive Neural Architecture Search},
+  volume    = {11205},
+  pages     = {19--35},
+  publisher = {European Conference on Computer Vision},
+  year      = {2018}
+}
+
+@article{DBLP:journals/taslp/FanTXQLL20,
+  author    = {Yang Fan and
+               Fei Tian and
+               Yingce Xia and
+               Tao Qin and
+               Xiang{-}Yang Li and
+               Tie{-}Yan Liu},
+  title     = {Searching Better Architectures for Neural Machine Translation},
+  journal   = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
+  volume    = {28},
+  pages     = {1574--1585},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/ijcai/ChenLQWLDDHLZ20,
+  author    = {Daoyuan Chen and
+               Yaliang Li and
+               Minghui Qiu and
+               Zhen Wang and
+               Bofang Li and
+               Bolin Ding and
+               Hongbo Deng and
+               Jun Huang and
+               Wei Lin and
+               Jingren Zhou},
+  title     = {AdaBERT: Task-Adaptive {BERT} Compression with Differentiable Neural
+               Architecture Search},
+  publisher = {International Joint Conference on Artificial Intelligence},
+  pages     = {2463--2469},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/acl/WangWLCZGH20,
+  author    = {Hanrui Wang and
+               Zhanghao Wu and
+               Zhijian Liu and
+               Han Cai and
+               Ligeng Zhu and
+               Chuang Gan and
+               Song Han},
+  title     = {{HAT:} Hardware-Aware Transformers for Efficient Natural Language
+               Processing},
+  pages     = {7675--7688},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/icml/CaiYZHY18,
+  author    = {Han Cai and
+               Jiacheng Yang and
+               Weinan Zhang and
+               Song Han and
+               Yong Yu},
+  title     = {Path-Level Network Transformation for Efficient Architecture Search},
+  volume    = {80},
+  pages     = {677--686},
+  publisher = {International Conference on Machine Learning},
+  year      = {2018}
+}
+
+@article{DBLP:journals/corr/abs-2003-03384,
+  author    = {Esteban Real and
+               Chen Liang and
+               David R. So and
+               Quoc V. Le},
+  title     = {AutoML-Zero: Evolving Machine Learning Algorithms From Scratch},
+  journal   = {CoRR},
+  volume    = {abs/2003.03384},
+  year      = {2020}
+}
+
+@article{Chollet2017XceptionDL,
+  title={Xception: Deep Learning with Depthwise Separable Convolutions},
+  author    = {Fran{\c{c}}ois Chollet},
+  journal={IEEE Conference on Computer Vision and Pattern Recognition},
+  year={2017},
+  pages={1800--1807}
+}
+
+@article{DBLP:journals/tnn/AngelineSP94,
+  author    = {Peter J. Angeline and
+               Gregory M. Saunders and
+               Jordan B. Pollack},
+  title     = {An evolutionary algorithm that constructs recurrent neural networks},
+  journal   = {IEEE Transactions on Neural Networks},
+  volume    = {5},
+  number    = {1},
+  pages     = {54--65},
+  year      = {1994}
+}
+
+@article{stanley2002evolving,
+  title={Evolving neural networks through augmenting topologies},
+  author={Stanley, Kenneth O and Miikkulainen, Risto},
+  journal={Evolutionary computation},
+  volume={10},
+  number={2},
+  pages={99--127},
+  year={2002},
+  publisher={MIT Press}
+}
+
+@article{DBLP:journals/alife/StanleyDG09,
+  author    = {Kenneth O. Stanley and
+               David B. D'Ambrosio and
+               Jason Gauci},
+  title     = {A Hypercube-Based Encoding for Evolving Large-Scale Neural Networks},
+  journal   = {Artificial Life},
+  volume    = {15},
+  number    = {2},
+  pages     = {185--212},
+  year      = {2009},
+  publisher = {MIT Press}
+}
+
+@inproceedings{DBLP:conf/ijcai/SuganumaSN18,
+  author    = {Masanori Suganuma and
+               Shinichi Shirakawa and
+               Tomoharu Nagao},
+  title     = {A Genetic Programming Approach to Designing Convolutional Neural Network
+               Architectures},
+  pages     = {5369--5373},
+  publisher = {International Joint Conference on Artificial Intelligence},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/iccv/XieY17,
+  author    = {Lingxi Xie and
+               Alan L. Yuille},
+  title     = {Genetic {CNN}},
+  pages     = {1388--1397},
+  publisher = {IEEE International Conference on Computer Vision},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/cvpr/ZhongYWSL18,
+  author    = {Zhao Zhong and
+               Junjie Yan and
+               Wei Wu and
+               Jing Shao and
+               Cheng{-}Lin Liu},
+  title     = {Practical Block-Wise Neural Network Architecture Generation},
+  pages     = {2423--2432},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/icml/BergstraYC13,
+  author    = {James Bergstra and
+               Daniel Yamins and
+               David D. Cox},
+  title     = {Making a Science of Model Search: Hyperparameter Optimization in Hundreds
+               of Dimensions for Vision Architectures},
+  volume    = {28},
+  pages     = {115--123},
+  publisher = {International Conference on Machine Learning},
+  year      = {2013}
+}
+
+@inproceedings{DBLP:conf/ijcai/DomhanSH15,
+  author    = {Tobias Domhan and
+               Jost Tobias Springenberg and
+               Frank Hutter},
+  title     = {Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks
+               by Extrapolation of Learning Curves},
+  pages     = {3460--3468},
+  publisher = {International Joint Conference on Artificial Intelligence},
+  year      = {2015}
+}
+
+@inproceedings{DBLP:conf/icml/MendozaKFSH16,
+  author    = {Hector Mendoza and
+               Aaron Klein and
+               Matthias Feurer and
+               Jost Tobias Springenberg and
+               Frank Hutter},
+  title     = {Towards Automatically-Tuned Neural Networks},
+  volume    = {64},
+  pages     = {58--65},
+  publisher = {International Conference on Machine Learning},
+  year      = {2016}
+}
+
+@article{DBLP:journals/corr/abs-1807-06906,
+  author    = {Arber Zela and
+               Aaron Klein and
+               Stefan Falkner and
+               Frank Hutter},
+  title     = {Towards Automated Deep Learning: Efficient Joint Neural Architecture
+               and Hyperparameter Search},
+  journal   = {International Conference on Machine Learning},
+  year      = {2018}
+}
+
+@article{li2020automated,
+  title={Automated and Lightweight Network Design via Random Search for Remote Sensing Image Scene Classification},
+  author={Li, Jihao and Diao, Wenhui and Sun, Xian and Feng, Yingchao and Zhang, Wenkai and Chang, Zhonghan and Fu, Kun},
+  journal={The International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences},
+  volume={43},
+  pages={1217--1224},
+  year={2020}
+}
+
+@inproceedings{DBLP:conf/cvpr/BenderLCCCKL20,
+  author    = {Gabriel Bender and
+               Hanxiao Liu and
+               Bo Chen and
+               Grace Chu and
+               Shuyang Cheng and
+               Pieter{-}Jan Kindermans and
+               Quoc V. Le},
+  title     = {Can Weight Sharing Outperform Random Architecture Search? An Investigation
+               With TuNAS},
+  pages     = {14311--14320},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/aistats/KleinFBHH17,
+  author    = {Aaron Klein and
+               Stefan Falkner and
+               Simon Bartels and
+               Philipp Hennig and
+               Frank Hutter},
+  title     = {Fast Bayesian Optimization of Machine Learning Hyperparameters on
+               Large Datasets},
+  volume    = {54},
+  pages     = {528--536},
+  publisher = {International Conference on Artificial Intelligence and Statistics},
+  year      = {2017}
+}
+
+@article{DBLP:journals/corr/ChrabaszczLH17,
+  author    = {Patryk Chrabaszcz and
+               Ilya Loshchilov and
+               Frank Hutter},
+  title     = {A Downsampled Variant of ImageNet as an Alternative to the {CIFAR}
+               datasets},
+  journal   = {CoRR},
+  volume    = {abs/1707.08819},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/aaai/CaiCZYW18,
+  author    = {Han Cai and
+               Tianyao Chen and
+               Weinan Zhang and
+               Yong Yu and
+               Jun Wang},
+  title     = {Efficient Architecture Search by Network Transformation},
+  pages     = {2787--2794},
+  publisher = {AAAI Conference on Artificial Intelligence},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/iclr/ElskenMH18,
+  author    = {Thomas Elsken and
+               Jan Hendrik Metzen and
+               Frank Hutter},
+  title     = {Simple and efficient architecture search for Convolutional Neural
+               Networks},
+  publisher = {International Conference on Learning Representations},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/icml/BenderKZVL18,
+  author    = {Gabriel Bender and
+               Pieter{-}Jan Kindermans and
+               Barret Zoph and
+               Vijay Vasudevan and
+               Quoc V. Le},
+  title     = {Understanding and Simplifying One-Shot Architecture Search},
+  volume    = {80},
+  pages     = {549--558},
+  publisher = {International Conference on Machine Learning},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/nips/SaxenaV16,
+  author    = {Shreyas Saxena and
+               Jakob Verbeek},
+  title     = {Convolutional Neural Fabrics},
+  publisher = {Advances in Neural Information Processing Systems},
+  pages     = {4053--4061},
+  year      = {2016}
+}
+
+@inproceedings{DBLP:conf/iclr/KleinFSH17,
+  author    = {Aaron Klein and
+               Stefan Falkner and
+               Jost Tobias Springenberg and
+               Frank Hutter},
+  title     = {Learning Curve Prediction with Bayesian Neural Networks},
+  publisher = {International Conference on Learning Representations},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/iclr/BakerGRN18,
+  author    = {Bowen Baker and
+               Otkrist Gupta and
+               Ramesh Raskar and
+               Nikhil Naik},
+  title     = {Accelerating Neural Architecture Search using Performance Prediction},
+  publisher = {International Conference on Learning Representations},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/wmt/XiaTTGHCFGLLWWZ19,
+  author    = {Yingce Xia and
+               Xu Tan and
+               Fei Tian and
+               Fei Gao and
+               Di He and
+               Weicong Chen and
+               Yang Fan and
+               Linyuan Gong and
+               Yichong Leng and
+               Renqian Luo and
+               Yiren Wang and
+               Lijun Wu and
+               Jinhua Zhu and
+               Tao Qin and
+               Tie{-}Yan Liu},
+  title     = {Microsoft Research Asia's Systems for {WMT19}},
+  pages     = {424--433},
+  publisher = {Conference on Machine Translation},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/iclr/RamachandranZL18,
+  author    = {Prajit Ramachandran and
+               Barret Zoph and
+               Quoc V. Le},
+  title     = {Searching for Activation Functions},
+  publisher = {International Conference on Learning Representations},
+  year      = {2018}
+}
+
+@article{DBLP:journals/corr/abs-2009-02070,
+  author    = {Wei Zhu and
+               Xiaoling Wang and
+               Xipeng Qiu and
+               Yuan Ni and
+               Guotong Xie},
+  title     = {AutoTrans: Automating Transformer Design via Reinforced Architecture
+               Search},
+  journal   = {CoRR},
+  volume    = {abs/2009.02070},
+  year      = {2020}
+}
+
+@article{DBLP:journals/corr/abs-2008-06808,
+  author    = {Henry Tsai and
+               Jayden Ooi and
+               Chun{-}Sung Ferng and
+               Hyung Won Chung and
+               Jason Riesa},
+  title     = {Finding Fast Transformers: One-Shot Neural Architecture Search by
+               Component Composition},
+  journal   = {CoRR},
+  volume    = {abs/2008.06808},
+  year      = {2020}
+}
+
+@inproceedings{Wang2019ExploitingSC,
+  title={Exploiting Sentential Context for Neural Machine Translation},
+  author={Xing Wang and Zhaopeng Tu and Longyue Wang and Shuming Shi},
+  publisher={Annual Meeting of the Association for Computational Linguistics},
+  year={2019}
+}
+
+@inproceedings{Wei2020MultiscaleCD,
+  title={Multiscale Collaborative Deep Models for Neural Machine Translation},
+  author={Xiangpeng Wei and Heng Yu and Yue Hu and Yue Zhang and Rongxiang Weng and Weihua Luo},
+  publisher={Annual Meeting of the Association for Computational Linguistics},
+  year={2020}
+}
+
+@article{li2020shallow,
+  title={Shallow-to-Deep Training for Neural Machine Translation},
+  author={Li, Bei and Wang, Ziyang and Liu, Hui and Jiang, Yufan and Du, Quan and Xiao, Tong and Wang, Huizhen and Zhu, Jingbo},
+  journal={Conference on Empirical Methods in Natural Language Processing},
+  year={2020}
+}
+
+@article{DBLP:journals/corr/abs-2007-06257,
+  author    = {Hongfei Xu and
+               Qiuhui Liu and
+               Deyi Xiong and
+               Josef van Genabith},
+  title     = {Transformer with Depth-Wise {LSTM}},
+  journal   = {CoRR},
+  volume    = {abs/2007.06257},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/acl/XuLGXZ20,
+  author    = {Hongfei Xu and
+               Qiuhui Liu and
+               Josef van Genabith and
+               Deyi Xiong and
+               Jingyi Zhang},
+  title     = {Lipschitz Constrained Parameter Initialization for Deep Transformers},
+  pages     = {397--402},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2020}
+}
+
+@article{DBLP:journals/corr/abs-2006-10369,
+  author    = {Jungo Kasai and
+               Nikolaos Pappas and
+               Hao Peng and
+               James Cross and
+               Noah A. Smith},
+  title     = {Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff
+               in Machine Translation},
+  journal   = {CoRR},
+  volume    = {abs/2006.10369},
+  year      = {2020}
+}
+
+
+%%%%% chapter 15------------------------------------------------------
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%% chapter 16------------------------------------------------------
+@inproceedings{DBLP:conf/wmt/CurreyBH17,
+  author    = {Anna Currey and
+               Antonio Valerio Miceli Barone and
+               Kenneth Heafield},
+  title     = {Copied Monolingual Data Improves Low-Resource Neural Machine Translation},
+  pages     = {148--156},
+  publisher = {Conference on Machine Translation},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/emnlp/EdunovOAG18,
+  author    = {Sergey Edunov and
+               Myle Ott and
+               Michael Auli and
+               David Grangier},
+  title     = {Understanding Back-Translation at Scale},
+  pages     = {489--500},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2018}
+}
+@inproceedings{DBLP:conf/emnlp/FadaeeM18,
+  author    = {Marzieh Fadaee and
+               Christof Monz},
+  title     = {Back-Translation Sampling by Targeting Difficult Words in Neural Machine
+               Translation},
+  pages     = {436--446},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2018}
+}
+@inproceedings{DBLP:conf/nlpcc/XuLXLLXZ19,
+  author    = {Nuo Xu and
+               Yinqiao Li and
+               Chen Xu and
+               Yanyang Li and
+               Bei Li and
+               Tong Xiao and
+               Jingbo Zhu},
+  title     = {Analysis of Back-Translation Methods for Low-Resource Neural Machine
+               Translation},
+  volume    = {11839},
+  pages     = {466--475},
+  publisher = {Springer},
+  year      = {2019}
+}
+@inproceedings{DBLP:conf/wmt/CaswellCG19,
+  author    = {Isaac Caswell and
+               Ciprian Chelba and
+               David Grangier},
+  title     = {Tagged Back-Translation},
+  pages     = {53--63},
+  publisher = {Conference on Machine Translation},
+  year      = {2019}
+}
+@inproceedings{DBLP:conf/emnlp/WangLWLS19,
+  author    = {Shuo Wang and
+               Yang Liu and
+               Chao Wang and
+               Huanbo Luan and
+               Maosong Sun},
+  title     = {Improving Back-Translation with Uncertainty-based Confidence Estimation},
+  pages     = {791--802},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2019}
+}
+@article{DBLP:journals/corr/abs200111327,
+  author    = {Idris Abdulmumin and
+               Bashir Shehu Galadanci and
+               Abubakar Isa},
+  title     = {Iterative Batch Back-Translation for Neural Machine Translation: {A}
+               Conceptual Model},
+  journal   = {CoRR},
+  year      = {2020}
+}
+@article{DBLP:journals/corr/abs200403672,
+  author    = {Zi-Yi Dou and
+               Antonios Anastasopoulos and
+               Graham Neubig},
+  title     = {Dynamic Data Selection and Weighting for Iterative Back-Translation},
+  journal   = {CoRR},
+  year      = {2020}
+}
+@inproceedings{DBLP:conf/emnlp/WuZHGQLL19,
+  author    = {Lijun Wu and
+               Jinhua Zhu and
+               Di He and
+               Fei Gao and
+               Tao Qin and
+               Jianhuang Lai and
+               Tie-Yan Liu},
+  title     = {Machine Translation With Weakly Paired Documents},
+  pages     = {4374--4383},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2019}
+}
+@article{DBLP:journals/corr/abs-1901-09069,
+  author    = {Felipe Almeida and
+               Geraldo Xex{\'{e}}o},
+  title     = {Word Embeddings: {A} Survey},
+  journal   = {CoRR},
+  year      = {2019}
+}
+@article{DBLP:journals/corr/abs-2002-06823,
+  author    = {Jinhua Zhu and
+               Yingce Xia and
+               Lijun Wu and
+               Di He and
+               Tao Qin and
+               Wengang Zhou and
+               Houqiang Li and
+               Tie-Yan Liu},
+  title     = {Incorporating {BERT} into Neural Machine Translation},
+  journal   = {CoRR},
+  year      = {2020}
+}
+@inproceedings{song2019mass,
+  author    = {Kaitao Song and
+               Xu Tan and
+               Tao Qin and
+               Jianfeng Lu and
+               Tie-Yan Liu},
+  title     = {{MASS:} Masked Sequence to Sequence Pre-training for Language Generation},
+  volume    = {97},
+  pages     = {5926--5936},
+  publisher = {International Conference on Machine Learning},
+  year      = {2019}
+}
+@article{DBLP:journals/corr/Ruder17a,
+  author    = {Sebastian Ruder},
+  title     = {An Overview of Multi-Task Learning in Deep Neural Networks},
+  journal   = {CoRR},
+  volume    = {abs/1706.05098},
+  year      = {2017}
+}
+@inproceedings{DBLP:conf/emnlp/DomhanH17,
+  author    = {Tobias Domhan and
+               Felix Hieber},
  title     = {Using Target-side Monolingual Data for Neural Machine Translation
               through Multi-task Learning},
  pages     = {1500--1505},
@@ -7968,7 +8985,7 @@ author    = {Zhuang Liu and
               Trevor Cohn},
  title     = {Iterative Back-Translation for Neural Machine Translation},
  pages     = {18--24},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2018}
 }
 @inproceedings{DBLP:conf/icml/OttAGR18,
@@ -7988,7 +9005,7 @@ author    = {Zhuang Liu and
               Christof Monz},
  title     = {Data Augmentation for Low-Resource Neural Machine Translation},
  pages     = {567--573},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }
 @inproceedings{finding2006adafre,
@@ -8136,7 +9153,7 @@ author    = {Zhuang Liu and
  author    = {Ivan Vulic and
               Anna Korhonen},
  title     = {On the Role of Seed Lexicons in Learning Bilingual Word Embeddings},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2016}
 }
 @inproceedings{DBLP:conf/iclr/SmithTHH17,
@@ -8155,7 +9172,7 @@ author    = {Zhuang Liu and
               Eneko Agirre},
  title     = {Learning bilingual word embeddings with (almost) no bilingual data},
  pages     = {451--462},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }
 @article{1966ASchnemann,
@@ -8185,7 +9202,7 @@ author    = {Zhuang Liu and
               Maosong Sun},
  title     = {Adversarial Training for Unsupervised Bilingual Lexicon Induction},
  pages     = {1959--1970},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }
 @inproceedings{DBLP:conf/emnlp/XuYOW18,
@@ -8195,7 +9212,7 @@ author    = {Zhuang Liu and
               Yuexin Wu},
  title     = {Unsupervised Cross-lingual Transfer of Word Embedding Spaces},
  pages     = {2465--2474},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2018}
 }
 @inproceedings{DBLP:conf/emnlp/Alvarez-MelisJ18,
@@ -8203,7 +9220,7 @@ author    = {Zhuang Liu and
               Tommi S. Jaakkola},
  title     = {Gromov-Wasserstein Alignment of Word Embedding Spaces},
  pages     = {1881--1890},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2018}
 }
 @inproceedings{DBLP:conf/lrec/GarneauGBDL20,
@@ -8227,7 +9244,7 @@ author    = {Zhuang Liu and
  title     = {Normalized Word Embedding and Orthogonal Transform for Bilingual Word
               Translation},
  pages     = {1006--1011},
-  publisher = {The Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2015}
 }
 @inproceedings{DBLP:conf/iclr/SmithTHH17,
@@ -8247,7 +9264,7 @@ author    = {Zhuang Liu and
               Anna Korhonen},
  title     = {Do We Really Need Fully Unsupervised Cross-Lingual Embeddings?},
  pages     = {4406--4417},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
 @inproceedings{DBLP:conf/acl/SogaardVR18,
@@ -8256,7 +9273,7 @@ author    = {Zhuang Liu and
               Ivan Vulic},
  title     = {On the Limitations of Unsupervised Bilingual Dictionary Induction},
  pages     = {778--788},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2018}
 }
 @article{DBLP:journals/talip/MarieF20,
@@ -8276,7 +9293,7 @@ author    = {Zhuang Liu and
               Eneko Agirre},
  title     = {An Effective Approach to Unsupervised Machine Translation},
  pages     = {194--203},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
 @inproceedings{DBLP:conf/acl/PourdamghaniAGK19,
@@ -8288,7 +9305,7 @@ author    = {Zhuang Liu and
  title     = {Translating Translationese: {A} Two-Step Approach to Unsupervised
               Machine Translation},
  pages     = {3057--3062},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
 @inproceedings{DBLP:conf/iclr/LampleCDR18,
@@ -8745,7 +9762,7 @@ author    = {Zhuang Liu and
  title     = {Pivot-based Transfer Learning for Neural Machine Translation between
               Non-English Languages},
  pages     = {866--876},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
 @inproceedings{DBLP:conf/acl/ChenLCL17,
@@ -8755,7 +9772,7 @@ author    = {Zhuang Liu and
               Victor O. K. Li},
  title     = {A Teacher-Student Framework for Zero-Resource Neural Machine Translation},
  pages     = {1925--1935},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }
 @article{DBLP:journals/mt/WuW07,
@@ -8782,7 +9799,7 @@ author    = {Zhuang Liu and
  title     = {Using Context Vectors in Improving a Machine Translation System with
               Bridge Language},
  pages     = {318--322},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2013}
 }
 @inproceedings{DBLP:conf/emnlp/ZhuHWZWZ14,
@@ -8806,7 +9823,7 @@ author    = {Zhuang Liu and
               Satoshi Nakamura},
  title     = {Improving Pivot Translation by Remembering the Pivot},
  pages     = {573--577},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2015}
 }
 @inproceedings{DBLP:conf/acl/CohnL07,
@@ -8814,7 +9831,7 @@ author    = {Zhuang Liu and
               Mirella Lapata},
  title     = {Machine Translation by Triangulation: Making Effective Use of Multi-Parallel
               Corpora},
-  publisher = {The Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2007}
 }
 @article{DBLP:journals/mt/WuW07,
@@ -8832,7 +9849,7 @@ author    = {Zhuang Liu and
               Haifeng Wang},
  title     = {Revisiting Pivot Language Approach for Machine Translation},
  pages     = {154--162},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2009}
 }
 @article{DBLP:journals/corr/ChengLYSX16,
@@ -8868,7 +9885,7 @@ author    = {Zhuang Liu and
  title     = {A Comparison of Pivot Methods for Phrase-Based Statistical Machine
               Translation},
  pages     = {484--491},
-  publisher = {The Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2007}
 }
 @inproceedings{DBLP:conf/ijcnlp/Costa-JussaHB11,
@@ -8877,7 +9894,7 @@ author    = {Zhuang Liu and
               Rafael E. Banchs},
  title     = {Enhancing scarce-resource language translation through pivot combinations},
  pages     = {1361--1365},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2011}
 }
 @article{DBLP:journals/corr/HintonVD15,
@@ -8902,7 +9919,7 @@ author    = {Zhuang Liu and
               Victor O. K. Li},
  title     = {Universal Neural Machine Translation for Extremely Low Resource Languages},
  pages     = {344--354},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2018}
 }
 @inproceedings{DBLP:conf/icml/FinnAL17,
@@ -8924,7 +9941,7 @@ author    = {Zhuang Liu and
               Haifeng Wang},
  title     = {Multi-Task Learning for Multiple Language Translation},
  pages     = {1723--1732},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2015}
 }
 @article{DBLP:journals/tacl/LeeCH17,
@@ -9022,7 +10039,7 @@ author    = {Zhuang Liu and
               Ondrej Bojar},
  title     = {Trivial Transfer Learning for Low-Resource Neural Machine Translation},
  pages     = {244--252},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2018}
 }
 @inproceedings{DBLP:conf/acl/ZhangWTS20,
@@ -9033,7 +10050,7 @@ author    = {Zhuang Liu and
  title     = {Improving Massively Multilingual Neural Machine Translation and Zero-Shot
               Translation},
  pages     = {1628--1639},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2020}
 }
 @inproceedings{DBLP:conf/naacl/PaulYSN09,
@@ -9044,7 +10061,7 @@ author    = {Zhuang Liu and
  title     = {On the Importance of Pivot Language Selection for Statistical Machine
               Translation},
  pages     = {221--224},
-  publisher = {The Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2009}
 }
 @article{dabre2019brief,
@@ -9069,7 +10086,7 @@ author    = {Zhuang Liu and
               Anna Korhonen},
  title     = {Do We Really Need Fully Unsupervised Cross-Lingual Embeddings?},
  pages     = {4406--4417},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
 @article{DBLP:journals/corr/MikolovLS13,
@@ -9098,7 +10115,7 @@ author    = {Zhuang Liu and
               Yuexin Wu},
  title     = {Unsupervised Cross-lingual Transfer of Word Embedding Spaces},
  pages     = {2465--2474},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2018}
 }
 @inproceedings{DBLP:conf/iclr/LampleCRDJ18,
@@ -9129,7 +10146,7 @@ author    = {Zhuang Liu and
  title     = {Revisiting Adversarial Autoencoder for Unsupervised Word Translation
               with Cycle Consistency and Improved Training},
  pages     = {3857--3867},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }

@@ -9210,7 +10227,7 @@ author    = {Zhuang Liu and
  title     = {Parameter Sharing Methods for Multilingual Self-Attentional Translation
               Models},
  pages     = {261--271},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2018}
 }
 @inproceedings{DBLP:conf/wmt/LuKLBZS18,
@@ -9222,7 +10239,7 @@ author    = {Zhuang Liu and
               Jason Sun},
  title     = {A neural interlingua for multilingual machine translation},
  pages     = {84--92},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2018}
 }
 @inproceedings{DBLP:conf/acl/WangZZZXZ19,
@@ -9234,7 +10251,7 @@ author    = {Zhuang Liu and
               Chengqing Zong},
  title     = {A Compact and Language-Sensitive Multilingual Translation Method},
  pages     = {1213--1223},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
 @inproceedings{DBLP:conf/iclr/WangPAN19,
@@ -9309,7 +10326,7 @@ author    = {Zhuang Liu and
  title     = {Improved Zero-shot Neural Machine Translation via Ignoring Spurious
               Correlations},
  pages     = {1258--1268},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
 @inproceedings{DBLP:conf/emnlp/FiratSAYC16,
@@ -9337,7 +10354,7 @@ author    = {Zhuang Liu and
               Christof Monz},
  title     = {Data Augmentation for Low-Resource Neural Machine Translation},
  pages     = {567--573},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }
 @inproceedings{DBLP:conf/emnlp/WangPDN18,
@@ -9377,7 +10394,7 @@ author    = {Zhuang Liu and
  title     = {Enhancement of Encoder and Attention Using Target Monolingual Corpora
               in Neural Machine Translation},
  pages     = {55--63},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2018}
 }
 @inproceedings{DBLP:conf/icml/VincentLBM08,
@@ -9440,7 +10457,7 @@ author    = {Zhuang Liu and
  title     = {Meteor++ 2.0: Adopt Syntactic Level Paraphrase Knowledge into Machine
               Translation Evaluation},
  pages     = {501--506},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
 @inproceedings{DBLP:conf/acl/ZhouSW19,
@@ -9449,7 +10466,7 @@ author    = {Zhuang Liu and
               Alexander H. Waibel},
  title     = {Paraphrases as Foreign Languages in Multilingual Neural Machine Translation},
  pages     = {113--122},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
 @inproceedings{DBLP:conf/eacl/LapataSM17,
@@ -9484,7 +10501,7 @@ author    = {Zhuang Liu and
  title     = {Extracting Parallel Sentences from Comparable Corpora using Document
               Level Alignment},
  pages     = {403--411},
-  publisher = {The Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2010}
 }
 @article{DBLP:journals/jair/RuderVS19,
@@ -9504,7 +10521,7 @@ author    = {Zhuang Liu and
               Xiaohua Liu and
               Hang Li},
  title     = {Modeling Coverage for Neural Machine Translation},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2016}
 }
 @article{DBLP:journals/tacl/TuLLLL17,
@@ -9531,7 +10548,7 @@ author    = {Zhuang Liu and
               Hongtao Yang},
  title     = {Sogou Neural Machine Translation Systems for {WMT17}},
  pages     = {410--415},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }
 @article{ng2019facebook,
@@ -9552,7 +10569,7 @@ author    = {Zhuang Liu and
               Jingbo Zhu},
  title     = {The NiuTrans Machine Translation System for {WMT18}},
  pages     = {528--534},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2018}
 }
 @inproceedings{DBLP:conf/wmt/LiLXLLLWZXWFCLL19,
@@ -9575,7 +10592,7 @@ author    = {Zhuang Liu and
               Jingbo Zhu},
  title     = {The NiuTrans Machine Translation Systems for {WMT19}},
  pages     = {257--266},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
 @inproceedings{DBLP:conf/nips/DaiL15,
@@ -9635,7 +10652,7 @@ author    = {Zhuang Liu and
               Russell Power},
  title     = {Semi-supervised sequence tagging with bidirectional language models},
  pages     = {1756--1765},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }
 @inproceedings{DBLP:conf/naacl/PetersNIGCLZ18,
@@ -9648,7 +10665,7 @@ author    = {Zhuang Liu and
               Luke Zettlemoyer},
  title     = {Deep Contextualized Word Representations},
  pages     = {2227--2237},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2018}
 }
 @inproceedings{DBLP:conf/naacl/PetersNIGCLZ18,
@@ -9661,7 +10678,7 @@ author    = {Zhuang Liu and
               Luke Zettlemoyer},
  title     = {Deep Contextualized Word Representations},
  pages     = {2227--2237},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2018}
 }
 @inproceedings{DBLP:conf/naacl/PetersNIGCLZ18,
@@ -9683,7 +10700,7 @@ author    = {Zhuang Liu and
               Vassilina Nikoulina},
  title     = {On the use of {BERT} for Neural Machine Translation},
  pages     = {108--117},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
 @inproceedings{DBLP:conf/emnlp/ImamuraS19,
@@ -9691,7 +10708,7 @@ author    = {Zhuang Liu and
               Eiichiro Sumita},
  title     = {Recycling a Pre-trained {BERT} Encoder for Neural Machine Translation},
  pages     = {23--31},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
 @inproceedings{DBLP:conf/naacl/EdunovBA19,
@@ -10555,3 +11572,429 @@ author    = {Zhuang Liu and

 %%%%% chapter 18------------------------------------------------------
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%% chapter appendix-A------------------------------------------------------
+@inproceedings{Tong2012NiuTrans,
+  author    = {Tong Xiao and
+               Jingbo Zhu and
+               Hao Zhang and
+               Qiang Li},
+  title     = {NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based
+               Machine Translation},
+  pages     = {19--24},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2012}
+}
+
+@inproceedings{Li2010Joshua,
+  author    = {Zhifei Li and
+               Chris Callison-Burch and
+               Chris Dyer and
+               Sanjeev Khudanpur and
+               Lane Schwartz and
+               Wren N. G. Thornton and
+               Jonathan Weese and
+               Omar Zaidan},
+  title     = {Joshua: An Open Source Toolkit for Parsing-Based Machine Translation},
+  pages     = {135--139},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2009}
+}
+
+@inproceedings{iglesias2009hierarchical,
+  author    = {Gonzalo Iglesias and
+               Adri{\`{a}} de Gispert and
+               Eduardo Rodr{\'{\i}}guez Banga and
+               William J. Byrne},
+  title     = {Hierarchical Phrase-Based Translation with Weighted Finite State Transducers},
+  pages     = {433--441},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2009}
+}
+
+@inproceedings{dyer2010cdec,
+  author    = {Chris Dyer and
+               Adam Lopez and
+               Juri Ganitkevitch and
+               Jonathan Weese and
+               Ferhan T{\"{u}}re and
+               Phil Blunsom and
+               Hendra Setiawan and
+               Vladimir Eidelman and
+               Philip Resnik},
+  title     = {cdec: {A} Decoder, Alignment, and Learning Framework for Finite-State
+               and Context-Free Translation Models},
+  pages     = {7--12},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2010}
+}
+
+@inproceedings{Cer2010Phrasal,
+  author    = {Daniel M. Cer and
+               Michel Galley and
+               Daniel Jurafsky and
+               Christopher D. Manning},
+  title     = {Phrasal: {A} Statistical Machine Translation Toolkit for Exploring
+               New Model Features},
+  pages     = {9--12},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2010}
+}
+
+@article{vilar2012jane,
+  title={Jane: an advanced freely available hierarchical machine translation toolkit},
+  author={Vilar, David and Stein, Daniel and Huck, Matthias and Ney, Hermann},
+  journal={Machine Translation},
+  volume={26},
+  number={3},
+  pages={197--216},
+  year={2012}
+}
+
+@inproceedings{DBLP:conf/naacl/DyerCS13,
+  author    = {Chris Dyer and
+               Victor Chahuneau and
+               Noah A. Smith},
+  title     = {A Simple, Fast, and Effective Reparameterization of {IBM} Model 2},
+  pages     = {644--648},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2013}
+}
+
+@article{al2016theano,
+  author    = {Rami Al-Rfou and
+               Guillaume Alain and
+               Amjad Almahairi and
+               Christof Angerm{\"{u}}ller and
+               Dzmitry Bahdanau and
+               Nicolas Ballas and
+               Fr{\'{e}}d{\'{e}}ric Bastien and
+               Justin Bayer and
+               Anatoly Belikov and
+               Alexander Belopolsky and
+               Yoshua Bengio and
+               Arnaud Bergeron and
+               James Bergstra and
+               Valentin Bisson and
+               Josh Bleecher Snyder and
+               Nicolas Bouchard and
+               Nicolas Boulanger-Lewandowski and
+               Xavier Bouthillier and
+               Alexandre de Br{\'{e}}bisson and
+               Olivier Breuleux and
+               Pierre Luc Carrier and
+               Kyunghyun Cho and
+               Jan Chorowski and
+               Paul F. Christiano and
+               Tim Cooijmans and
+               Marc-Alexandre C{\^{o}}t{\'{e}} and
+               Myriam C{\^{o}}t{\'{e}} and
+               Aaron C. Courville and
+               Yann N. Dauphin and
+               Olivier Delalleau and
+               Julien Demouth and
+               Guillaume Desjardins and
+               Sander Dieleman and
+               Laurent Dinh and
+               Melanie Ducoffe and
+               Vincent Dumoulin and
+               Samira Ebrahimi Kahou and
+               Dumitru Erhan and
+               Ziye Fan and
+               Orhan Firat and
+               Mathieu Germain and
+               Xavier Glorot and
+               Ian J. Goodfellow and
+               Matthew Graham and
+               {\c{C}}aglar G{\"{u}}l{\c{c}}ehre and
+               Philippe Hamel and
+               Iban Harlouchet and
+               Jean-Philippe Heng and
+               Bal{\'{a}}zs Hidasi and
+               Sina Honari and
+               Arjun Jain and
+               S{\'{e}}bastien Jean and
+               Kai Jia and
+               Mikhail Korobov and
+               Vivek Kulkarni and
+               Alex Lamb and
+               Pascal Lamblin and
+               Eric Larsen and
+               C{\'{e}}sar Laurent and
+               Sean Lee and
+               Simon Lefran{\c{c}}ois and
+               Simon Lemieux and
+               Nicholas L{\'{e}}onard and
+               Zhouhan Lin and
+               Jesse A. Livezey and
+               Cory Lorenz and
+               Jeremiah Lowin and
+               Qianli Ma and
+               Pierre-Antoine Manzagol and
+               Olivier Mastropietro and
+               Robert McGibbon and
+               Roland Memisevic and
+               Bart van Merri{\"{e}}nboer and
+               Vincent Michalski and
+               Mehdi Mirza and
+               Alberto Orlandi and
+               Christopher Joseph Pal and
+               Razvan Pascanu and
+               Mohammad Pezeshki and
+               Colin Raffel and
+               Daniel Renshaw and
+               Matthew Rocklin and
+               Adriana Romero and
+               Markus Roth and
+               Peter Sadowski and
+               John Salvatier and
+               Fran{\c{c}}ois Savard and
+               Jan Schl{\"{u}}ter and
+               John Schulman and
+               Gabriel Schwartz and
+               Iulian Vlad Serban and
+               Dmitriy Serdyuk and
+               Samira Shabanian and
+               {\'{E}}tienne Simon and
+               Sigurd Spieckermann and
+               S. Ramana Subramanyam and
+               Jakub Sygnowski and
+               J{\'{e}}r{\'{e}}mie Tanguay and
+               Gijs van Tulder and
+               Joseph P. Turian and
+               Sebastian Urban and
+               Pascal Vincent and
+               Francesco Visin and
+               Harm de Vries and
+               David Warde-Farley and
+               Dustin J. Webb and
+               Matthew Willson and
+               Kelvin Xu and
+               Lijun Xue and
+               Li Yao and
+               Saizheng Zhang and
+               Ying Zhang},
+  title     = {Theano: {A} Python framework for fast computation of mathematical
+               expressions},
+  journal   = {CoRR},
+  volume    = {abs/1605.02688},
+  year      = {2016}
+}
+
+@inproceedings{DBLP:journals/corr/SennrichFCBHHJL17,
+  author    = {Rico Sennrich and
+               Orhan Firat and
+               Kyunghyun Cho and
+               Barry Haddow and
+               Alexandra Birch and
+               Julian Hitschler and
+               Marcin Junczys-Dowmunt and
+               Samuel L{\"{a}}ubli and
+               Antonio Valerio Miceli Barone and
+               Jozef Mokry and
+               Maria Nadejde},
+  title     = {Nematus: a Toolkit for Neural Machine Translation},
+  publisher = {European Chapter of the Association for Computational Linguistics},
+  pages     = {65--68},
+  year      = {2017}
+}
+
+@inproceedings{Koehn2007Moses,
+  author    = {Philipp Koehn and
+               Hieu Hoang and
+               Alexandra Birch and
+               Chris Callison-Burch and
+               Marcello Federico and
+               Nicola Bertoldi and
+               Brooke Cowan and
+               Wade Shen and
+               Christine Moran and
+               Richard Zens and
+               Chris Dyer and
+               Ondrej Bojar and
+               Alexandra Constantin and
+               Evan Herbst},
+  title     = {Moses: Open Source Toolkit for Statistical Machine Translation},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2007}
+}
+
+@inproceedings{zollmann2007the,
+  author    = {Andreas Zollmann and
+               Ashish Venugopal and
+               Matthias Paulik and
+               Stephan Vogel},
+  title     = {The Syntax Augmented {MT} {(SAMT)} System at the Shared Task for the
+               2007 {ACL} Workshop on Statistical Machine Translation},
+  pages     = {216--219},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2007}
+}
+
+@article{och2003systematic,
+  author    = {Franz Josef Och and
+               Hermann Ney},
+  title     = {A Systematic Comparison of Various Statistical Alignment Models},
+  journal   = {Computational Linguistics},
+  volume    = {29},
+  number    = {1},
+  pages     = {19--51},
+  year      = {2003}
+}
+
+@inproceedings{zoph2016simple,
+  author    = {Barret Zoph and
+               Ashish Vaswani and
+               Jonathan May and
+               Kevin Knight},
+  title     = {Simple, Fast Noise-Contrastive Estimation for Large {RNN} Vocabularies},
+  pages     = {1217--1222},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2016}
+}
+
+@inproceedings{Ottfairseq,
+  author    = {Myle Ott and
+               Sergey Edunov and
+               Alexei Baevski and
+               Angela Fan and
+               Sam Gross and
+               Nathan Ng and
+               David Grangier and
+               Michael Auli},
+  title     = {fairseq: {A} Fast, Extensible Toolkit for Sequence Modeling},
+  pages     = {48--53},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2019}
+}
+
+@inproceedings{Vaswani2018Tensor2TensorFN,
+  author    = {Ashish Vaswani and
+               Samy Bengio and
+               Eugene Brevdo and
+               Fran{\c{c}}ois Chollet and
+               Aidan N. Gomez and
+               Stephan Gouws and
+               Llion Jones and
+               Lukasz Kaiser and
+               Nal Kalchbrenner and
+               Niki Parmar and
+               Ryan Sepassi and
+               Noam Shazeer and
+               Jakob Uszkoreit},
+  title     = {Tensor2Tensor for Neural Machine Translation},
+  pages     = {193--199},
+  publisher = {Association for Machine Translation in the Americas},
+  year      = {2018}
+}
+
+@inproceedings{KleinOpenNMT,
+  author    = {Guillaume Klein and
+               Yoon Kim and
+               Yuntian Deng and
+               Jean Senellart and
+               Alexander M. Rush},
+  title     = {OpenNMT: Open-Source Toolkit for Neural Machine Translation},
+  pages     = {67--72},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2017}
+}
+
+@inproceedings{luong2016acl_hybrid,
+  author    = {Minh-Thang Luong and
+               Christopher D. Manning},
+  title     = {Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character
+               Models},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2016}
+}
+
+@article{ZhangTHUMT,
+  author    = {Jiacheng Zhang and
+               Yanzhuo Ding and
+               Shiqi Shen and
+               Yong Cheng and
+               Maosong Sun and
+               Huan-Bo Luan and
+               Yang Liu},
+  title     = {{THUMT:} An Open Source Toolkit for Neural Machine Translation},
+  journal   = {CoRR},
+  volume    = {abs/1706.06415},
+  year      = {2017}
+}
+
+@inproceedings{JunczysMarian,
+  author    = {Marcin Junczys-Dowmunt and
+               Roman Grundkiewicz and
+               Tomasz Dwojak and
+               Hieu Hoang and
+               Kenneth Heafield and
+               Tom Neckermann and
+               Frank Seide and
+               Ulrich Germann and
+               Alham Fikri Aji and
+               Nikolay Bogoychev and
+               Andr{\'{e}} F. T. Martins and
+               Alexandra Birch},
+  title     = {Marian: Fast Neural Machine Translation in {C++}},
+  pages     = {116--121},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2018}
+}
+
+@article{hieber2017sockeye,
+  author    = {Felix Hieber and
+               Tobias Domhan and
+               Michael Denkowski and
+               David Vilar and
+               Artem Sokolov and
+               Ann Clifton and
+               Matt Post},
+  title     = {Sockeye: {A} Toolkit for Neural Machine Translation},
+  journal   = {CoRR},
+  volume    = {abs/1712.05690},
+  year      = {2017}
+}
+
+@inproceedings{WangCytonMT,
+  author    = {Xiaolin Wang and
+               Masao Utiyama and
+               Eiichiro Sumita},
+  title     = {CytonMT: an Efficient Neural Machine Translation Open-source Toolkit
+               Implemented in {C++}},
+  pages     = {133--138},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2018}
+}
+
+@article{DBLP:journals/corr/abs-1805-10387,
+  author    = {Oleksii Kuchaiev and
+               Boris Ginsburg and
+               Igor Gitman and
+               Vitaly Lavrukhin and
+               Carl Case and
+               Paulius Micikevicius},
+  title     = {OpenSeq2Seq: extensible toolkit for distributed and mixed precision
+               training of sequence-to-sequence models},
+  journal   = {CoRR},
+  volume    = {abs/1805.10387},
+  year      = {2018}
+}
+
+@article{nmtpy2017,
+  author    = {Ozan Caglayan and
+               Mercedes Garc{\'{\i}}a-Mart{\'{\i}}nez and
+               Adrien Bardet and
+               Walid Aransa and
+               Fethi Bougares and
+               Lo{\"{\i}}c Barrault},
+  title     = {{NMTPY:} {A} Flexible Toolkit for Advanced Neural Machine Translation
+               Systems},
+  journal   = {Prague Bull. Math. Linguistics},
+  volume    = {109},
+  pages     = {15--28},
+  year      = {2017}
+}
+%%%%% chapter appendix-A------------------------------------------------------
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
--- a/structure.tex
+++ b/structure.tex
@@ -527,7 +527,7 @@ innerbottommargin=5pt]{cBox}
 %----------------------------------------------------------------------------------------

 \usepackage{hyperref}
-\hypersetup{hidelinks,backref=true,pagebackref=true,hyperindex=true,colorlinks=false,breaklinks=true,urlcolor=ocre,bookmarks=true,bookmarksopen=true}
+\hypersetup{hidelinks,colorlinks=false,breaklinks=true,urlcolor=ocre,bookmarksopen=true}
 %backref: back-references
 %pagebackref: back-reference page numbers
 %hyperindex: index links