Commit 6797dee8 by 曹润柘

合并分支 'caorunzhe' 到 'master'

Caorunzhe

查看合并请求 !776
parents 154466b9 484276b0
......@@ -105,7 +105,7 @@
\parinterval 机器翻译有两种常用的推断方式\ \dash \ 自左向右推断和自右向左推断。自左向右推断符合现实世界中人类的语言使用规律,因为人在翻译一个句子时,总是习惯从句子开始的部分往后生成\footnote{有些语言中,文字是自右向左书写,这时自右向左推断更符合人类使用这种语言的习惯。}。不过,有时候人也会使用当前单词后面的译文信息。也就是说,翻译也需要“未来” 的文字信息。于是很容易想到使用自右向左的方法对译文进行生成。
\parinterval 以上两种推断方式在神经机器翻译中都有应用,对于源语言句子$\seq{x}=\{x_1,\dots,x_m\}$和目标语言句子$\seq{y}=\{y_1,\dots,y_n\}$,自左向右的翻译可以被描述为公式\eqref{eq:14-1}
\begin{eqnarray}
\funp{P}(\seq{y}\vert\seq{x}) &=& \prod_{j=1}^n \funp{P}(y_j\vert\seq{y}_{<j},\seq{x})
......@@ -118,7 +118,7 @@
\label{eq:14-2}
\end{eqnarray}
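上面的链式分解可以用一个简单的程序示意:按自左向右的顺序累加每个位置的对数概率。这里的逐词条件概率由一个假想的 step\_prob 接口给出,仅为示意,并非某个具体系统的实现:

```python
import math

def sentence_logprob(y, step_prob):
    # 按链式法则累加:log P(y|x) = sum_j log P(y_j | y_<j, x)
    total = 0.0
    for j, word in enumerate(y):
        total += math.log(step_prob(y[:j], word))  # y[:j] 即历史 y_<j
    return total

# 玩具条件概率:任何上下文下每个词的概率均为 0.5
toy = lambda history, word: 0.5
print(sentence_logprob(["a", "b", "c"], toy))  # 3 * log(0.5) ≈ -2.079
```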
\noindent 其中,$\seq{y}_{<j}=\{y_1,\dots,y_{j-1}\}$,$\seq{y}_{>j}=\{y_{j+1},\dots,y_n\}$。可以看到,自左向右推断和自右向左推断本质上是一样的。{\chapterten}和{\chaptertwelve}均使用了自左向右的推断方法。自右向左推断比较简单的实现方式是:在训练过程中直接将双语数据中的目标语言句子进行反转,之后仍然使用原始的模型进行训练即可。在推断的时候,生成的目标语言词串也需要进行反转得到最终的译文。有时候,使用自右向左的推断方式会取得更好的效果\upcite{DBLP:conf/wmt/SennrichHB16}。不过更多情况下需要同时使用词串左端(历史)和右端(未来)的信息。有多种思路可以融合左右两端信息:
\begin{itemize}
\vspace{0.5em}
......
......@@ -103,7 +103,7 @@
\node [anchor=north] (pos1) at ([xshift=1.5em,yshift=-1.0em]node0-2.south) {\small{(a) GPT模型结构}};
\node [anchor=north] (pos2) at ([xshift=1.5em,yshift=-1.0em]node0-6.south) {\small{(b) BERT模型结构}};
\node [anchor=south] (ex) at ([xshift=2.1em,yshift=0.5em]node3-1.north) {\small{TRM:Transformer}};
......
......@@ -60,7 +60,7 @@
\node [anchor=west,fill=red!20,minimum width=1.5em](d2-1) at ([xshift=-0.0em]d2.east){};
\node [anchor=west,fill=yellow!20,minimum width=1.5em](d3-1) at ([xshift=-0.0em]d3.east){};
\node [anchor=north] (d4) at ([xshift=1em]d1.south) {\small{训练:}};
\node [anchor=north] (d5) at ([xshift=0.5em]d2.south) {\small{}};
\draw [->,thick] ([xshift=0em]d4.east)--([xshift=1.5em]d4.east);
\draw [->,thick,dashed] ([xshift=0em]d5.east)--([xshift=1.5em]d5.east);
......
\begin{tikzpicture}
\begin{scope}
\node [anchor=center] (node1) at (0,0) {\textbf{Machine Translation}, sometimes referred to by the abbreviation \textbf{MT} (not to be };
\node [anchor=north] (node2) at (node1.south) {confused with computer-aided translation, machine-aided human translation inter};
\node [anchor=north] (node3) at (node2.south) {-active translation), is a subfield of computational linguistics that investigates the};
\node [anchor=north] (node4) at ([xshift=-1.8em]node3.south) {use of software to translate text or speech from one language to another.};
\node [anchor=south] (node5) at ([xshift=-12.8em,yshift=0.5em]node1.north) {\Large{WIKIPEDIA}};
......
......@@ -12,8 +12,8 @@
\node[node,anchor=west,minimum width=6em,minimum height=2.4em,fill=blue!20,line width=0.6pt] (decoder2) at ([xshift=4em,yshift=0em]decoder1.east){\small 解码器};
\node[node,anchor=west,minimum width=6em,minimum height=2.4em,fill=blue!30,line width=0.6pt] (decoder3) at ([xshift=3em]decoder2.east){\small 解码器};
\node[anchor=north,font=\scriptsize,fill=yellow!20] (w1) at ([yshift=-1.6em]decoder1.south){知识 \ 就是 \ 力量 \ \ <eos>};
\node[anchor=north,font=\scriptsize,fill=green!20] (w3) at ([yshift=-1.6em]decoder3.south){Wissen \ ist \ Macht \ . \ <eos>};
\node[anchor=south,font=\scriptsize,fill=orange!20] (w2) at ([yshift=1.6em]encoder1.north){Knowledge \ is \ power \ . };
\node[anchor=south,font=\scriptsize,fill=orange!20] (w4) at ([yshift=1.6em]encoder3.north){Knowledge \ is \ power \ . };
......
......@@ -22,7 +22,7 @@
%----------------------------------------------------------------------------------------
\chapter{低资源神经机器翻译}
\parinterval 神经机器翻译带来的性能提升是显著的,但随之而来的问题是对海量双语训练数据的依赖。但是,不同语言可使用的数据规模是不同的。比如汉语、英语这种使用范围广泛的语言,存在着大量的双语平行句对,这些语言被称为{\small\bfnew{富资源语言}}\index{富资源语言}(High-resource Language\index{High-resource Language})。而对于其它一些使用范围稍小的语言,如斐济语、古吉拉特语等,相关的数据非常稀少,这些语言被称为{\small\bfnew{低资源语言}}\index{低资源语言}(Low-resource Language\index{Low-resource Language})。世界上现存语言超过5000种,仅有很少一部分为富资源语言,绝大多数均为低资源语言。即使在富资源语言中,对于一些特定的领域,双语平行语料也是十分稀缺的。有时,一些特殊的语种或者领域甚至会面临“零资源”的问题。因此,{\small\bfnew{低资源机器翻译}}\index{低资源机器翻译}(Low-resource Machine Translation)是当下急需解决且颇具挑战的问题。
\parinterval 本章将对低资源神经机器翻译的相关问题、模型和方法展开介绍,内容涉及数据的有效使用、双向翻译模型、多语言翻译建模、无监督机器翻译、领域适应五个方面。
......@@ -32,7 +32,7 @@
\section{数据的有效使用}\label{effective-use-of-data}
\parinterval 数据稀缺是低资源机器翻译所面临的主要问题,充分使用既有数据是一种解决问题的思路。比如,在双语训练数据不充足的时候,可以对双语数据的部分单词用近义词进行替换,达到丰富双语数据的目的\upcite{DBLP:conf/acl/FadaeeBM17a,DBLP:conf/emnlp/WangPDN18},也可以考虑用转述等方式生成更多的双语训练数据\upcite{DBLP:conf/emnlp/MartonCR09,DBLP:conf/eacl/LapataSM17}。
\parinterval 另一种思路是使用更容易获取的单语数据。实际上,在统计机器翻译时代,使用单语数据训练语言模型是构建机器翻译系统的关键步骤,好的语言模型往往会带来性能的增益。而这个现象在神经机器翻译中似乎并不明显,因为在大多数神经机器翻译的范式中,并不要求使用大规模单语数据来帮助机器翻译系统。甚至,连语言模型都不会作为一个独立的模块。这一方面是由于神经机器翻译系统的解码端本身就起着语言模型的作用,另一方面是由于双语数据的增多使得翻译模型可以很好地捕捉目标语言的规律。但是,双语数据总是有限的,很多场景下,单语数据的规模会远大于双语数据,如果能够让这些单语数据发挥作用,显然是一种非常好的选择。针对以上问题,下面将从数据增强、基于语言模型的单语数据使用等方面展开讨论。
......@@ -49,7 +49,7 @@
\subsubsection{1. 回译}
\parinterval {\small\bfnew{回译}}\index{回译}(Back Translation, BT\index{Back Translation})是目前机器翻译任务上最常用的一种数据增强方法\upcite{Sennrich2016ImprovingNM,DBLP:conf/emnlp/EdunovOAG18,DBLP:conf/aclnmt/HoangKHC18}。回译的主要思想是:利用目标语言-源语言翻译模型(反向翻译模型)来生成伪双语句对,用于训练源语言-目标语言翻译模型(正向翻译模型)。假设现在需要训练一个英汉翻译模型。首先,使用双语数据训练汉英翻译模型,即反向翻译模型。然后通过该模型将额外的汉语单语句子翻译为英语句子,从而得到大量的伪英语-真实汉语双语句对。最后,将回译得到的伪双语句对和真实双语句对混合,训练得到最终的英汉翻译模型。
回译方法只需要训练一个反向翻译模型,就可以利用单语数据来增加训练数据的数量,因此得到了广泛使用\upcite{Hassan2018AchievingHP,DBLP:conf/iclr/LampleCDR18,DBLP:conf/emnlp/LampleOCDR18}。图\ref{fig:16-1} 给出了回译方法的一个简要流程。
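上述回译流程可以概括为下面的示意代码。其中 train 与 translate 为假想的接口,分别代表任意神经机器翻译模型的训练与推断过程,仅用于说明数据的流向:

```python
def back_translation(bilingual, tgt_mono, train, translate):
    # 1. 用调换方向的双语数据训练反向(目标语言->源语言)翻译模型
    reverse_model = train([(y, x) for x, y in bilingual])
    # 2. 将目标语言单语句子翻译为源语言,得到"伪源语言-真实目标语言"句对
    pseudo = [(translate(reverse_model, y), y) for y in tgt_mono]
    # 3. 伪双语句对与真实双语句对混合,训练最终的正向模型
    return train(bilingual + pseudo)
```

由于回译只依赖模型的输入输出接口,这一流程与具体的模型结构无关。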
%----------------------------------------------
\begin{figure}[htp]
......@@ -73,9 +73,9 @@
\end{figure}
%----------------------------------------------
\parinterval 进一步,研究人员发现,在低资源场景中,由于缺乏双语数据,高质量的伪双语数据对于模型来说更有帮助。而在富资源场景中,在回译产生的源语言句子中添加一些噪声,提高翻译结果的多样性,反而可以达到更好的效果,比较常用的方法是使用采样解码、Top-$k$解码和加噪\upcite{DBLP:conf/emnlp/EdunovOAG18,DBLP:conf/aclnmt/ImamuraFS18,DBLP:conf/emnlp/WuWXQLL19}。回译中常用的解码方式为束搜索,在生成每个词的时候只考虑预测概率最高的几个词,因此生成的翻译结果质量更高,但导致的问题是翻译结果主要集中在部分高频词上,生成的伪数据缺乏多样性,也就很难去准确地覆盖真实的数据分布\upcite{DBLP:conf/icml/OttAGR18}。采样解码是指在解码过程中,对词表中所有的词按照预测概率进行随机采样,因此整个词表中的词都有可能被选中,从而使生成结果多样性更强,但翻译质量和流畅度也会明显下降。Top-$k$解码是对束搜索和采样解码的一个折中方法。在解码过程中,Top-$k$解码对词表中预测概率最高的前$k$个词进行随机采样,这样在保证翻译结果准确的前提下,提高了结果的多样性。加噪方法在束搜索的解码结果加入一些噪声,如丢掉或屏蔽部分词、打乱句子顺序等。这些方法在生成的源语言句子中引入了噪声,不仅增加了对包含低频词或噪声句子的训练次数,同时也提高了模型的健壮性和泛化能力\upcite{DBLP:conf/icml/VincentLBM08}
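上述几种解码方式的差异,可以用对单步预测分布的不同处理方式来示意(这里直接给定一个小词表上的概率分布,束搜索以束宽为 1 的贪婪搜索作为简化):

```python
import random

def greedy(dist):
    # 贪婪/束搜索的直觉:只考虑预测概率最高的词
    return max(dist, key=dist.get)

def sample(dist):
    # 采样解码:词表中所有词都按预测概率有机会被选中
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs)[0]

def top_k_sample(dist, k):
    # Top-k 解码:只在预测概率最高的前 k 个词中随机采样
    top = dict(sorted(dist.items(), key=lambda kv: -kv[1])[:k])
    return sample(top)

dist = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}
print(greedy(dist))           # the
print(top_k_sample(dist, 2))  # "the" 或 "a"
```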
\parinterval 与回译方法类似,源语言单语数据也可以通过一个双语数据训练的正向翻译模型获得对应的目标语言翻译结果,从而构造正向翻译的伪数据\upcite{DBLP:conf/emnlp/ZhangZ16}。与回译方法相反,这时的伪数据中源语言句子是真实的,而目标语言句子是自动生成的,构造的伪数据对译文的流畅性并没有太大帮助,其主要作用是提升编码器的特征提取能力。然而,由于伪数据中生成的译文质量很难保证,因此利用正向翻译模型生成伪数据的方法带来的性能提升效果要弱于回译,甚至可能是有害的\upcite{DBLP:conf/emnlp/WuWXQLL19}
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
......@@ -90,14 +90,14 @@
\vspace{0.5em}
\item 丢掉单词:句子中的每个词均有$\funp{P}_{\rm{Drop}}$的概率被丢弃。
\vspace{0.5em}
\item 掩码单词:句子中的每个词均有$\funp{P}_{\rm{Mask}}$的概率被替换为一个额外的[Mask]词。[Mask]的作用类似于占位符,可以理解为一个句子中的部分词被屏蔽掉,无法得知该位置词的准确含义。
\vspace{0.5em}
\item 打乱顺序:将句子中距离较近的某些词的位置进行随机交换。
\vspace{0.5em}
\end{itemize}
%----------------------------------------------
\parinterval\ref{fig:16-3}展示了三种加噪方法的示例。这里,$\funp{P}_{\rm{Drop}}$$\funp{P}_{\rm{Mask}}$均设置为0.1,表示每个词有$10\%$的概率被丢弃或掩码。打乱顺序的操作略微复杂,一种实现方法是,通过一个数字来表示每个词在句子中的位置,如“我”是第一个词,“你”是第三个词,然后,在每个位置生成一个$1$$n$的随机数,$n$一般设置为3,然后将每个词的位置数和对应的随机数相加,即图中的$\seq{S}$。 对$\seq{S}$ 按照从小到大排序,根据排序后每个位置的索引从原始句子中选择对应的词,从而得到最终打乱顺序后的结果。比如,在排序后,$S_2$的值小于$S_1$的值,其余词则保持递增顺序,则将原始句子中的第一个词和第二个词进行交换,其他词保持不变。
\parinterval\ref{fig:16-3}展示了三种加噪方法的示例。这里,$\funp{P}_{\rm{Drop}}$$\funp{P}_{\rm{Mask}}$均设置为0.1,表示每个词有$10\%$的概率被丢弃或掩码。打乱句子内部顺序的操作略微复杂,一种实现方法是:通过一个数字来表示每个词在句子中的位置,如“我”是第一个词,“你”是第三个词,然后,在每个位置生成一个$1$$n$的随机数,$n$一般设置为3,然后将每个词的位置数和对应的随机数相加,即图中的$\seq{S}$。 对$\seq{S}$ 按照从小到大排序,根据排序后每个位置的索引从原始句子中选择对应的词,从而得到最终打乱顺序后的结果。比如,在计算后,除了$S_2$的值小于$S_1$外,其余单词的$S$值均为递增顺序,则将原句中第一个词和第二个词进行交换,其他词保持不变。
%----------------------------------------------
\begin{figure}[htp]
......@@ -121,7 +121,7 @@
\end{itemize}
%----------------------------------------------
\parinterval 另外一种加噪方法是进行词替换:将双语数据中的部分词替换为词表中的其他词,在保证句子的语义或语法正确性的前提下,增加了训练数据的多样性。比如,对于“我/出去/玩。”这句话,将“我”替换为“你”、“他”、“我们”。或者,将“玩”替换为“骑车”、“学习”、“吃饭”等,虽然改变了语义,但句子在语法上仍然是合理的
\parinterval 词替换的另一种策略是将源语言中的稀有词替换为语义相近的词\upcite{DBLP:conf/acl/FadaeeBM17a}。词表中的稀有词由于出现次数较少,很容易导致训练不充分问题\upcite{DBLP:conf/acl/SennrichHB16a}。通过语言模型将源语言句子中的某个词替换为满足语法或语义条件的稀有词,再通过词对齐工具找到源语言句子中被替换的词在目标语言句子中对应的位置,借助翻译词典将这个目标语言位置的单词替换为词典中的翻译结果,从而得到伪双语数据。
......@@ -145,16 +145,16 @@
\end{figure}
%----------------------------------------------
\parinterval 可比语料大多存在于网页中,内容较为复杂,可能会存在较大比例的噪声,如HTML字符、乱码等。首先需要对内容进行充分的数据清洗,得到干净的可比语料,然后从中抽取出可用的双语句对。传统的抽取方法一般通过统计模型或双语词典来得到双语句对。比如,通过计算两个不同语言句子之间的单词重叠数或BLEU值\upcite{finding2006adafre,method2008keiji};或者通过排序模型或二分类器判断一个目标语言句子和一个源语言句子互译的可能性\upcite{DBLP:journals/coling/MunteanuM05,DBLP:conf/naacl/SmithQT10}
\parinterval 另外一种比较有效的方法是根据两种语言中每个句子的表示向量来抽取数据\upcite{DBLP:conf/emnlp/WuZHGQLL19}。首先,对于两种语言的每个句子,分别使用词嵌入加权平均等方法计算得到句子的表示向量,然后计算每个源语言句子和目标语言句子之间的余弦相似度,相似度大于一定阈值的句对则认为是可用的双语句对\upcite{DBLP:conf/emnlp/WuZHGQLL19}。然而,不同语言单独训练得到的词嵌入可能对应不同的表示空间,因此得到的表示向量无法用于衡量两个句子的相似度\upcite{DBLP:journals/corr/MikolovLS13}。为了解决这个问题,一般使用在同一表示空间的跨语言词嵌入来表示两种语言的单词\upcite{DBLP:journals/jair/RuderVS19}。在跨语言词嵌入中,不同语言相同意思的词对应的词嵌入具有较高的相似性,因此得到的句子表示向量也就可以用于衡量两个句子是否表示相似的语义\upcite{DBLP:conf/icml/LeM14}。关于跨语言词嵌入的具体内容,可以参考\ref{unsupervised-dictionary-induction}节的内容。
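基于句子表示向量抽取双语句对的过程可以示意如下。这里用词嵌入的简单平均作为句子表示;按照文中的讨论,假设两个词嵌入表已经位于同一跨语言表示空间,阈值等参数也仅是示例取值:

```python
def sent_vec(sent, emb):
    # 句子表示:词嵌入的简单平均
    dim = len(next(iter(emb.values())))
    v = [0.0] * dim
    for w in sent:
        v = [a + b for a, b in zip(v, emb[w])]
    return [a / len(sent) for a in v]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

def mine_pairs(src_sents, tgt_sents, src_emb, tgt_emb, threshold=0.9):
    # 余弦相似度超过阈值的句对被认为是可用的双语句对
    return [(s, t) for s in src_sents for t in tgt_sents
            if cosine(sent_vec(s, src_emb), sent_vec(t, tgt_emb)) > threshold]
```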
%----------------------------------------------------------------------------------------
% NEW SUB-SECTION
%----------------------------------------------------------------------------------------
\subsection{基于语言模型的方法}
\parinterval 除了构造双语数据进行数据增强,直接利用单语数据也是机器翻译中的常用方法。通常,单语数据会被用于语言模型的训练(见{\chaptertwo})。对于机器翻译系统,使用语言模型也是一件十分自然的事情,在目标语言端,语言模型可以帮助系统选择更加流畅的译文结果;在源语言端,语言模型也可以用于句子编码,进而更好地生成句子的表示结果。在传统方法中,语言模型更多地被使用在目标语言端。不过,近些年来随着预训练技术的发展,语言模型也被使用在神经机器翻译的编码器端。下面将从语言模型在解码器端的融合、预训练词嵌入、预训练编码器和多任务学习四方面介绍基于语言模型的单语数据使用方法。
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
......@@ -172,7 +172,7 @@
\parinterval 浅融合方法独立训练翻译模型和语言模型,在生成每个词的时候,对两个模型的预测概率进行加权求和得到最终的预测概率。浅融合的不足在于,解码过程对每个词均采用相同的语言模型权重,缺乏灵活性。针对这个问题,深融合联合翻译模型和语言模型进行训练,从而在解码过程中动态地计算语言模型的权重,更好地融合翻译模型和语言模型来计算预测概率。
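浅融合的单步预测方式可以示意如下:对翻译模型与语言模型的对数预测概率加权求和,权重 beta 为固定超参数;深融合则会在解码过程中动态地计算这一权重。函数与变量名均为示意:

```python
import math

def shallow_fusion_step(tm_probs, lm_probs, beta=0.3):
    # 浅融合:score(w) = log P_TM(w) + beta * log P_LM(w),取分数最高的词
    scores = {w: math.log(tm_probs[w]) + beta * math.log(lm_probs[w])
              for w in tm_probs}
    return max(scores, key=scores.get)
```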
\parinterval 大多数情况下,目标语言端语言模型的使用可以提高译文的流畅度。不过,它并不会增加翻译结果对源语言句子表达的充分性,即源语言句子的信息是否被充分体现到了译文中。也有一些研究发现,神经机器翻译过于关注译文的流畅度,但是充分性的问题没有得到很好考虑,比如,神经机器翻译系统的结果中经常出现漏译等问题。也有一些研究人员提出控制翻译充分性的方法,让译文在流畅度和充分性之间达到平衡\upcite{TuModeling,li-etal-2018-simple,DBLP:journals/tacl/TuLLLL17}
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
......@@ -181,13 +181,13 @@
\parinterval 神经机器翻译模型所使用的编码器-解码器框架天然就包含了对输入(源语言)和输出(目标语言)进行表示学习的过程。在编码端,需要学习一种分布式表示来表示源语言句子的信息,这种分布式表示可以包含序列中每个位置的表示结果(见{\chapternine})。从结构上看,神经机器翻译所使用的编码器与语言模型无异,或者说神经机器翻译的编码器其实就是一个源语言的语言模型。唯一的区别在于,神经机器翻译的编码器并不直接输出源语言句子的生成概率,而传统语言模型是建立在序列生成任务上的。既然神经机器翻译的编码器可以与解码器一起在双语数据上联合训练,那为什么不使用更大规模的数据单独对编码器进行训练呢?或者说,直接使用一个预先训练好的编码器,与机器翻译的解码器配合完成翻译过程。
\parinterval 实现上述想法的一种手段是{\small\sffamily\bfnew{预训练}}\index{预训练}(Pre-training)\index{Pre-training}\upcite{DBLP:conf/nips/DaiL15,Peters2018DeepCW,radford2018improving,devlin2019bert}。预训练的做法相当于将源语言表示的学习任务从目标任务中分离出来,这样可以利用额外的更大规模的数据进行学习。常用的一种方法是使用语言建模等方式在大规模单语数据上进行训练,来得到神经机器翻译模型中的一部分(比如词嵌入和编码器等)的模型参数初始值。然后,神经机器翻译模型在双语数据上进行{\small\sffamily\bfnew{微调}}\index{微调}(Fine-tuning)\index{Fine-tuning},以得到最终的翻译模型。
\parinterval 词嵌入可以被看作是对每个独立单词进行的表示学习的结果,在自然语言处理的众多任务中都扮演着重要角色\upcite{DBLP:conf/icml/CollobertW08,2011Natural,DBLP:journals/corr/abs-1901-09069}。到目前为止已经有大量的词嵌入学习方法被提出(见{\chapternine}),因此可以直接应用这些方法在海量的单语数据上训练得到词嵌入,用来初始化神经机器翻译模型的词嵌入参数矩阵\upcite{DBLP:conf/aclwat/NeishiSTIYT17,2018When}
\parinterval 需要注意的是,在神经机器翻译中使用预训练词嵌入有两种方法。一种方法是直接将词嵌入作为固定的输入,也就是在训练神经机器翻译模型的过程中,并不调整词嵌入的参数。这样做的目的是完全将词嵌入模块独立出来,机器翻译可以被看作是在固定的词嵌入输入上进行的建模,从而降低了机器翻译模型学习的难度。另一种方法是仍然遵循``预训练+微调''的策略,将词嵌入作为机器翻译模型部分参数的初始值。在之后机器翻译训练过程中,词嵌入模型结果会被进一步更新。近些年,在词嵌入预训练的基础上进行微调的方法越来越受到研究者的青睐。因为在实践中发现,完全用单语数据学习的单词表示,与双语数据上的翻译任务并不完全匹配。同时目标语言的信息也会影响源语言的表示学习。
\parinterval 虽然预训练词嵌入在海量的单语数据上学习到了丰富的表示,但词嵌入很主要的一个缺点是无法解决一词多义问题。在不同的上下文中,同一个单词经常表示不同的意思,但它的词嵌入是完全相同的,模型需要在编码过程中通过上下文去理解每个词在当前语境下的含义。因此,上下文词向量在近些年得到了广泛的关注\upcite{DBLP:conf/acl/PetersABP17,mccann2017learned,Peters2018DeepCW}。上下文词嵌入是指一个词的表示不仅依赖于单词自身,还依赖于上下文语境。由于在不同的上下文中,每个词对应的词嵌入是不同的,因此无法简单地通过词嵌入矩阵来表示,通常的做法是使用海量的单语数据预训练语言模型任务,使模型具备丰富的特征提取能力\upcite{Peters2018DeepCW,radford2018improving,devlin2019bert}
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
......@@ -205,7 +205,7 @@
\end{figure}
%----------------------------------------------
\parinterval GPT\upcite{radford2018improving}通过Transformer模型自回归地训练单向语言模型,类似于神经机器翻译模型的解码器,相比双向LSTM等模型,Transformer模型的表示能力更强。在大规模单语数据上预训练得到的模型结构只需要进行简单的修改,再通过任务特定的训练数据进行微调,就可以很好地适配到下游任务中。之后提出的BERT模型更是将预训练的作用提升到了新的水平\upcite{devlin2019bert}。GPT模型的一个缺陷在于模型只能进行单向编码,也就是前面的文本在建模时无法获取到后面的信息。而BERT提出了一种自编码的方式,使模型在预训练阶段可以通过双向编码的方式进行建模,进一步增强了模型的表示能力。
\parinterval BERT的核心思想是通过{\small\bfnew{掩码语言模型}}(Masked Language Model,MLM)\index{掩码语言模型}\index{MLM}任务进行预训练。掩码语言模型的思想类似于完形填空,随机选择输入句子中的部分词掩码,之后让模型预测这些被掩码的词。掩码的具体做法是将被选中的词替换为一个特殊的词[Mask],这样模型在训练过程中,无法得到掩码位置词的信息,需要联合上下文内容进行预测,因此提高了模型对上下文的特征提取能力。实验表明,相比在下游任务中仅利用上下文词嵌入,在大规模单语数据上预训练的模型具有更强的表示能力。而使用掩码的方式进行训练也给神经机器翻译提供了新的思路,在本章中也会使用到类似方法。
......@@ -264,7 +264,7 @@
\section{双向翻译模型}
\parinterval 在机器翻译任务中,对于给定的双语数据,可以同时学习源语言到目标语言和目标语言到源语言的翻译模型,因此机器翻译可被视为一种双向任务。那么,两个方向的翻译模型能否联合起来,相辅相成呢?下面将从双向训练和对偶学习两方面对双向翻译模型进行介绍。这些方法被大量使用在低资源翻译系统中,比如,可以用双向翻译模型反复迭代构造伪数据。
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
......@@ -279,9 +279,9 @@
\parinterval 这里可以把$\seq{x}$$\seq{y}$都看作分布式的向量表示;$\seq{W}$应当是一个满秩矩阵,否则对于任意一个$\seq{x}$经过$\seq{W}$变换得到的$\seq{y}$只落在所有可能的$\seq{y}$的一个子空间内,即在给定$\seq{W}$的情况下有些$\seq{y}$不能被任何一个$\seq{x}$表达,而这不符合常识,因为不管是什么句子,总能找到它的一种译文。若$\seq{W}$是满秩矩阵说明$\seq{W}$可逆,也就是给定$\seq{x}$$\seq{y}$的变换$\seq{W}$下,$\seq{y}$$\seq{x}$的变换必然是$\seq{W}$的逆而不是其他矩阵。
\parinterval 这个例子说明$\funp{P}(\seq{y}|\seq{x})$$\funp{P}(\seq{x}|\seq{y})$直觉上应当存在联系。当然,$\seq{x}$$\seq{y}$之间是否存在简单的线性变换关系并没有结论,但是上面的例子给出了一种对源语言句子和目标语言句子进行相互转化的思路。实际上,研究人员已经通过一些数学技巧用目标函数$\funp{P}(\seq{y}|\seq{x})$$\funp{P}(\seq{x}|\seq{y})$联系起来,这样训练神经机器翻译系统一次就可以同时得到两个方向的翻译模型,使得训练变得更加高效\upcite{Hassan2018AchievingHP,DBLP:conf/aaai/Zhang0LZC18,DBLP:conf/wmt/SunJXHWW19}。双向联合训练的基本思想是:使用两个方向的翻译模型对单语数据进行推断,之后用翻译结果与原始的单语数据作为训练语料,通过多次迭代更新两个方向上的机器翻译模型。
\parinterval\ref{fig:16-9}给出了一个双向训练的流程,其中$M_{x \rightarrow y}^{k}$表示第$k$轮得到的$x$$y$的翻译模型,$M_{y \rightarrow x}^{k}$表示第$k$轮得到的$y$$x$的翻译模型。这里只展示了前两轮迭代。在第一次迭代开始之前,首先使用双语数据对两个初始翻译模型进行预训练。为了保持一致性,这里称之为第0 轮迭代。在第一轮迭代中,首先使用这两个翻译模型$M_{x \rightarrow y}^{0}$$M_{y \rightarrow x}^{0}$ 翻译单语数据$X=\{ x_i \}$$Y= \{ y_i \}$ 后得到译文$\{\hat{y}_i^{0} \}$$\{ \hat{x}_i^{0}\}$。进一步,构建伪训练数据集$\{ x_i,\hat{y}_i^{0}\}$$\{ \hat{x}_i^{0},y_i \}$。然后使用上面的两个伪训练集和原始双语数据混合训练得到模型$M_{x \rightarrow y}^{1}$$M_{y \rightarrow x}^{1}$并进行参数更新,即用$\{ \hat{x}_i^{0},y_i\} \bigcup \{ x_i,y_i\}$训练$M_{x \rightarrow y}^{1}$,用$\{ \hat{y}_i^{0},x_i\} \bigcup \{ y_i,x_i\}$训练$M_{y \rightarrow x}^{1}$。第二轮迭代继续重复上述过程,使用更新参数后的翻译模型$M_{x \rightarrow y}^{1}$$M_{y \rightarrow x}^{1}$ 得到新的伪数据集$\{ x_i,\hat{y}_i^{1}\}$$\{ \hat{x}_i^{1},y_i \}$。然后,进一步得到翻译模型$M_{x \rightarrow y}^{2}$$M_{y \rightarrow x}^{2}$。这种方式本质上也是一种自学习的过程,通过逐步生成更好的伪数据来提升模型质量。
\parinterval\ref{fig:16-9}给出了一个双向训练的流程,其中$M_{x \rightarrow y}^{k}$表示第$k$轮得到的$x$$y$的翻译模型,$M_{y \rightarrow x}^{k}$表示第$k$轮得到的$y$$x$的翻译模型。这里只展示了前两轮迭代。在第一次迭代开始之前,首先使用双语数据对两个初始翻译模型进行训练。为了保持一致性,这里称之为第0 轮迭代。在第一轮迭代中,首先使用这两个翻译模型$M_{x \rightarrow y}^{0}$$M_{y \rightarrow x}^{0}$ 翻译单语数据$X=\{ x_i \}$$Y= \{ y_i \}$ 后得到译文$\{\hat{y}_i^{0} \}$$\{ \hat{x}_i^{0}\}$。进一步,构建伪训练数据集$\{ x_i,\hat{y}_i^{0}\}$$\{ \hat{x}_i^{0},y_i \}$。然后使用上面的两个伪训练数据集和原始双语数据混合,训练得到模型$M_{x \rightarrow y}^{1}$$M_{y \rightarrow x}^{1}$并进行参数更新,即用$\{ \hat{x}_i^{0},y_i\} \bigcup \{ x_i,y_i\}$训练$M_{x \rightarrow y}^{1}$,用$\{ \hat{y}_i^{0},x_i\} \bigcup \{ y_i,x_i\}$训练$M_{y \rightarrow x}^{1}$。第二轮迭代继续重复上述过程,使用更新参数后的翻译模型$M_{x \rightarrow y}^{1}$$M_{y \rightarrow x}^{1}$ 得到新的伪数据集$\{ x_i,\hat{y}_i^{1}\}$$\{ \hat{x}_i^{1},y_i \}$。然后,进一步得到翻译模型$M_{x \rightarrow y}^{2}$$M_{y \rightarrow x}^{2}$。这种方式本质上也是一种自学习的过程,通过逐步生成更好的伪数据来提升模型质量。
%----------------------------------------------
\begin{figure}[h]
......@@ -300,7 +300,7 @@
目前,对偶学习的思想已经广泛应用于低资源机器翻译领域,它不仅能够提升在有限双语资源下的翻译模型性能,而且能够利用未标注的单语数据来进行学习。下面将针对{\small\bfnew{有监督对偶学习}}(Dual Supervised Learning\index{Dual Supervised Learning}\upcite{DBLP:conf/icml/XiaQCBYL17,DBLP:conf/icml/XiaTTQYL18}{\small\bfnew{无监督对偶学习}}(Dual Unsupervised Learning\index{Dual Unsupervised Learning}\upcite{qin2020dual,DBLP:conf/nips/HeXQWYLM16,zhao2020dual}两方面,对对偶学习的思想进行介绍
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
......@@ -320,7 +320,7 @@
\label{eq:16-4}
\end{eqnarray}
\parinterval 通过该正则化项,互为对偶的两个任务可以被放在一起学习,通过任务对偶性加强监督学习的过程,就是有监督对偶学习\upcite{DBLP:conf/icml/XiaQCBYL17,qin2020dual}。这里,$\funp{P}(\seq{x})$$\funp{P}(\seq{y})$这两个语言模型是预先训练好的,并不参与翻译模型的训练。可以看到,对于单独的一个模型来说,其目标函数增加了与另外一个方向的模型相关的损失项。这样的形式与L1/L2正则化非常类似(见{\chapterthirteen}),因此可以把这个方法看作是一种正则化的手段(由翻译任务本身的性质所启发而来)。有监督对偶学习实际上要优化如下的损失函数:
\begin{eqnarray}
{L} & = & \log{\funp{P}(\seq{y}|\seq{x})}+\log{\funp{P}(\seq{x}|\seq{y})}+{L}_{\rm{dual}}
\label{eq:16-5}
......@@ -333,9 +333,9 @@
%----------------------------------------------------------------------------------------
\subsubsection{2. 无监督对偶学习}
\parinterval 有监督的对偶学习需要使用双语数据来训练两个翻译模型,但是有些低资源语言仅有少量双语数据可以训练。因此,如何使用资源相对丰富的单语数据来提升翻译模型的性能也是一个关键问题。
\parinterval 无监督对偶学习提供了一个解决问题的思路\upcite{qin2020dual}。假设目前有两个比较弱的翻译模型,一个原始翻译模型$f$将源语言句子$\seq{x}$翻译成目标语言句子$\seq{y}$,一个对偶任务模型$g$将目标语言句子$\seq{y}$翻译成源语言句子$\seq{x}$。翻译模型可由有限的双语训练或者使用无监督机器翻译的方法得到。如图\ref{fig:16-10}所示,无监督对偶学习的做法是,先通过原始任务模型$f$将一个源语言单语句子$x$翻译为目标语言句子$y$,由于没有参考译文,无法判断$y$的正确性。但通过语言模型,可以判断这个句子是否通顺、符合语法规范,这些信息可用来评估翻译模型$f$的翻译流畅性。随后,再通过对偶任务模型$g$将目标语言句子$y$翻译为源语言句子$x^{'}$。如果模型$f$$g$的翻译性能较好,那么$x^{'}$$x$会十分相似。通过计算二者的{\small\bfnew{重构损失}}\index{重构损失}(Reconstruction Loss)\index{Reconstruction Loss},就可以优化模型$f$$g$的参数。这个过程可以多次迭代,从大量的无标注单语数据上不断提升性能。
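无监督对偶学习中重构损失的计算可以示意如下。translate 为假想的翻译接口;真实系统中的重构损失作用在模型的概率分布上,这里简单地用词级差异代替,仅为说明“翻译-回译-比较”的闭环:

```python
def reconstruction_loss(x, f, g, translate):
    # x --f--> y --g--> x',以 x 与 x' 的词级差异作为重构损失的简单替代
    y = translate(f, x)       # 原始任务:源语言 -> 目标语言
    x_rec = translate(g, y)   # 对偶任务:目标语言 -> 源语言
    return sum(1 for a, b in zip(x, x_rec) if a != b) + abs(len(x) - len(x_rec))

# 玩具例子:当 f 与 g 恰好互逆时,重构损失为 0
upper = lambda _, s: [w.upper() for w in s]
lower = lambda _, s: [w.lower() for w in s]
translate = lambda model, s: model(None, s)
print(reconstruction_loss(["a", "b"], upper, lower, translate))  # 0
```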
%----------------------------------------------
\begin{figure}[htp]
......@@ -353,19 +353,7 @@
%----------------------------------------------------------------------------------------
\section{多语言翻译模型}\label{multilingual-translation-model}
\parinterval 低资源机器翻译面临的主要挑战是缺乏大规模高质量的双语数据。这个问题往往伴随着多语言的翻译任务\upcite{dabre2020survey}。也就是,要同时开发多个不同语言之间的机器翻译系统,其中少部分语言是富资源语言,而其它语言是低资源语言。针对低资源语言双语数据稀少或者缺失的情况,一种常见的思路是利用富资源语言的数据或者系统帮助低资源机器翻译系统。这也构成了多语言翻译的思想,并延伸出大量的研究工作,其中有三个典型研究方向:基于枢轴语言的方法\upcite{DBLP:journals/mt/WuW07}、 基于知识蒸馏的方法\upcite{DBLP:journals/corr/ChenLCL17}、基于迁移学习的方法\upcite{DBLP:conf/emnlp/KimPPKN19,DBLP:journals/tacl/JohnsonSLKWCTVW17},下面将对上面三种典型方法进行讨论。
%----------------------------------------------------------------------------------------
% NEW SUB-SECTION
......@@ -374,9 +362,9 @@
\subsection{基于枢轴语言的方法}
\label{sec:pivot-based-translation}
\parinterval 传统的多语言翻译中,广泛使用的是{\small\bfnew{基于枢轴语言的翻译}}(Pivot-based Translation)\upcite{DBLP:conf/emnlp/KimPPKN19,DBLP:journals/mt/WuW07}。这种方法会使用一种数据丰富的语言作为{\small\bfnew{枢轴语言}}\index{枢轴语言}(Pivot Language)\index{Pivot Language},之后让源语言向枢轴语言进行翻译,枢轴语言向目标语言进行翻译。这样,通过资源丰富的枢轴语言将源语言和目标语言桥接在一起,达到解决原翻译任务中双语数据缺乏的问题。比如,想要得到泰语到波兰语的翻译,可以通过英语做枢轴语言。通过“泰语$\to$英语$\to$波兰语”的翻译过程完成泰语到波兰语的转换。
\parinterval 在基于统计的机器翻译中,已经有很多方法建立了源语言到枢轴语言和枢轴语言到目标语言的短语/单词级别特征,并基于这些特征开发了源语言到目标语言的系统\upcite{DBLP:conf/naacl/UtiyamaI07,DBLP:conf/acl/ZahabiBK13,DBLP:conf/emnlp/ZhuHWZWZ14,DBLP:conf/acl/MiuraNSTN15},这些系统也已经广泛用于低资源翻译任务\upcite{DBLP:conf/acl/CohnL07,DBLP:journals/mt/WuW07,DBLP:conf/acl/WuW09,de2006catalan}。由于基于枢轴语言的方法与模型结构无关,该方法也适用于神经机器翻译,并且取得了不错的效果\upcite{DBLP:conf/emnlp/KimPPKN19,DBLP:journals/corr/ChengLYSX16}
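基于枢轴语言的翻译在实践中常用两步解码来近似实现:先由源语言解码出枢轴语言译文,再由枢轴语言译文解码出目标语言译文。下面用词典查表模拟两个翻译模型,仅作示意:

```python
def pivot_translate(x, src2pivot, pivot2tgt, translate):
    pivot = translate(src2pivot, x)     # 第一步:源语言 -> 枢轴语言
    return translate(pivot2tgt, pivot)  # 第二步:枢轴语言 -> 目标语言

# 玩具例子:模拟"泰语 -> 英语 -> 波兰语"的两步翻译(词形为罗马化示意)
th2en = {"sawatdi": "hello"}
en2pl = {"hello": "czesc"}
lookup = lambda model, w: model[w]
print(pivot_translate("sawatdi", th2en, en2pl, lookup))  # czesc
```

可以看到,第一步的任何错误都会原样进入第二步,这正是下文所说的错误传播问题。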
\parinterval 基于枢轴语言的方法可以被描述为如图\ref{fig:16-11}所示的过程。这里,使用虚线表示具有双语平行语料库的语言对,并使用带有箭头的实线表示翻译方向,令$\seq{x}$$\seq{y}$$\seq{p}$ 分别表示源语言、目标语言和枢轴语言,对于输入源语言句子$\seq{x}$和目标语言句子$\seq{y}$,其翻译过程可以被建模为公式\eqref{eq:16-7}
......@@ -404,7 +392,7 @@
\subsection{基于知识蒸馏的方法}
\parinterval 为了缓解基于枢轴语言的方法中存在的错误传播等问题,可以采用基于知识蒸馏的方法\upcite{DBLP:journals/corr/ChenLCL17,DBLP:conf/iclr/TanRHQZL19}。知识蒸馏是一种常用的模型压缩方法\upcite{Hinton2015Distilling},基于教师-学生框架,在第十三章已经进行了详细介绍。针对低资源翻译任务,基于教师-学生框架的方法基本思想如图\ref{fig:16-12}所示。其中,虚线表示具有平行语料库的语言对,带有箭头的实线表示翻译方向。这里,将枢轴语言($\seq{p}$)到目标语言($\seq{y}$)的翻译模型$\funp{P}(\seq{y}|\seq{p})$当作教师模型,源语言($\seq{x}$)到目标语言($\seq{y}$)的翻译模型$\funp{P}(\seq{y}|\seq{x})$当作学生模型。然后,用教师模型来指导学生模型的训练,这个过程中学习的目标就是让$\funp{P}(\seq{y}|\seq{x})$尽可能接近$\funp{P}(\seq{y}|\seq{p})$,这样学生模型就可以学习到源语言到目标语言的翻译知识。举个例子,假设图\ref{fig:16-12}$\seq{x}$为源语言德语 “hallo”,$\seq{p}$为中间语言英语 “hello”,$\seq{y}$为目标语言法语“bonjour”,则德语“hallo”翻译为法语“bonjour”的概率应该与英语“hello”翻译为法语“bonjour”的概率相近
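教师-学生框架的训练目标是让学生模型的预测分布接近教师模型,常用两个分布之间的交叉熵(或KL散度)作为损失,可以示意如下(词表与概率均为玩具数据):

```python
import math

def distill_loss(teacher_probs, student_probs):
    # 交叉熵 H(P_teacher, P_student):学生分布越接近教师分布,损失越小
    return -sum(p * math.log(student_probs[w]) for w, p in teacher_probs.items())

teacher = {"bonjour": 0.9, "salut": 0.1}
good_student = {"bonjour": 0.9, "salut": 0.1}
bad_student = {"bonjour": 0.5, "salut": 0.5}
print(distill_loss(teacher, good_student) < distill_loss(teacher, bad_student))  # True
```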
%----------------------------------------------
\begin{figure}[h]
\centering
......@@ -421,7 +409,7 @@
\label{eq:16-8}
\end{eqnarray}
\parinterval 和基于枢轴语言的方法相比,基于知识蒸馏的方法无需训练源语言到枢轴语言的翻译模型,也就无需经历两次翻译过程,翻译效率有所提升,又避免了两次翻译所面临的错误传播问题
\parinterval 不过,基于知识蒸馏的方法仍然需要显性地使用枢轴语言进行桥接,因此仍然面临着“源语言$\to$枢轴语言$\to$目标语言”转换中信息丢失的问题。比如,当枢轴语言到目标语言翻译效果较差时,由于教师模型无法提供准确的指导,学生模型也无法取得很好的学习效果。
......@@ -431,9 +419,9 @@
\subsection{基于迁移学习的方法}
\parinterval {\small\bfnew{迁移学习}}\index{迁移学习}(Transfer Learning)\index{Transfer Learning}是一种机器学习的方法,指的是一个预训练的模型被重新用在另一个任务中,而并不是从头训练一个新的模型\upcite{Hinton2015Distilling}。迁移学习的目标是将某个领域或任务上学习到的知识应用到不同但相关的领域或问题中。在机器翻译中,可以用富资源语言的知识来改进低资源语言上的机器翻译性能,也就是将富资源语言中的知识迁移到低资源语言中。
\parinterval 基于枢轴语言的方法需要显性地建立“源语言$\to$枢轴语言$\to$目标语言”的路径。这时,如果路径中某处出现了问题,就会成为整个路径的瓶颈。如果使用多个枢轴语言,这个问题就会更加严重。不同于基于枢轴语言的方法,迁移学习无需进行两步解码,也就避免了翻译路径中错误累积的问题。
\parinterval 基于迁移学习的方法思想非常简单,如图\ref{fig:16-13}所示。这种方法无需像传统的机器学习一样为每个任务单独训练一个模型,它将所有任务分类为源任务和目标任务,目标就是将源任务中的知识迁移到目标任务当中。
%----------------------------------------------
......@@ -452,7 +440,7 @@
%----------------------------------------------------------------------------------------
\subsubsection{1. 参数初始化方法}
\parinterval 在解决多语言翻译问题时,首先需要在富资源语言上训练一个翻译模型,将其称之为{\small\bfnew{父模型}}\index{父模型}(Parent Model)\index{Parent Model}。在对父模型的参数进行初始化的基础上,训练低资源语言的翻译模型,称之为{\small\bfnew{子模型}}\index{子模型}(Child Model)\index{Child Model},这意味着低资源翻译模型将不会从随机初始化的参数开始学习,而是从父模型的参数开始\upcite{gu2018meta,DBLP:conf/icml/FinnAL17,DBLP:conf/naacl/GuHDL18}。这时,也可以把参数初始化看作是迁移学习。在图\ref{fig:16-14}中,左侧模型为父模型,右侧模型为子模型。这里假设从英语到汉语的翻译为富资源翻译,从英语到德语的翻译为低资源翻译,则首先用英中双语平行语料库训练出一个初始化的父模型,之后再用英语到德语的数据在父模型上微调得到子模型,这个子模型即为迁移学习的模型。此过程可以看作是在富资源语言训练模型上对低资源语言进行微调,将富资源语言中的知识迁移到低资源语言中,从而提升低资源语言的模型性能。
\parinterval 在解决多语言翻译问题时,首先需要在富资源语言上训练一个翻译模型,将其称之为{\small\bfnew{父模型}}\index{父模型}(Parent Model)\index{Parent Model}。在对父模型的参数进行初始化的基础上,训练低资源语言的翻译模型,称之为{\small\bfnew{子模型}}\index{子模型}(Child Model)\index{Child Model},这意味着低资源翻译模型将不会从随机初始化的参数开始学习,而是从父模型的参数开始\upcite{gu2018meta,DBLP:conf/icml/FinnAL17,DBLP:conf/naacl/GuHDL18}。这时,也可以把参数初始化看作是迁移学习。在图\ref{fig:16-14}中,左侧模型为父模型,右侧模型为子模型。这里假设从英语到汉语的翻译为富资源翻译,从英语到德语的翻译为低资源翻译,则首先用英中双语平行语料库训练出一个父模型,之后再用英语到德语的数据在父模型上微调得到子模型,这个子模型即为迁移学习的模型。此过程可以看作是在富资源语言训练模型上使用低资源语言的数据进行微调,将富资源语言中的知识迁移到低资源语言中,从而提升低资源语言的模型性能。
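参数初始化这一步可以用如下示意代码说明。这里假设父模型与子模型的词表存在部分重叠:重叠部分的词嵌入直接从父模型拷贝,其余随机初始化(函数名、随机初始化方式均为示意,并非某个具体系统的实现):

```python
import numpy as np

def init_child_embeddings(parent_emb, parent_vocab, child_vocab, seed=0):
    """迁移学习中的参数初始化(示意)。
    parent_emb: (父词表大小, 维度) 的父模型词嵌入矩阵;
    parent_vocab / child_vocab: {词: 行号} 形式的词表。
    父子词表重叠的词复用父模型的嵌入,其余词随机初始化。"""
    rng = np.random.default_rng(seed)
    dim = parent_emb.shape[1]
    child_emb = rng.normal(0.0, 0.01, size=(len(child_vocab), dim))
    for w, i in child_vocab.items():
        if w in parent_vocab:
            child_emb[i] = parent_emb[parent_vocab[w]]
    return child_emb
```

其余网络参数(编码器、解码器、注意力等)通常可以直接整体拷贝,之后在低资源数据上继续训练即可完成微调。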
%----------------------------------------------
\begin{figure}[h]
......@@ -470,7 +458,7 @@
%----------------------------------------------------------------------------------------
\subsubsection{2. 多语言单模型系统} \label{sec:multi-lang-single-model}
\parinterval 多语言单模型方法也可以被看做是一种迁移学习。多语言单模型方法能够有效地改善低资源神经机器翻译性能\upcite{DBLP:journals/tacl/JohnsonSLKWCTVW17,DBLP:conf/lrec/RiktersPK18,dabre2020survey},尤其适用于翻译方向较多的情况,因为为每一个翻译方向单独训练一个模型是不实际的,不仅因为设备资源和时间上的限制,还因为很多翻译方向都没有双语平行数据。比如,要翻译100个语言之间互译的系统,理论上就需要训练$100 \times 99$个翻译模型,代价是十分巨大的。这时就需要用到{\small\bfnew{多语言单模型方法}}\index{多语言单模型方法}(Multi-lingual Single Model-based Method\index{Multi-lingual Single Model-based Method}
\parinterval {\small\bfnew{多语言单模型方法}}\index{多语言单模型方法}(Multi-lingual Single Model-based Method\index{Multi-lingual Single Model-based Method})也可以被看做是一种迁移学习。多语言单模型方法能够有效地改善低资源神经机器翻译性能\upcite{DBLP:journals/tacl/JohnsonSLKWCTVW17,DBLP:conf/lrec/RiktersPK18,dabre2020survey},尤其适用于翻译方向较多的情况,因为为每一个翻译方向单独训练一个模型是不实际的,不仅因为设备资源和时间上的限制,还因为很多翻译方向都没有双语平行数据。比如,要构建100个语言之间互译的系统,理论上就需要训练$100 \times 99$个翻译模型,代价是十分巨大的。这时就需要用到多语言单模型方法。
\parinterval 多语言单模型系统是指用单个模型训练具有多个语言翻译方向的系统。对于源语言集合$\seq{G}_x$和目标语言集合$\seq{G}_y$,多语言单模型的学习目标是学习一个单一的模型,这个模型可以进行任意源语言到任意目标语言的翻译,即同时支持所有$\{(l_x,l_y)|x \in \seq{G}_x,y \in \seq{G}_y)\}$的翻译。多语言单模型方法又可以进一步分为一对多\upcite{DBLP:conf/acl/DongWHYW15}、多对一\upcite{DBLP:journals/tacl/LeeCH17}和多对多\upcite{DBLP:conf/naacl/FiratCB16}的方法。不过这些方法本质上是相同的,因此这里以多对多翻译为例进行介绍。
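多语言单模型系统的一种常见实现方式是在源语言句子前添加目标语言标签(如“<2de>”表示翻译到德语)\upcite{DBLP:journals/tacl/JohnsonSLKWCTVW17},使得不同翻译方向的数据可以混合在一起训练单个模型。下面是该数据预处理步骤的一个示意实现(函数名与标签格式均为示意):

```python
def add_lang_tag(src_sentence, tgt_lang):
    """在源语言句子前添加目标语言标签,例如 "<2de>" 表示翻译到德语。"""
    return "<2{}> {}".format(tgt_lang, src_sentence)

def make_multilingual_corpus(bitexts):
    """将多个翻译方向的双语数据合并为统一的训练样本列表。
    bitexts: {(源语言, 目标语言): [(源句, 目标句), ...]}"""
    corpus = []
    for (src_lang, tgt_lang), pairs in bitexts.items():
        for src, tgt in pairs:
            corpus.append((add_lang_tag(src, tgt_lang), tgt))
    return corpus
```

训练时模型通过这个标签学习“翻译到哪种语言”,推断时只需在输入前加上想要的目标语言标签即可。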
......@@ -496,7 +484,7 @@
\section{无监督机器翻译}
\parinterval 低资源机器翻译的一种极端情况是:没有任何可以用于模型训练的双语平行数据。一种思路是借用多语言翻译方面的技术(见\ref{multilingual-translation-model}节),利用基于枢轴语言或者零资源的方法构建翻译系统。但是,这类方法仍然需要多个语种的平行数据。对于某一个语言对,在只有源语言和目标语言单语数据的前提下,是否仍然可以训练一个有效的翻译模型呢?这里称这种不需要双语数据的机器翻译方法为{\small\bfnew{无监督机器翻译}}\index{无监督机器翻译}(Un-supervised Machine Translation\index{Un-supervised Machine Translation})。
\parinterval 低资源机器翻译的一种极端情况是:没有任何可以用于模型训练的双语平行数据。一种思路是借用多语言翻译方面的技术(见\ref{multilingual-translation-model}节),利用基于枢轴语言或者零资源的方法构建翻译系统。但是,这类方法仍然需要多个语种的平行数据。对于某一个语言对,在只有源语言和目标语言单语数据的前提下,是否仍然可以训练一个有效的翻译模型呢?这里称这种不需要双语数据的机器翻译方法为{\small\bfnew{无监督机器翻译}}\index{无监督机器翻译}(Unsupervised Machine Translation\index{Unsupervised Machine Translation})。
\parinterval 直接进行无监督机器翻译是很困难的。一个简单可行的思路是把问题进行分解,然后分别解决各个子问题,最后形成完整的解决方案。放到无监督机器翻译里面,可以首先使用无监督方法寻找词与词之间的翻译,然后在此基础上,进一步得到句子到句子的翻译模型。这种“由小到大”的建模思路十分类似于统计机器翻译中的方法(见\chapterseven)。
......@@ -507,7 +495,7 @@
\subsection{无监督词典归纳}\label{unsupervised-dictionary-induction}
\parinterval {\small\bfnew{词典归纳}}\index{词典归纳}(Bilingual Dictionary Induction,BDI\index{Bilingual Dictionary Induction},也叫{\small\bfnew{词典推断}},是实现语种间单词级别翻译的任务。在统计机器翻译中,词典归纳是一项核心的任务,它从双语平行语料中发掘互为翻译的单词,是翻译知识的主要来源\upcite{黄书剑0统计机器翻译中的词对齐研究}。在端到端神经机器翻译中,词典归纳通常被用到无监督机器翻译、多语言机器翻译等任务中。在神经机器翻译中,单词通过实数向量来表示,即词嵌入。所有单词分布在一个多维空间中,而且研究人员发现:词嵌入空间在各种语言中显示出类似的结构,这使得直接利用词嵌入来构建双语词典成为可能\upcite{DBLP:journals/corr/MikolovLS13}。其基本想法是先将来自不同语言的词嵌入投影到共享嵌入空间中,然后在这个共享空间中归纳出双语词典,原理如图\ref{fig:16-16}所示。较早的尝试是使用一个包含数千词对的种子词典作为锚点来学习从源语言到目标语词言嵌入空间的线性映射,将两个语言的单词投影到共享的嵌入空间之后,执行一些对齐算法即可得到双语词典\upcite{DBLP:journals/corr/MikolovLS13}。最近的研究表明,词典归纳可以在更弱的监督信号下完成,这些监督信号来自更小的种子词典\upcite{DBLP:conf/acl/VulicK16}、 相同的字符串\upcite{DBLP:conf/iclr/SmithTHH17},甚至仅仅是共享的数字\upcite{DBLP:conf/acl/ArtetxeLA17}
\parinterval {\small\bfnew{词典归纳}}\index{词典归纳}(Bilingual Dictionary Induction,BDI\index{Bilingual Dictionary Induction})可用于处理语种间单词级别翻译的任务。在统计机器翻译中,词典归纳是一项核心的任务,它从双语平行语料中发掘互为翻译的单词,是翻译知识的主要来源\upcite{黄书剑0统计机器翻译中的词对齐研究}。在端到端神经机器翻译中,词典归纳通常被用到无监督机器翻译、多语言机器翻译等任务中。在神经机器翻译中,单词通过实数向量来表示,即词嵌入。所有单词分布在一个多维空间中,而且研究人员发现:词嵌入空间在各种语言中显示出类似的结构,这使得直接利用词嵌入来构建双语词典成为可能\upcite{DBLP:journals/corr/MikolovLS13}。其基本想法是先将来自不同语言的词嵌入投影到共享嵌入空间中,然后在这个共享空间中归纳出双语词典,原理如图\ref{fig:16-16}所示。较早的尝试是使用一个包含数千词对的种子词典作为锚点来学习从源语言到目标语言词嵌入空间的线性映射,将两个语言的单词投影到共享的嵌入空间之后,执行一些对齐算法即可得到双语词典\upcite{DBLP:journals/corr/MikolovLS13}。最近的研究表明,词典归纳可以在更弱的监督信号下完成,这些监督信号来自更小的种子词典\upcite{DBLP:conf/acl/VulicK16}、 相同的字符串\upcite{DBLP:conf/iclr/SmithTHH17},甚至仅仅是共享的数字\upcite{DBLP:conf/acl/ArtetxeLA17}。
%----------------------------------------------
\begin{figure}[h]
\centering
......@@ -566,7 +554,7 @@
\parinterval 在得到映射$\mathbi{W}$之后,对于$\mathbi{X}$中的任意一个单词$x_{i}$,通过$\mathbi{W} \mathbi{E}({x}_{i})$将其映射到空间$\mathbi{Y}$中($\mathbi{E}({x}_{i})$表示的是单词$x_{i}$的词嵌入向量),然后在$\mathbi{Y}$中找到该点的最近邻点$y_{j}$,于是$y_{j}$就是$x_{i}$的翻译词,重复该过程即可归纳出种子词典$D$,第一阶段结束。事实上,由于第一阶段缺乏监督信号,得到的种子词典$D$会包含大量的噪音,因此需要进行进一步的微调。
\parinterval 微调的原理普遍基于普氏分析\upcite{DBLP:journals/corr/MikolovLS13}。假设现在有一个种子词典$D=\left\{x_{i}, y_{i}\right\}$其中${i \in\{1, n\}}$,和两个单语词嵌入$\mathbi{X}$$\mathbi{Y}$,那么就可以将$D$作为{\small\bfnew{映射锚点}}\index{映射锚点}(Anchor\index{Anchor})学习一个转移矩阵$\mathbi{W}$,使得$\mathbi{W} \mathbi{X}$$\mathbi{Y}$这两个空间尽可能相近,此外通过对$\mathbi{W}$施加正交约束可以显著提高能\upcite{DBLP:conf/naacl/XingWLL15},于是这个优化问题就转变成了{\small\bfnew{普鲁克问题}}\index{普鲁克问题}(Procrustes Problem\index{Procrustes Problem}\upcite{DBLP:conf/iclr/SmithTHH17},可以通过{\small\bfnew{奇异值分解}}\index{奇异值分解}(Singular Value Decomposition,SVD\index{Singular Value Decomposition})来获得近似解:
\parinterval 微调的原理普遍基于普氏分析\upcite{DBLP:journals/corr/MikolovLS13}。假设现在有一个种子词典$D=\left\{x_{i}, y_{i}\right\}$其中${i \in\{1, \dots, n\}}$,和两个单语词嵌入$\mathbi{X}$和$\mathbi{Y}$,那么就可以将$D$作为{\small\bfnew{映射锚点}}\index{映射锚点}(Anchor\index{Anchor})学习一个转移矩阵$\mathbi{W}$,使得$\mathbi{W} \mathbi{X}$与$\mathbi{Y}$这两个空间尽可能相近,此外通过对$\mathbi{W}$施加正交约束可以显著提高效果\upcite{DBLP:conf/naacl/XingWLL15},于是这个优化问题就转变成了{\small\bfnew{普鲁克问题}}\index{普鲁克问题}(Procrustes Problem\index{Procrustes Problem})\upcite{DBLP:conf/iclr/SmithTHH17},可以通过{\small\bfnew{奇异值分解}}\index{奇异值分解}(Singular Value Decomposition,SVD\index{Singular Value Decomposition})来获得近似解:
\begin{eqnarray}
\mathbi{W}^{\star} & = &\underset{\mathbi{W} \in O_{d}(\mathbb{R})}{\operatorname{argmin}}\|\mathbi{W} \mathbi{X}'- \mathbi{Y}' \|_{\mathrm{F}} \nonumber \\
......@@ -575,7 +563,7 @@
\label{eq:16-10}
\end{eqnarray}
\noindent 其中, $\|\cdot\|_{\mathrm{F}}$表示矩阵的Frobenius范数,即矩阵元素绝对值的平方和再开方,$d$embedding的维度,$\mathbb{R}$是实数,$O_d(\mathbb{R})$表示$d\times d$的实数空间,$\operatorname{SVD}(\cdot)$表示奇异值分解,$\mathbi{Y}'$$\mathbi{X}'$中的单词来自$D$且行对齐。利用上式可以获得新的$\mathbi{W}$,通过$\mathbi{W}$可以归纳出新的$D$,如此迭代进行微调最后即可以得到收敛的$D$
\noindent 其中, $\|\cdot\|_{\mathrm{F}}$表示矩阵的Frobenius范数,即矩阵元素绝对值的平方和再开方,$d$词嵌入的维度,$\mathbb{R}$是实数,$O_d(\mathbb{R})$表示$d\times d$的实数空间,$\operatorname{SVD}(\cdot)$表示奇异值分解,$\mathbi{Y}'$$\mathbi{X}'$中的单词来自$D$且行对齐。利用上式可以获得新的$\mathbi{W}$,通过$\mathbi{W}$可以归纳出新的$D$,如此迭代进行微调最后即可以得到收敛的$D$
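上述求解过程可以直接用NumPy实现。下面的示意代码按行存放词向量(即求正交矩阵$\mathbi{W}$使$\mathbi{X}\mathbi{W}^{\rm T} \approx \mathbi{Y}$,与正文中按列存放的写法等价),并给出基于最近邻的词典归纳步骤(函数名均为示意):

```python
import numpy as np

def procrustes(X, Y):
    """求解普鲁克问题:在正交约束下最小化 ||W X' - Y'||_F。
    X, Y: (n, d),行对齐的种子词典两侧的词嵌入。
    解为 W* = U V^T,其中 U Sigma V^T = SVD(Y'^T X') 对应这里的 SVD(Y^T X)。"""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

def induce_dictionary(W, X_emb, Y_emb):
    """将源语言词嵌入映射到目标语言空间后,按余弦相似度取最近邻归纳词典。
    返回每个源语言词对应的目标语言词下标。"""
    mapped = X_emb @ W.T
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    Yn = Y_emb / np.linalg.norm(Y_emb, axis=1, keepdims=True)
    return (mapped @ Yn.T).argmax(axis=1)
```

用新的$\mathbi{W}$归纳出新的词典,再用新词典重新求解$\mathbi{W}$,交替迭代即可实现正文所述的微调过程。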
\parinterval 较早的无监督方法是基于GAN的方法\upcite{DBLP:conf/acl/ZhangLLS17,DBLP:conf/emnlp/ZhangLLS17,DBLP:conf/iclr/LampleCRDJ18},这是一个很自然的想法,利用生成器产生映射然后用判别器来区别两个空间。然而研究表明GAN缺乏稳定性,容易在低资源语言对上失败\upcite{hartmann2018empirical},因此有不少改进的工作,比如:利用{\small\bfnew{变分自编码器}}\index{变分自编码器}(Variational Autoencoders,VAEs)\index{Variational Autoencoders}来捕获更深层次的语义信息并结合对抗训练的方法\upcite{DBLP:conf/emnlp/DouZH18,DBLP:conf/naacl/MohiuddinJ19};通过改进最近邻点的度量函数来提升性能的方法\upcite{DBLP:conf/acl/HuangQC19,DBLP:conf/emnlp/JoulinBMJG18};利用多语言信号来提升性能的方法\upcite{DBLP:conf/emnlp/ChenC18,DBLP:conf/emnlp/TaitelbaumCG19,DBLP:journals/corr/abs-1811-01124,DBLP:conf/naacl/HeymanVVM19};也有一些工作舍弃GAN,通过直接优化空间距离来进行单词的匹配\upcite{DBLP:conf/emnlp/HoshenW18,DBLP:conf/emnlp/XuYOW18,DBLP:conf/emnlp/Alvarez-MelisJ18,DBLP:conf/emnlp/MukherjeeYH18}。此外,也有一些工作旨在分析或提升无监督词典归纳的健壮性,例如,通过大量实验来分析无监督词典归纳任务的局限性、难点以及挑战\upcite{DBLP:conf/acl/SogaardVR18,DBLP:conf/acl/OrmazabalALSA19,DBLP:conf/emnlp/VulicGRK19,DBLP:conf/emnlp/HartmannKS18};分析和对比目前各种无监督方法的性能\upcite{DBLP:conf/nips/HartmannKS19};通过实验分析目前所用的数据集存在的问题\upcite{DBLP:conf/emnlp/Kementchedjhieva19}
......@@ -592,7 +580,7 @@
\item 词典归纳依赖于基于大规模单语数据训练出来的词嵌入,而词嵌入会受到单语数据的来源、数量、词向量训练算法、超参数配置等多方面因素的影响,这很容易导致不同情况下词嵌入结果的差异很大。
\vspace{0.5em}
\item 词典归纳强烈依赖于词嵌入空间近似同构的假设,然而许多语言之间天然的差异导致该假设并不成立。因为无监督系统通常是基于两阶段的方法,起始阶段由于缺乏监督信号的引导很容易就失败,从而导致后面的阶段无法有效运行\upcite{DBLP:conf/acl/SogaardVR18,A2020Li}
\item 词典归纳强烈依赖于词嵌入空间近似同构的假设,然而许多语言之间天然的差异导致该假设并不成立。因为无监督系统通常是基于两阶段的方法,起始阶段由于缺乏监督信号很难得到质量较高的种子词典,进而导致后续阶段无法完成准确的词典归纳\upcite{DBLP:conf/acl/SogaardVR18,A2020Li}
\vspace{0.5em}
\item 由于词嵌入这种表示方式的局限性,模型无法实现单词多对多的对齐,而且对于一些相似的词或者实体,模型也很难实现对齐。
......@@ -607,7 +595,7 @@
\subsection{无监督统计机器翻译}
\parinterval 在无监督词典归纳的基础上,可以进一步得到句子间的翻译,实现无监督机器翻译\upcite{DBLP:journals/talip/MarieF20}。统计机器翻译作为机器翻译的主流方法,对其进行无监督学习可以帮助构建初始的无监督机器翻译系统。这样,它可以进一步被用于训练更为先进的无监督神经机器翻译系统。以基于短语的统计机器翻译为例,系统主要包含短语表、语言模型、调序模型以及权重调优等模块(见{\chapterseven})。其中短语表和模型调优需要双语数据,而语言模型和调序模型只依赖于单语数据。因此,如果可以通过无监督的方法完成短语表和权重调优,那么就得到了无监督统计机器翻译系统\upcite{DBLP:conf/emnlp/ArtetxeLA18}
\parinterval 在无监督词典归纳的基础上,可以进一步得到句子间的翻译,实现无监督机器翻译\upcite{DBLP:journals/talip/MarieF20}。统计机器翻译作为机器翻译的主流方法,对其进行无监督学习可以帮助构建初始的无监督机器翻译系统,从而进一步帮助训练更为先进的无监督神经机器翻译系统。以基于短语的统计机器翻译为例,系统主要包含短语表、语言模型、调序模型以及权重调优等模块(见{\chapterseven})。其中短语表和模型调优需要双语数据,而语言模型和调序模型只依赖于单语数据。因此,如果可以通过无监督的方法完成短语表和权重调优,那么就得到了无监督统计机器翻译系统\upcite{DBLP:conf/emnlp/ArtetxeLA18}
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
......@@ -683,9 +671,9 @@
%----------------------------------------------------------------------------------------
\subsubsection{4. 其它问题}
\parinterval 实际上无监督神经机器翻译模型的训练并不简单。一般可以认为,在生成的伪数据上优化模型会使模型变得更好,这时候对这个更好的模型使用数据增强的手段(如回译等)就可以生成更好的训练数据。这样的一个数据优化过程依赖于一个假设:模型经过参数优化后会生成比原始数据更好的数据。而在数据优化和参数优化的共同影响下,模型非常容易拟合数据中的简单模式,使得在数据优化过程中模型倾向产生包含这种简单模式的数据,然后模型对这种类型数据过拟合,最后训练模型的损失可以下降到很低,然而模型生成的结果却非常差。一个常见的问题就是模型对任何输入都输出相同的译文,这时候翻译模型无法产生任何有意义的结果,而它的训练过程则退化成普通的语言模型(数据优化产生的数据里无论什么目标语言对应的源语言都是同一个句子)。这种情况下翻译模型虽然能降低损失(训练语言模型),但是它不能学会任何源语言跟目标语言之间的对应关系,也就无法进行正确翻译。这个现象也反映出无监督机器翻译训练的脆弱性。
\parinterval 实际上无监督神经机器翻译模型的训练并不简单。一般可以认为,在生成的伪数据上优化模型会使模型变得更好,这时候对这个更好的模型使用数据增强的手段(如回译等)就可以生成更好的训练数据。这样的一个数据优化过程依赖于一个假设:模型经过参数优化后会生成比原始数据更好的数据。而在数据优化和参数优化的共同影响下,模型非常容易拟合数据中的简单模式,使得在数据优化过程中模型倾向产生包含这种简单模式的数据,造成模型对这种类型数据过拟合的现象。一个常见的问题就是模型对任何输入都输出相同的译文,这时候翻译模型无法产生任何有意义的结果,而它的训练过程则退化成普通的语言模型(数据优化产生的数据里无论什么目标语言对应的源语言都是同一个句子)。这种情况下翻译模型虽然能降低损失(训练语言模型),但是它不能学会任何源语言跟目标语言之间的对应关系,也就无法进行正确翻译。这个现象也反映出无监督机器翻译训练的脆弱性。
\parinterval 比较常见的解决方案是在双语数据对应的目标函数外增加一个语言模型的目标函数。因为,在初始阶段,由于数据中存在大量不通顺的句子,额外的语言模型目标函数能把部分句子纠正过来,使得模型逐渐生成更好的数据\upcite{DBLP:conf/emnlp/LampleOCDR18}。这个方法在实际中非常有效,尽管目前还没有太多理论上的支持。
\parinterval 比较常见的解决方案是在双语数据对应的目标函数外增加一个语言模型的目标函数。因为,在初始阶段,由于数据中存在大量不通顺的句子,额外的语言模型目标函数能把部分句子纠正过来,使得模型逐渐生成更好的数据\upcite{DBLP:conf/emnlp/LampleOCDR18}。这个方法在实际应用中非常有效,尽管目前还没有太多理论上的支持。
\parinterval 无监督神经机器翻译还有两个关键的技巧:
\begin{itemize}
......@@ -733,7 +721,7 @@
\end{table}
%----------------------------------------------
\parinterval 实际当中三种形式的噪声函数都会被使用到,其中在交换方法中距离越相近的词越容易被交换,并且保证被交换的词的对数有限,而删除和空白方法里词的删除和替换概率通常都会设置的非常低,如$0.1$等。
\parinterval 实际应用中以上三种形式的噪声函数都会被使用到,其中在交换方法中距离越相近的词越容易被交换,并且保证被交换的词的对数有限,而删除和空白方法里词的删除和替换概率通常都会设置得非常低,如$0.1$等。
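以上三种噪声函数可以用如下示意代码实现(其中概率取值、距离上限与占位符“<blank>”均为示例设定,并非某个系统的固定配置):

```python
import random

def noise_swap(words, max_dist=2, seed=None):
    """交换:只在不超过 max_dist 的距离内交换词,保证交换是局部的。"""
    rng = random.Random(seed)
    words = words[:]
    for i in range(len(words) - 1):
        if rng.random() < 0.5:
            j = min(i + rng.randint(1, max_dist), len(words) - 1)
            words[i], words[j] = words[j], words[i]
    return words

def noise_delete(words, p=0.1, seed=None):
    """删除:以较低概率 p 删除句中的词。"""
    rng = random.Random(seed)
    kept = [w for w in words if rng.random() >= p]
    return kept or words[:1]   # 避免整句被删空

def noise_blank(words, p=0.1, blank="<blank>", seed=None):
    """空白:以较低概率 p 将词替换为占位符。"""
    rng = random.Random(seed)
    return [blank if rng.random() < p else w for w in words]
```

降噪自编码器的训练即以加噪后的句子为输入、原句为输出,迫使模型学会恢复句子的真实词序和内容。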
%----------------------------------------------------------------------------------------
% NEW SECTION 16.5
......@@ -824,7 +812,7 @@
%----------------------------------------------------------------------------------------
\subsubsection{1. 多目标学习}
\parinterval 在使用多领域数据时,混合多个相差较大的领域数据进行训练会使单个领域的翻译性能下降\upcite{DBLP:conf/eacl/NegriTFBF17}。 为了解决这一问题,可以对所有训练数据的来源领域进行区分,一个比较典型的做法是在使用多领域数据训练时,在神经机器翻译模型的编码器顶部中添加一个判别器\upcite{DBLP:conf/wmt/BritzLP17},该判别器使用源语言句子$x$的编码器表示作为输入,预测句子所属的领域标签$d$,如图\ref{fig:16-20}所示。为了使预测领域标签$d$的正确概率$\funp{P(d|\mathbi{H})}$最大(其中$\mathbi{H}$为编码器的隐藏状态),模型在训练过程中最小化如下损失函数$\funp{L}_{\rm{disc}}$
\parinterval 在使用多领域数据时,混合多个相差较大的领域数据进行训练会使单个领域的翻译性能下降\upcite{DBLP:conf/eacl/NegriTFBF17}。 为了解决这一问题,可以对所有训练数据的来源领域进行区分,一个比较典型的做法是在使用多领域数据训练时,在神经机器翻译模型的编码器顶部添加一个判别器\upcite{britz2017effective},该判别器使用源语言句子$x$的编码器表示作为输入,预测句子所属的领域标签$d$,如图\ref{fig:16-20}所示。为了使预测领域标签$d$的正确概率$\funp{P}(d|\mathbi{H})$最大(其中$\mathbi{H}$为编码器的隐藏状态),模型在训练过程中最小化如下损失函数$\funp{L}_{\rm{disc}}$:
\begin{eqnarray}
\funp{L}_{\rm{disc}}& = &-\log\funp{P}(d|\mathbi{H})
......@@ -895,13 +883,13 @@
\item 如何更高效地利用已有双语数据或单语数据进行数据增强始终是一个热点问题。研究人员分别探索了源语言单语和目标语言单语的使用方法\upcite{DBLP:conf/emnlp/ZhangZ16,DBLP:conf/emnlp/WuWXQLL19,DBLP:conf/acl/XiaKAN19},以及如何对已有双语数据进行修改\upcite{DBLP:conf/emnlp/WangPDN18,DBLP:conf/acl/GaoZWXQCZL19}。经过数据增强得到的伪数据的质量时好时坏,如何提高伪数据的质量,以及更好地利用伪数据进行训练也是十分重要的问题\upcite{DBLP:conf/emnlp/FadaeeM18,DBLP:conf/nlpcc/XuLXLLXZ19,DBLP:conf/wmt/CaswellCG19,DBLP:journals/corr/abs200403672,DBLP:conf/emnlp/WangLWLS19}。此外,还有一些工作对数据增强技术进行了理论分析\upcite{DBLP:conf/emnlp/LiLHZZ19,DBLP:conf/acl/MarieRF20}
\vspace{0.5em}
\item 预训练模型也是自然语言处理的重要突破之一,也给低资源机器翻译提供了新的思路。除了基于语言模型或掩码语言模型的方法,也有很多新的架构和模型被提出,如排列语言模型、降噪自编码器等\upcite{DBLP:conf/nips/YangDYCSL19,DBLP:conf/acl/LewisLGGMLSZ20,DBLP:conf/iclr/LanCGGSS20,DBLP:conf/acl/ZhangHLJSL19}。预训练技术也逐渐向多语言领域扩展\upcite{DBLP:conf/nips/ConneauL19,DBLP:conf/emnlp/HuangLDGSJZ19,song2019mass},甚至不再只局限于文本任务\upcite{DBLP:conf/iccv/SunMV0S19,DBLP:journals/corr/abs-2010-12831,DBLP:conf/nips/LuBPL19,DBLP:conf/interspeech/ChuangLLL20}。对于如何将预训练模型高效地应用到下游任务中,也进行了很多的经验性对比与分析\upcite{DBLP:journals/corr/abs-1802-05365,DBLP:conf/rep4nlp/PetersRS19,DBLP:conf/cncl/SunQXH19}。但将预训练模型应用于下游任务存在的一个问题是,模型巨大的参数量会带来较大的延时及显存消耗。因此,很多工作对如何压缩预训练模型进行了研究\upcite{shen2020q,Lan2020ALBERTAL,DBLP:journals/corr/abs-1910-01108,Jiao2020TinyBERTDB}
\item 预训练模型也是自然语言处理的重要突破之一,也给低资源机器翻译提供了新的思路。除了基于语言模型或掩码语言模型的方法,也有很多新的架构和模型被提出,如排列语言模型、降噪自编码器等\upcite{DBLP:conf/nips/YangDYCSL19,DBLP:conf/acl/LewisLGGMLSZ20,DBLP:conf/iclr/LanCGGSS20,DBLP:conf/acl/ZhangHLJSL19}。预训练技术也逐渐向多语言领域扩展\upcite{DBLP:conf/nips/ConneauL19,DBLP:conf/emnlp/HuangLDGSJZ19,song2019mass},甚至不再只局限于文本任务\upcite{DBLP:conf/iccv/SunMV0S19,DBLP:journals/corr/abs-2010-12831,DBLP:conf/nips/LuBPL19,DBLP:conf/interspeech/ChuangLLL20}。对于如何将预训练模型高效地应用到下游任务中,也进行了很多的经验性对比与分析\upcite{Peters2018DeepCW,DBLP:conf/rep4nlp/PetersRS19,DBLP:conf/cncl/SunQXH19}。但将预训练模型应用于下游任务存在的一个问题是,模型巨大的参数量会带来较大的延时及显存消耗。因此,很多工作对如何压缩预训练模型进行了研究\upcite{shen2020q,Lan2020ALBERTAL,DBLP:journals/corr/abs-1910-01108,Jiao2020TinyBERTDB}
\vspace{0.5em}
\item 多任务学习是多语言翻译的一种典型方法。通过共享编码器模块或是注意力模块来进行一对多\upcite{DBLP:conf/acl/DongWHYW15}或多对一\upcite{DBLP:journals/tacl/LeeCH17}或多对多\upcite{DBLP:conf/naacl/FiratCB16} 的学习,然而这些方法需要为每个翻译语言对设计单独的编码器和解码器,限制了其扩展性。为了解决以上问题,研究人员进一步探索了用于多语言翻译的单个机器翻译模型的方法,也就是本章提到的多语言单模型系统\upcite{DBLP:journals/corr/HaNW16,DBLP:journals/tacl/JohnsonSLKWCTVW17}。为了弥补多语言单模型系统中缺乏语言表示多样性的问题,可以重新组织分享模块,设计特定任务相关模块\upcite{DBLP:conf/coling/BlackwoodBW18,DBLP:conf/wmt/SachanN18,DBLP:conf/wmt/LuKLBZS18,DBLP:conf/acl/WangZZZXZ19};也可以将多语言单词编码和语言聚类分离,用一种多语言词典编码框架智能地共享词汇级别的信息,有助于语言间的泛化\upcite{DBLP:conf/iclr/WangPAN19};还可以将语言聚类为不同的组,并为每个聚类单独训练一个多语言模型\upcite{DBLP:conf/emnlp/TanCHXQL19}
\vspace{0.5em}
\item 零资源翻译也是近几年受到广泛关注的研究方向\upcite{firat2016zero,DBLP:journals/corr/abs-1805-10338}。在零资源翻译中,仅使用少量并行语料库(覆盖$k$个语言),单个多语言翻译模型就能在任何$k(k-1)$个语言对之间进行翻译\upcite{DBLP:conf/naacl/Al-ShedivatP19}。 但是,零资源翻译的性能通常很不稳定并且明显落后于有监督的翻译方法。为了改善零资源翻译,可以开发新的跨语言正则化方法,例如对齐正则化方法\upcite{DBLP:journals/corr/abs-1903-07091},一致性正则化方法\upcite{DBLP:conf/naacl/Al-ShedivatP19};也可以通过反向翻译或基于枢轴语言的翻译生成伪数据\upcite{DBLP:conf/acl/GuWCL19,DBLP:conf/emnlp/FiratSAYC16,DBLP:conf/emnlp/CurreyH19}
\item 零资源翻译也是近几年受到广泛关注的研究方向\upcite{firat2016zero,DBLP:journals/corr/abs-1805-10338}。在零资源翻译中,仅使用少量并行语料库(覆盖$k$个语言),单个多语言翻译模型就能在任何$k(k-1)$个语言对之间进行翻译\upcite{DBLP:conf/naacl/Al-ShedivatP19}。 但是,零资源翻译的性能通常很不稳定并且明显落后于有监督的翻译方法。为了改善零资源翻译,可以开发新的跨语言正则化方法,例如对齐正则化方法\upcite{DBLP:journals/corr/abs-1903-07091},一致性正则化方法\upcite{DBLP:conf/naacl/Al-ShedivatP19};也可以通过反向翻译或基于枢轴语言的翻译生成伪数据\upcite{DBLP:conf/acl/GuWCL19,firat2016zero,DBLP:conf/emnlp/CurreyH19}
\vspace{0.5em}
\end{itemize}
......
......@@ -18,7 +18,7 @@
\node[anchor=south,font=\footnotesize,inner sep=0pt] (cache)at ([yshift=2em,xshift=1.5em]key.north){\small\bfnew{Cache}};
\node[draw,anchor=east,minimum size=1.8em,fill=orange!15] (dt) at ([yshift=2.1em,xshift=-4em]key.west){${\mathbi{d}}_{t}$};
\node[anchor=north,font=\footnotesize] (readlab) at ([xshift=2.8em,yshift=0.3em]dt.north){\red{reading}};
\node[anchor=north,font=\footnotesize] (readlab) at ([xshift=2.8em,yshift=0.3em]dt.north){\red{读取}};
\node[draw,anchor=east,minimum size=1.8em,fill=ugreen!15] (st) at ([xshift=-3.7em]dt.west){${\mathbi{s}}_{t}$};
\node[draw,anchor=east,minimum size=1.8em,fill=red!15] (st2) at ([xshift=-0.85em,yshift=3.5em]dt.west){$ \widetilde{\mathbi{s}}_{t}$};
......@@ -27,10 +27,10 @@
\draw[-,thick] (add.0) -- (add.180);
\draw[-,thick] (add.90) -- (add.-90);
\node[anchor=north,inner sep=0pt,font=\footnotesize,text=red] at ([xshift=-0.08em,yshift=-1em]add.south){combining};
\node[anchor=north,inner sep=0pt,font=\footnotesize,text=red] at ([xshift=-0em,yshift=-0.5em]add.south){融合};
\node[draw,anchor=east,minimum size=1.8em,fill=yellow!15] (ct) at ([xshift=-2em,yshift=-3.5em]st.west){$ {\mathbi{C}}_{t}$};
\node[anchor=north,font=\footnotesize] (matchlab) at ([xshift=6.7em,yshift=-0.1em]ct.north){\red{mathching}};
\node[anchor=north,font=\footnotesize] (matchlab) at ([xshift=6.7em,yshift=-0.1em]ct.north){\red{匹配}};
\node[anchor=east] (y) at ([xshift=-6em,yshift=1em]st.west){$\mathbi{y}_{t-1}$};
......
......@@ -493,7 +493,7 @@
\parinterval 从对篇章信息建模的角度看,在统计机器翻译时代就已经有大量的研究工作。这些工作大多针对某一具体的上下文现象,比如,篇章结构\upcite{DBLP:conf/anlp/MarcuCW00,foster2010translating,DBLP:conf/eacl/LouisW14}、代词回指\upcite{DBLP:conf/iwslt/HardmeierF10,DBLP:conf/wmt/NagardK10,DBLP:conf/eamt/LuongP16,}、词汇衔接\upcite{tiedemann2010context,DBLP:conf/emnlp/GongZZ11,DBLP:conf/ijcai/XiongBZLL13,xiao2011document}和篇章连接词\upcite{DBLP:conf/sigdial/MeyerPZC11,DBLP:conf/hytra/MeyerP12,}等。但是由于统计机器翻译本身流程复杂,依赖于许多组件和针对上下文现象所精心构造的特征,其建模方法相对比较困难。到了神经机器翻译时代,端到端建模给篇章级翻译提供了新的视角,相关工作不断涌现并且取得了很好的进展\upcite{DBLP:journals/corr/abs-1912-08494}
\parinterval
区别于篇章级统计机器翻译,篇章级神经机器翻译不需要针对某一具体的上下文现象构造相应的特征,而是通过翻译模型本身从上下文句子中抽取和融合的上下文信息。通常情况下,篇章级机器翻译可以采用局部建模的手段将前一句或者周围几句作为上下文送入模型。针对需要长距离上下文的情况,也可以使用全局建模的手段直接从篇章中所有句子中提取上下文信息。近几年多数研究工作都在探索更有效的局部建模或全局建模方法,主要包括改进输入\upcite{DBLP:conf/discomt/TiedemannS17,DBLP:conf/naacl/BawdenSBH18,DBLP:conf/wmt/GonzalesMS17,DBLP:journals/corr/abs-1910-07481}、多编码器结构\upcite{DBLP:journals/corr/JeanLFC17,DBLP:conf/acl/TitovSSV18,DBLP:conf/emnlp/ZhangLSZXZL18}、层次结构\upcite{DBLP:conf/naacl/MarufMH19,DBLP:conf/acl/HaffariM18,DBLP:conf/emnlp/YangZMGFZ19,DBLP:conf/ijcai/ZhengYHCB20}以及基于缓存的方法\upcite{DBLP:conf/coling/KuangXLZ18,DBLP:journals/tacl/TuLSZ18}四类。
区别于篇章级统计机器翻译,篇章级神经机器翻译不需要针对某一具体的上下文现象构造相应的特征,而是通过翻译模型本身从上下文句子中抽取和融合上下文信息。通常情况下,篇章级机器翻译可以采用局部建模的手段将前一句或者周围几句作为上下文送入模型。针对需要长距离上下文的情况,也可以使用全局建模的手段直接从篇章中所有句子中提取上下文信息。近几年多数研究工作都在探索更有效的局部建模或全局建模方法,主要包括改进输入\upcite{DBLP:conf/discomt/TiedemannS17,DBLP:conf/naacl/BawdenSBH18,DBLP:conf/wmt/GonzalesMS17,DBLP:journals/corr/abs-1910-07481}、多编码器结构\upcite{DBLP:journals/corr/JeanLFC17,DBLP:journals/corr/abs-1805-10163,DBLP:conf/emnlp/ZhangLSZXZL18}、层次结构\upcite{DBLP:conf/naacl/MarufMH19,DBLP:conf/acl/HaffariM18,DBLP:conf/emnlp/YangZMGFZ19,DBLP:conf/ijcai/ZhengYHCB20}以及基于缓存的方法\upcite{DBLP:conf/coling/KuangXLZ18,DBLP:journals/tacl/TuLSZ18}四类。
\parinterval 此外,篇章级机器翻译面临的另外一个挑战是数据稀缺。篇章级机器翻译所需要的双语数据需要保留篇章边界,数量相比于句子级双语数据要少很多。除了在之前提到的端到端方法中采用预训练或者参数共享的手段(见{\chaptersixteen}),也可以采用新的建模手段来缓解数据稀缺问题。比如,在句子级翻译模型的推断过程中,通过篇章级语言模型在目标端引入上下文信息\upcite{DBLP:conf/discomt/GarciaCE19,DBLP:journals/tacl/YuSSLKBD20,DBLP:journals/corr/abs-2010-12827},或者对句子级的翻译结果进行修正\upcite{DBLP:conf/aaai/XiongH0W19,DBLP:conf/acl/VoitaST19,DBLP:conf/emnlp/VoitaST19}{\color{red} 如何修正?用什么修正?修正什么?感觉这句话没有信息量})。
......@@ -515,7 +515,7 @@
\subsection{篇章级翻译的建模}
\parinterval 在理想情况下,篇章级翻译应该以整个篇章为单位作为模型的输入和输出。然而由于现实中篇章对应的序列过长,因此直接建模整个篇章的词序列难度很大,这使得主流的序列到序列模型很难直接使用。一种思路是采用能够处理超长序列的模型对篇章序列建模,比如,使用{\chapterfifteen}中提到的处理长序列的Transformer模型就是一种的解决方法\upcite{DBLP:conf/iclr/KitaevKL20}。不过,这类模型并不针对篇章级翻译的具体问题,因此并不是篇章级翻译中的主流方法。
\parinterval 在理想情况下,篇章级翻译应该以整个篇章为单位作为模型的输入和输出。然而由于现实中篇章对应的序列过长,因此直接建模整个篇章的词序列难度很大,这使得主流的序列到序列模型很难直接使用。一种思路是采用能够处理超长序列的模型对篇章序列建模,比如,使用{\chapterfifteen}中提到的处理长序列的Transformer模型就是一种可行的解决方法\upcite{Kitaev2020ReformerTE}。不过,这类模型并不针对篇章级翻译的具体问题,因此并不是篇章级翻译中的主流方法。
\parinterval 现在常见的端到端做法还是从句子级翻译出发,通过额外的模块来对篇章中的上下文句子进行表示,然后提取相应的上下文信息并融入到当前句子的翻译过程中。形式上,篇章级翻译的建模方式如下:
\begin{eqnarray}
......@@ -524,7 +524,7 @@
\end{eqnarray}
其中,$\seq{X}$$\seq{Y}$分别为源语言篇章和目标语言篇章,$X_i$$Y_i$分别为源语言篇章和目标语言篇章中的某个句子,$T$表示篇章中句子的数目\footnote{为了简化问题,为了假设源语言端和目标语言段具有相同的句子数目$T$}$D_i$表示翻译第$i$个句子时所对应的上下文句子集合,理想情况下,$D_i$中包含源语言篇章和目标语言篇章中所有除第$i$句之外的句子,但考虑到不同的任务场景需求与模型的应用效率,篇章级神经机器翻译在建模的时候通常仅使用其中的一部分作为上下文句子输入。
\parinterval 上下文范围的选取是篇章级神经机器翻译需要着重考虑的问题,比如上下文句子的多少\upcite{agrawal2018contextual,DBLP:conf/emnlp/WerlenRPH18,DBLP:conf/naacl/MarufMH19},是否考虑目标端上下文句子\upcite{DBLP:conf/discomt/TiedemannS17,agrawal2018contextual}等。此外,不同的上下文范围也对应着不同的建模方法,接下来将对一些典型的方法进行介绍,包括改进输入\upcite{DBLP:conf/discomt/TiedemannS17,DBLP:conf/naacl/BawdenSBH18,DBLP:conf/wmt/GonzalesMS17,DBLP:journals/corr/abs-1910-07481}、多编码器模型\upcite{DBLP:journals/corr/JeanLFC17,DBLP:conf/acl/TitovSSV18,DBLP:conf/emnlp/ZhangLSZXZL18}、层次结构模型\upcite{DBLP:conf/emnlp/WangTWL17,DBLP:conf/emnlp/TanZXZ19,Werlen2018DocumentLevelNM}以及基于缓存的方法\upcite{DBLP:conf/coling/KuangXLZ18,DBLP:journals/tacl/TuLSZ18}
\parinterval 上下文范围的选取是篇章级神经机器翻译需要着重考虑的问题,比如上下文句子的多少\upcite{agrawal2018contextual,Werlen2018DocumentLevelNM,DBLP:conf/naacl/MarufMH19},是否考虑目标端上下文句子\upcite{DBLP:conf/discomt/TiedemannS17,agrawal2018contextual}等。此外,不同的上下文范围也对应着不同的建模方法,接下来将对一些典型的方法进行介绍,包括改进输入\upcite{DBLP:conf/discomt/TiedemannS17,DBLP:conf/naacl/BawdenSBH18,DBLP:conf/wmt/GonzalesMS17,DBLP:journals/corr/abs-1910-07481}、多编码器模型\upcite{DBLP:journals/corr/JeanLFC17,DBLP:journals/corr/abs-1805-10163,DBLP:conf/emnlp/ZhangLSZXZL18}、层次结构模型\upcite{DBLP:conf/emnlp/WangTWL17,DBLP:conf/emnlp/TanZXZ19,Werlen2018DocumentLevelNM}以及基于缓存的方法\upcite{DBLP:conf/coling/KuangXLZ18,DBLP:journals/tacl/TuLSZ18}
%----------------------------------------------------------------------------------------
% NEW SUBSUB-SECTION
......@@ -600,7 +600,7 @@
\parinterval 多编码器结构通过额外的编码器对前一句进行编码,但是当处理更多上下文句子的时候仍然面临效率低下的问题。为了捕捉更大范围的上下文,可以采用层次结构来对更多的上下文句子进行建模。层次结构是一种有效的序列表示方法,而且人类语言中天然就具有层次性,比如,句法树、篇章结构树等。类似的思想也成功的应用在基于树的句子级翻译模型中({\chaptereight}{\chapterfifteen})。
\parinterval\ref{fig:17-19}描述了一个基于层次注意力的模型结构\upcite{DBLP:conf/emnlp/WerlenRPH18}。首先通过翻译模型的编码器获取前$k$个句子的词序列编码表示$(\mathbi{h}^{\textrm{pre}1},\mathbi{h}^{\textrm{pre}2},\dots,\mathbi{h}^{\textrm{pre}k})$,然后针对前文每个句子的词序列编码表示$\mathbi{h}^{\textrm{pre}j}$,使用词级注意力提取句子级的上下文信息$\mathbi{s}^{j}$,然后在这$k$个句子级上下文信息$\mathbi{s}=(\mathbi{s}^1,\mathbi{s}^2,\dots,\mathbi{s}^k)$的基础上,使用句子级注意力提取篇章上下文信息$\mathbi{d}$。最终上下文信息$\mathbi{d}$的获取涉及到词级和句子级两个不同层次的注意力操作,因此将该过程称为层次注意力。实际上,这种方法并没有使用语言学的篇章层次结构。但是,句子级注意力在归纳统计意义上的篇章结构,因此这种方法也可以很好地捕捉不同句子之间的关系。
\parinterval\ref{fig:17-19}描述了一个基于层次注意力的模型结构\upcite{Werlen2018DocumentLevelNM}。首先通过翻译模型的编码器获取前$k$个句子的词序列编码表示$(\mathbi{h}^{\textrm{pre}1},\mathbi{h}^{\textrm{pre}2},\dots,\mathbi{h}^{\textrm{pre}k})$,然后针对前文每个句子的词序列编码表示$\mathbi{h}^{\textrm{pre}j}$,使用词级注意力提取句子级的上下文信息$\mathbi{s}^{j}$,然后在这$k$个句子级上下文信息$\mathbi{s}=(\mathbi{s}^1,\mathbi{s}^2,\dots,\mathbi{s}^k)$的基础上,使用句子级注意力提取篇章上下文信息$\mathbi{d}$。最终上下文信息$\mathbi{d}$的获取涉及到词级和句子级两个不同层次的注意力操作,因此将该过程称为层次注意力。实际上,这种方法并没有使用语言学的篇章层次结构。但是,句子级注意力在归纳统计意义上的篇章结构,因此这种方法也可以很好地捕捉不同句子之间的关系。
\parinterval 为了增强模型表示能力,层次注意力中并未直接使用当前句子第$t$个位置的编码表示$\mathbi{h}_{t}$作为查询,而是通过$f_w$$f_s$两个线性变换分别获取词级注意力和句子级注意力的查询$\mathbi{q}_{w}$$\mathbi{q}_{s}$,定义如下:
......@@ -687,7 +687,7 @@
\item 此外,语音翻译的一个重要应用是机器同声传译。机器同声传译的一个难点在于不同语言的文字顺序不同。这个问题导致了同声传译模型需要在翻译性能和实时性之间进行取舍。目前,同声传译的一种思路是基于目前已经说出的语音进行翻译\upcite{DBLP:conf/acl/MaHXZLZZHLLWW19},比如,等待源语$k$个词语,然后再进行翻译,同时改进束搜索方式来预测未来的词序列,从而提升准确度\upcite{DBLP:conf/emnlp/ZhengMZH19}。或者,对当前语音进行翻译,但需要判断翻译的词是否能够作为最终结果。如果是则不需要重新解码,否则将会根据之后的语音重新进行解码\upcite{DBLP:conf/naacl/DalviDSV18,DBLP:journals/corr/ChoE16}。第二种思路是动态预测当前时刻是应该继续等待还是开始翻译,这种方式更符合人类进行同传的思路。但是这种策略的难点在于标注每一时刻的决策状态十分耗时且标准难以统一,目前主流的方式是利用强化学习方法\upcite{DBLP:conf/eacl/NeubigCGL17,DBLP:conf/emnlp/GrissomHBMD14},对句子进行不同决策方案采样,最终学到最优的决策方案。此外,还有一些工作设计不同的学习策略\upcite{DBLP:conf/acl/ZhengLZMLH20,DBLP:conf/emnlp/ZhengZMH19,DBLP:conf/acl/ZhengZMH19}或改进注意力机制\upcite{DBLP:conf/acl/ArivazhaganCMCY19}以提升机器同声传译的性能。
\vspace{0.5em}
\item 在篇章级翻译方面,一些研究工作对这类模型的上下文建模能力进行了探索\upcite{DBLP:conf/discomt/KimTN19,DBLP:conf/acl/LiLWJXZLL20},发现模型性能在小数据集上的BLEU提升并不完全来自于上下文信息的利用。同时,受限于数据规模,篇章级翻译模型相对难以训练。一些研究人员通过调整训练策略来帮助模型更容易捕获上下文信息\upcite{DBLP:journals/corr/abs-1903-04715,DBLP:conf/acl/SaundersSB20,DBLP:conf/mtsummit/StojanovskiF19}。除了训练策略的调整,也可以使用数据增强和预训练的手段来缓解数据稀缺的问题\upcite{DBLP:conf/discomt/SugiyamaY19,DBLP:journals/corr/abs-1911-03110,DBLP:journals/tacl/LiuGGLEGLZ20}。此外,区别于传统的篇章级翻译,一些对话翻译也需要使用长距离上下文信息\upcite{DBLP:conf/wmt/MarufMH18}
\item 在篇章级翻译方面,一些研究工作对这类模型的上下文建模能力进行了探索\upcite{DBLP:conf/discomt/KimTN19,DBLP:conf/acl/LiLWJXZLL20},发现模型性能在小数据集上的BLEU提升并不完全来自于上下文信息的利用。同时,受限于数据规模,篇章级翻译模型相对难以训练。一些研究人员通过调整训练策略来帮助模型更容易捕获上下文信息\upcite{DBLP:journals/corr/abs-1903-04715,DBLP:conf/acl/SaundersSB20,DBLP:conf/mtsummit/StojanovskiF19}。除了训练策略的调整,也可以使用数据增强和预训练的手段来缓解数据稀缺的问题\upcite{DBLP:conf/discomt/SugiyamaY19,DBLP:journals/corr/abs-1911-03110,DBLP:journals/corr/abs-2001-08210}。此外,区别于传统的篇章级翻译,一些对话翻译也需要使用长距离上下文信息\upcite{DBLP:conf/wmt/MarufMH18}
\vspace{0.5em}
\item 此外,多模态机器翻译、图像描述、视觉问答等多模态任务受到广泛关注。如何将多个模态的信息充分融合,是研究多模态任务的重要问题。另外,数据稀缺是大多数多模态任务的瓶颈之一,可以采取数据增强\upcite{DBLP:conf/emnlp/GokhaleBBY20,DBLP:conf/eccv/Tang0ZWY20}的方式缓解。但是,这时仍需要回答在:模型没有充分训练时,图像等模态信息究竟在翻译里发挥了多少作用?类似的问题在篇章级机器翻译中也存在,上下文模型在训练数据量很小的时候对翻译的作用十分微弱\upcite{DBLP:conf/acl/LiLWJXZLL20}。此外,受到预训练模型的启发,在多模态领域,图像和文本联合预训练\upcite{DBLP:conf/eccv/Li0LZHZWH0WCG20,DBLP:conf/aaai/ZhouPZHCG20,DBLP:conf/iclr/SuZCLLWD20}的工作也相继开展。
......
......@@ -2400,15 +2400,6 @@ year = {2012}
pages = {260--269},
year = {1967}
}
@inproceedings{DBLP:conf/acl/OchN02,
author = {Franz Josef Och and
Hermann Ney},
title = {Discriminative Training and Maximum Entropy Models for Statistical
Machine Translation},
pages = {295--302},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2002}
}
@inproceedings{koehn2000estimating,
author = {Philipp Koehn and
Kevin Knight},
......@@ -3853,15 +3844,6 @@ year = {2012}
pages = {701--710},
year = {2014}
}
@inproceedings{2011Natural,
title={Natural Language Processing (almost) from Scratch},
author={ Collobert, Ronan and Weston, Jason and Bottou, Léon and Karlen, Michael and Kavukcuoglu, Koray and Kuksa, Pavel },
publisher={Journal of Machine Learning Research},
volume={12},
number={1},
pages={2493-2537},
year={2011}
}
@inproceedings{mccann2017learned,
author = {Bryan Mccann and
James Bradbury and
......@@ -3874,16 +3856,17 @@ year = {2012}
}
%%%%%%%%%%%%%%%%%%%%%%%神经语言模型,已检查修改%%%%%%%%%%%%%%%%%%%%%%%%%
@inproceedings{Peters2018DeepCW,
title={Deep contextualized word representations},
author={Matthew Peters and
author = {Matthew Peters and
Mark Neumann and
Mohit Iyyer and
Matt Gardner and
Christopher Clark and
Kenton Lee and
Luke Zettlemoyer},
publisher={Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
year={2018}
title = {Deep Contextualized Word Representations},
pages = {2227--2237},
publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{Graves2013HybridSR,
title={Hybrid speech recognition with Deep Bidirectional LSTM},
......@@ -4116,13 +4099,6 @@ year = {2012}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% chapter 10------------------------------------------------------
@inproceedings{vaswani2017attention,
title={Attention is All You Need},
author={Ashish {Vaswani} and Noam {Shazeer} and Niki {Parmar} and Jakob {Uszkoreit} and Llion {Jones} and Aidan N. {Gomez} and Lukasz {Kaiser} and Illia {Polosukhin}},
publisher={International Conference on Neural Information Processing},
pages={5998--6008},
year={2017}
}
@inproceedings{DBLP:conf/acl/LiLWJXZLL20,
author = {Bei Li and
Hui Liu and
......@@ -4679,15 +4655,6 @@ author = {Yoshua Bengio and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2015}
}
@inproceedings{DBLP:journals/corr/LuongPM15,
author = {Thang Luong and
Hieu Pham and
Christopher D. Manning},
title = {Effective Approaches to Attention-based Neural Machine Translation},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {1412--1421},
year = {2015}
}
@inproceedings{He2016ImprovedNM,
author = {Wei He and
Zhongjun He and
......@@ -4775,19 +4742,6 @@ author = {Yoshua Bengio and
pages = {21--37},
year = {2016}
}
@inproceedings{devlin-etal-2014-fast,
author = {Jacob Devlin and
Rabih Zbib and
Zhongqiang Huang and
Thomas Lamar and
Richard M. Schwartz and
John Makhoul},
title = {Fast and Robust Neural Network Joint Models for Statistical Machine
Translation},
pages = {1370--1380},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2014}
}
@inproceedings{DBLP:conf/acl/WangLLJL15,
author = {Mingxuan Wang and
Zhengdong Lu and
......@@ -4818,15 +4772,6 @@ author = {Yoshua Bengio and
publisher = {International Conference on Acoustics, Speech and Signal Processing},
year = {2013}
}
@inproceedings{DBLP:journals/corr/LuongPM15,
author = {Thang Luong and
Hieu Pham and
Christopher D. Manning},
title = {Effective Approaches to Attention-based Neural Machine Translation},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {1412--1421},
year = {2015}
}
@inproceedings{DBLP:conf/acl-codeswitch/WangCK18,
author = {Changhan Wang and
Kyunghyun Cho and
......@@ -4879,15 +4824,6 @@ author = {Yoshua Bengio and
publisher = {Springer},
year = {2017}
}
@inproceedings{2011Natural,
title={Natural Language Processing (almost) from Scratch},
author={ Collobert, Ronan and Weston, Jason and Bottou, Léon and Karlen, Michael and Kavukcuoglu, Koray and Kuksa, Pavel },
publisher={Journal of Machine Learning Research},
volume={12},
number={1},
pages={2493-2537},
year={2011},
}
@inproceedings{DBLP:conf/acl/NguyenG15,
author = {Thien Huu Nguyen and
Ralph Grishman},
......@@ -4943,14 +4879,6 @@ author = {Yoshua Bengio and
publisher = {Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2015}
}
@inproceedings{StahlbergNeural,
title={Neural Machine Translation: A Review},
author={Felix Stahlberg},
publisher={Journal of Artificial Intelligence Research},
year={2020},
volume={69},
pages={343-418}
}
@inproceedings{Sennrich2016ImprovingNM,
author = {Rico Sennrich and
Barry Haddow and
......@@ -4959,14 +4887,6 @@ author = {Yoshua Bengio and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{bahdanau2014neural,
author = {Dzmitry Bahdanau and
Kyunghyun Cho and
Yoshua Bengio},
title = {Neural Machine Translation by Jointly Learning to Align and Translate},
publisher = {International Conference on Learning Representations},
year = {2015}
}
@inproceedings{Waibel1989PhonemeRU,
title={Phoneme recognition using time-delay neural networks},
author={Alexander Waibel and Toshiyuki Hanazawa and Geoffrey Hinton and Kiyohiro Shikano and Kevin J. Lang},
pages = {770--778},
year = {2016}
}
@inproceedings{DBLP:conf/cvpr/HuangLMW17,
author = {Gao Huang and
Zhuang Liu and
Laurens van der Maaten and
Kilian Q. Weinberger},
title = {Densely Connected Convolutional Networks},
pages = {2261--2269},
publisher = {{IEEE} Conference on Computer Vision and Pattern Recognition},
year = {2017}
}
@inproceedings{Girshick2015FastR,
title={Fast R-CNN},
author={Ross Girshick},
publisher={International Conference on Computer Vision},
year={2015},
pages={1440--1448}
}
@inproceedings{He2020MaskR,
title={Mask R-CNN},
author={Kaiming He and Georgia Gkioxari and Piotr Doll{\'a}r and Ross B. Girshick},
publisher={International Conference on Computer Vision},
pages={2961--2969},
year={2017}
}
@inproceedings{Kalchbrenner2014ACN,
title={A Convolutional Neural Network for Modelling Sentences},
author={Nal Kalchbrenner and Edward Grefenstette and Phil Blunsom},
pages = {123--135},
year={2017}
}
@inproceedings{DBLP:journals/corr/GehringAGYD17,
author = {Jonas Gehring and
Michael Auli and
David Grangier and
Denis Yarats and
Yann N. Dauphin},
title = {Convolutional Sequence to Sequence Learning},
publisher = {International Conference on Machine Learning},
volume = {70},
pages = {1243--1252},
year = {2017}
}
@inproceedings{Kaiser2018DepthwiseSC,
title={Depthwise Separable Convolutions for Neural Machine Translation},
author = {Lukasz Kaiser and
publisher = {International Conference on Learning Representations},
year = {2019}
}
@inproceedings{kalchbrenner-blunsom-2013-recurrent,
author = {Nal Kalchbrenner and
Phil Blunsom},
title = {Recurrent Continuous Translation Models},
pages = {1700--1709},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2013}
}
@inproceedings{DBLP:journals/corr/HeZRS15,
author = {Kaiming He and
Xiangyu Zhang and
Shaoqing Ren and
Jian Sun},
title = {Deep Residual Learning for Image Recognition},
publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {770--778},
year = {2016},
}
@inproceedings{Sukhbaatar2015EndToEndMN,
title={End-To-End Memory Networks},
author = {Sainbayar Sukhbaatar and
volume = {abs/1802.05751},
year = {2018}
}
@inproceedings{vaswani2017attention,
title={Attention is All You Need},
author={Ashish {Vaswani} and Noam {Shazeer} and Niki {Parmar} and Jakob {Uszkoreit} and Llion {Jones} and Aidan N. {Gomez} and Lukasz {Kaiser} and Illia {Polosukhin}},
publisher={Annual Conference on Neural Information Processing Systems},
pages={5998--6008},
year={2017}
}
@inproceedings{DBLP:conf/iclr/WuLLLH20,
author = {Zhanghao Wu and
Zhijian Liu and
pages = {464--468},
year = {2018},
}
@inproceedings{JMLR:v15:srivastava14a,
author = {Nitish Srivastava and Geoffrey Hinton and Alex Krizhevsky and Ilya Sutskever and Ruslan Salakhutdinov},
title = {Dropout: A Simple Way to Prevent Neural Networks from Overfitting},
publisher = {Journal of Machine Learning Research},
year = {2014},
volume = {15},
pages = {1929--1958},
}
@inproceedings{Szegedy_2016_CVPR,
author = {Christian Szegedy and
Vincent Vanhoucke and
volume = {abs/1602.02830},
year = {2016},
}
@inproceedings{Wu2019PayLA,
author = {Felix Wu and
Angela Fan and
Alexei Baevski and
Yann N. Dauphin and
Michael Auli},
title = {Pay Less Attention with Lightweight and Dynamic Convolutions},
publisher = {International Conference on Learning Representations},
year = {2019},
}
@inproceedings{dai-etal-2019-transformer,
author = {Zihang Dai and
Zhilin Yang and
Yiming Yang and
Jaime G. Carbonell and
Quoc Viet Le and
Ruslan Salakhutdinov},
title = {Transformer-XL: Attentive Language Models beyond a Fixed-Length Context},
pages = {2978--2988},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{Liu2020LearningTE,
title={Learning to Encode Position for Transformer with Continuous Dynamical Model},
author={Xuanqing Liu and Hsiang-Fu Yu and Inderjit Dhillon and Cho-Jui Hsieh},
publisher = {IEEE International Conference on Acoustics, Speech and Signal Processing},
year = {2012}
}
@inproceedings{DBLP:conf/amta/MullerRS20,
author = {Mathias M{\"{u}}ller and
Annette Rios and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/acl/LiLWJXZLL20,
author = {Bei Li and
Hui Liu and
Ziyang Wang and
Yufan Jiang and
Tong Xiao and
Jingbo Zhu and
Tongran Liu and
Changliang Li},
title = {Does Multi-Encoder Help? {A} Case Study on Context-Aware Neural Machine
Translation},
pages = {3512--3518},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
@techreport{chen1999gaussian,
title={A Gaussian prior for smoothing maximum entropy models},
author={Chen, Stanley F and Rosenfeld, Ronald},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2018}
}
@inproceedings{DBLP:conf/icassp/SchusterN12,
author = {Mike Schuster and
Kaisuke Nakajima},
title = {Japanese and Korean voice search},
pages = {5149--5152},
publisher = {IEEE International Conference on Acoustics, Speech and Signal Processing},
year = {2012}
}
@inproceedings{kudo2018sentencepiece,
title={SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing},
author={Taku {Kudo} and John {Richardson}},
publisher = {Conference on Computational Learning Theory},
year = {1992}
}
@book{mitchell1996m,
  title={Machine Learning},
  author={Mitchell, Tom},
  publisher={McGraw-Hill},
  year={1996}
}
@inproceedings{DBLP:conf/icml/AbeM98,
author = {Naoki Abe and
Hiroshi Mamitsuka},
publisher = {{IEEE} Conference on Computer Vision and Pattern Recognition},
year = {2005}
}
@inproceedings{726791,
  author={Yann LeCun and L{\'e}on Bottou and Yoshua Bengio and Patrick Haffner},
  publisher={Proceedings of the IEEE},
  title={Gradient-based learning applied to document recognition},
  year={1998},
  volume={86},
  number={11},
  pages={2278--2324}
}
@book{atkinson2007optimum,
title={Optimum experimental designs, with SAS},
author={Atkinson, Anthony and Donev, Alexander and Tobias, Randall and others},
publisher = {{IEEE} Winter Conference on Applications of Computer Vision},
year = {2020}
}
@inproceedings{DBLP:conf/acl/JeanCMB15,
author = {S{\'{e}}bastien Jean and
KyungHyun Cho and
Roland Memisevic and
Yoshua Bengio},
title = {On Using Very Large Target Vocabulary for Neural Machine Translation},
pages = {1--10},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2015}
}
@inproceedings{2015OnGulcehre,
title = {On Using Monolingual Corpora in Neural Machine Translation},
author = {Gulcehre Caglar and
publisher = {Computer Science},
year = {2015},
}
@inproceedings{Sennrich2016ImprovingNM,
author = {Rico Sennrich and
Barry Haddow and
Alexandra Birch},
title = {Improving Neural Machine Translation Models with Monolingual Data},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{DBLP:conf/aaai/Zhang0LZC18,
author = {Zhirui Zhang and
Shujie Liu and
pages = {1171--1179},
year = {2015}
}
@inproceedings{Bengio2015ScheduledSF,
title={Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks},
author={Samy Bengio and
Oriol Vinyals and
Navdeep Jaitly and
Noam Shazeer},
publisher = {Annual Conference on Neural Information Processing Systems},
pages = {1171--1179},
year = {2015}
}
pages = {2672--2680},
year = {2014}
}
@inproceedings{DBLP:conf/acl/ShenCHHWSL16,
author = {Shiqi Shen and
Yong Cheng and
Zhongjun He and
Wei He and
Hua Wu and
Maosong Sun and
Yang Liu},
title = {Minimum Risk Training for Neural Machine Translation},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016},
}
@inproceedings{DBLP:conf/acl/PapineniRWZ02,
author = {Kishore Papineni and
Salim Roukos and
Todd Ward and
Wei-jing Zhu},
title = {Bleu: a Method for Automatic Evaluation of Machine Translation},
pages = {311--318},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2002}
}
@inproceedings{doddington2002automatic,
title={Automatic evaluation of machine translation quality using n-gram co-occurrence statistics},
publisher={Proceedings of the Second International Conference on Human Language Technology Research},
author={Doddington, George},
pages={138--145},
year={2002}
}
@inproceedings{snover2006study,
title={A study of translation edit rate with targeted human annotation},
author={Snover, Matthew and Dorr, Bonnie and Schwartz, Richard and Micciulla, Linnea and Makhoul, John},
publisher={Association for Machine Translation in the Americas},
volume={200},
number={6},
year={2006}
}
@inproceedings{lavie2009meteor,
title={The METEOR metric for automatic evaluation of machine translation},
author={Lavie, Alon and Denkowski, Michael J},
pages={105--115},
year={2009}
}
@inproceedings{koehn2003statistical,
author = {Philipp Koehn and
Franz Josef Och and
Daniel Marcu},
title = {Statistical Phrase-Based Translation},
publisher = {Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2003}
}
@inproceedings{smith2006minimum,
author = {David A. Smith and
Jason Eisner},
title = {Minimum Risk Annealing for Training Log-Linear Models},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2006}
}
@inproceedings{he2012maximum,
title={Maximum Expected {BLEU} Training of Phrase and Lexicon Translation Models},
author={He, Xiaodong and Deng, Li},
publisher={Annual Meeting of the Association for Computational Linguistics},
pages={292--301},
year={2012}
}
@inproceedings{DBLP:conf/acl/GaoHYD14,
author = {Jianfeng Gao and
Xiaodong He and
volume = {abs/2002.11794},
year = {2020}
}
@inproceedings{kim-rush-2016-sequence,
author = {Yoon Kim and
Alexander M. Rush},
title = {Sequence-Level Knowledge Distillation},
pages = {1317--1327},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2016}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% chapter 14------------------------------------------------------
@inproceedings{Koehn2007Moses,
author = {Philipp Koehn and
Hieu Hoang and
Alexandra Birch and
Chris Callison-Burch and
Marcello Federico and
Nicola Bertoldi and
Brooke Cowan and
Wade Shen and
Christine Moran and
Richard Zens and
Chris Dyer and
Ondrej Bojar and
Alexandra Constantin and
Evan Herbst},
title = {Moses: Open Source Toolkit for Statistical Machine Translation},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2007}
}
@inproceedings{DBLP:conf/amta/Koehn04,
author = {Philipp Koehn},
title = {Pharaoh: {A} Beam Search Decoder for Phrase-Based Statistical Machine
Translation Models},
volume = {3265},
pages = {115--124},
publisher = { Association for Machine Translation in the Americas},
year = {2004}
}
@inproceedings{DBLP:conf/emnlp/StahlbergHSB17,
author = {Felix Stahlberg and
Eva Hasler and
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2016}
}
@inproceedings{DBLP:conf/emnlp/HuangZM17,
author = {Liang Huang and
Kai Zhao and
Mingbo Ma},
title = {When to Finish? Optimal Beam Search for Neural Text Generation (modulo
beam size)},
pages = {2134--2139},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2017}
}
@inproceedings{Wiseman2016SequencetoSequenceLA,
title={Sequence-to-Sequence Learning as Beam-Search Optimization},
author={Sam Wiseman and Alexander M. Rush},
publisher={Conference on Empirical Methods in Natural Language Processing},
pages={1296--1306},
year={2016}
}
@inproceedings{DBLP:conf/emnlp/Yang0M18,
author = {Yilin Yang and
Liang Huang and
Mingbo Ma},
title = {Breaking the Beam Search Curse: {A} Study of (Re-)Scoring Methods
and Stopping Criteria for Neural Machine Translation},
pages = {3054--3059},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2018}
}
@inproceedings{Ma2019LearningTS,
title={Learning to Stop in Structured Prediction for Neural Machine Translation},
author={Mingbo Ma and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{Jiang2012LearnedPF,
title={Learned Prioritization for Trading Off Accuracy and Speed},
author={Jiarong Jiang and Adam R. Teichert and Hal Daum{\'e} and Jason Eisner},
publisher={Annual Meeting of the Association for Computational Linguistics},
year={2017}
}
@inproceedings{Ranzato2016SequenceLT,
title={Sequence Level Training with Recurrent Neural Networks},
author={Marc'Aurelio Ranzato and
Sumit Chopra and
Michael Auli and
Wojciech Zaremba},
publisher={International Conference on Learning Representations},
year={2016}
}
@inproceedings{Zhang2019BridgingTG,
title={Bridging the Gap between Training and Inference for Neural Machine Translation},
author={Wen Zhang and Yang Feng and Fandong Meng and Di You and Qun Liu},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/acl/SennrichHB16a,
author = {Rico Sennrich and
Barry Haddow and
Alexandra Birch},
title = {Neural Machine Translation of Rare Words with Subword Units},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016},
}
@inproceedings{DBLP:conf/emnlp/ZensSX12,
author = {Richard Zens and
Daisy Stanton and
Peng Xu},
title = {A Systematic Comparison of Phrase Table Pruning Techniques},
pages = {972--983},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2012}
}
@inproceedings{DBLP:conf/emnlp/JohnsonMFK07,
author = {Howard Johnson and
Joel D. Martin and
George F. Foster and
Roland Kuhn},
title = {Improving Translation Quality by Discarding Most of the Phrasetable},
pages = {967--975},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2007}
}
@inproceedings{DBLP:conf/emnlp/LingGTB12,
author = {Wang Ling and
Jo{\~{a}}o Gra{\c{c}}a and
Isabel Trancoso and
Alan W. Black},
title = {Entropy-based Pruning for Phrase-based Machine Translation},
pages = {962--971},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2012}
}
@inproceedings{Narang2017BlockSparseRN,
title={Block-Sparse Recurrent Neural Networks},
author={Sharan Narang and Eric Undersander and Gregory Diamos},
author = {Paul Michel and
Omer Levy and
Graham Neubig},
title = {Are Sixteen Heads Really Better than One?},
publisher = {Annual Conference on Neural Information Processing Systems},
pages = {14014--14024},
year = {2019}
}
@inproceedings{DBLP:journals/corr/abs-1905-09418,
author = {Elena Voita and
David Talbot and
Fedor Moiseev and
Rico Sennrich and
Ivan Titov},
title = {Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy
Lifting, the Rest Can Be Pruned},
pages = {5797--5808},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019},
}
@inproceedings{Kitaev2020ReformerTE,
author = {Nikita Kitaev and
Lukasz Kaiser and
Anselm Levskaya},
title = {Reformer: The Efficient Transformer},
publisher = {International Conference on Learning Representations},
year = {2020}
}
@inproceedings{Katharopoulos2020TransformersAR,
title={Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention},
author={Angelos Katharopoulos and Apoorv Vyas and Nikolaos Pappas and Franccois Fleuret},
year={2020},
volume={abs/2006.16236}
}
@inproceedings{xiao2011language,
title ={Language Modeling for Syntax-Based Machine Translation Using Tree Substitution Grammars: A Case Study on Chinese-English Translation},
author ={Xiao, Tong and Zhu, Jingbo and Zhu, Muhua},
volume ={10},
number ={4},
pages ={1--29},
year ={2011},
publisher ={ACM Transactions on Asian Language Information Processing (TALIP)}
}
@inproceedings{Li2009VariationalDF,
title={Variational Decoding for Statistical Machine Translation},
author={Zhifei Li and
pages={5488--5495},
year={2018}
}
@inproceedings{Wei2019ImitationLF,
title={Imitation Learning for Non-Autoregressive Neural Machine Translation},
author={Bingzhen Wei and Mingxuan Wang and Hao Zhou and Junyang Lin and Xu Sun},
volume={18},
pages={187:1-187:30}
}
@inproceedings{DBLP:journals/corr/HintonVD15,
author = {Geoffrey E. Hinton and
Oriol Vinyals and
Jeffrey Dean},
title = {Distilling the Knowledge in a Neural Network},
publisher = {CoRR},
volume = {abs/1503.02531},
year = {2015}
}
@inproceedings{Munim2019SequencelevelKD,
title={Sequence-level Knowledge Distillation for Model Compression of Attention-based Sequence-to-sequence Speech Recognition},
author={Raden Mu'az Mun'im and Nakamasa Inoue and Koichi Shinoda},
volume = {abs/1903.12136},
year = {2019}
}
@inproceedings{Jiao2020TinyBERTDB,
author = {Xiaoqi Jiao and
Yichun Yin and
Lifeng Shang and
Xin Jiang and
Xiao Chen and
Linlin Li and
Fang Wang and
Qun Liu},
title = {TinyBERT: Distilling {BERT} for Natural Language Understanding},
pages = {4163--4174},
publisher={Conference on Empirical Methods in Natural Language Processing},
year={2020}
}
@inproceedings{Ghazvininejad2020AlignedCE,
author = {Marjan Ghazvininejad and
Vladimir Karpukhin and
volume = {abs/1911.02215},
year = {2019}
}
@inproceedings{Gu2017NonAutoregressiveNM,
author = {Jiatao Gu and
James Bradbury and
Caiming Xiong and
Victor O. K. Li and
Richard Socher},
title = {Non-Autoregressive Neural Machine Translation},
publisher = {International Conference on Learning Representations},
year = {2018}
}
@inproceedings{Zhou2020UnderstandingKD,
title={Understanding Knowledge Distillation in Non-autoregressive Machine Translation},
author={Chunting Zhou and Graham Neubig and Jiatao Gu},
volume={7},
pages={91-105}
}
@inproceedings{devlin2019bert,
  title={{BERT}: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova},
  year={2019},
  pages = {4171--4186},
  publisher = {Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
}
@inproceedings{Feng2016ImprovingAM,
title={Improving Attention Modeling with Implicit Distortion and Fertility for Machine Translation},
author={Shi Feng and Shujie Liu and Nan Yang and Mu Li and Ming Zhou and Kenny Q. Zhu},
pages={3082--3092},
year={2016}
}
@inproceedings{TuModeling,
author = {Zhaopeng Tu and
Zhengdong Lu and
Yang Liu and
Xiaohua Liu and
Hang Li},
title = {Modeling Coverage for Neural Machine Translation},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{Wu2016GooglesNM,
title={Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation},
author = {Yonghui Wu and
Mike Schuster and
Zhifeng Chen and
Quoc V. Le and
Mohammad Norouzi and
Wolfgang Macherey and
Maxim Krikun and
Yuan Cao and
Qin Gao and
Klaus Macherey and
Jeff Klingner and
Apurva Shah and
Melvin Johnson and
Xiaobing Liu and
Lukasz Kaiser and
Stephan Gouws and
Yoshikiyo Kato and
Taku Kudo and
Hideto Kazawa and
Keith Stevens and
George Kurian and
Nishant Patil and
Wei Wang and
Cliff Young and
Jason Smith and
Jason Riesa and
Alex Rudnick and
Oriol Vinyals and
Greg Corrado and
Macduff Hughes and
Jeffrey Dean},
publisher = {CoRR},
year={2016},
volume={abs/1609.08144}
}
@inproceedings{li-etal-2018-simple,
author = {Yanyang Li and
Tong Xiao and
Yinqiao Li and
Qiang Wang and
Changming Xu and
Jingbo Zhu},
title = {A Simple and Effective Approach to Coverage-Aware Neural Machine Translation},
publisher = {Annual Meeting of the Association for Computational Linguistics},
pages = {292--297},
year = {2018}
}
@inproceedings{Peris2017InteractiveNM,
title={Interactive neural machine translation},
author={{\'A}lvaro Peris and Miguel Domingo and F. Casacuberta},
volume={45},
pages={201-220}
}
@inproceedings{Peris2018ActiveLF,
title={Active Learning for Interactive Neural Machine Translation of Data Streams},
author={{\'A}lvaro Peris and Francisco Casacuberta},
publisher={The SIGNLL Conference on Computational Natural Language Learning},
pages={151--160},
year={2018}
}
@inproceedings{Xiao2016ALA,
title={A Loss-Augmented Approach to Training Syntactic Machine Translation Systems},
author={Tong Xiao and Derek F. Wong and Jingbo Zhu},
volume={24},
pages={2069-2083}
}
@inproceedings{61115,
author={Jianhua Lin},
publisher={IEEE Transactions on Information Theory},
publisher = { AAAI Conference on Artificial Intelligence},
year = {2019}
}
@inproceedings{DBLP:journals/corr/abs-1805-00631,
author = {Biao Zhang and
Deyi Xiong and
Jinsong Su},
title = {Accelerating Neural Transformer via an Average Attention Network},
publisher = {Annual Meeting of the Association for Computational Linguistics},
pages = {1789--1798},
year = {2018},
}
@inproceedings{Xiao2019SharingAW,
author = {Tong Xiao and
Yinqiao Li and
Jingbo Zhu and
Zhengtao Yu and
Tongran Liu},
title = {Sharing Attention Weights for Fast Transformer},
publisher = {International Joint Conference on Artificial Intelligence},
pages = {5292--5298},
year = {2019}
}
@inproceedings{Chen2018TheBO,
author = {Mia Xu Chen and
Orhan Firat and
Ankur Bapna and
Melvin Johnson and
Wolfgang Macherey and
George F. Foster and
Llion Jones and
Mike Schuster and
Noam Shazeer and
Niki Parmar and
Ashish Vaswani and
Jakob Uszkoreit and
Lukasz Kaiser and
Zhifeng Chen and
Yonghui Wu and
Macduff Hughes},
title = {The Best of Both Worlds: Combining Recent Advances in Neural Machine
Translation},
pages = {76--86},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:journals/corr/abs-1906-00532,
author = {Aishwarya Bhandare and
Vamsi Sripathi and
Deepthi Karkada and
Vivek Menon and
Sun Choi and
Kushal Datta and
Vikram Saletore},
title = {Efficient 8-Bit Quantization of Transformer Neural Machine Language
Translation Model},
publisher = {CoRR},
volume = {abs/1906.00532},
year = {2019}
}
@inproceedings{DBLP:conf/cvpr/JacobKCZTHAK18,
author = {Benoit Jacob and
Skirmantas Kligys and
volume = {abs/1611.08562},
year = {2016}
}
@inproceedings{xiao2013bagging,
title ={Bagging and boosting statistical machine translation systems},
author ={Tong Xiao and Jingbo Zhu and Tongran Liu },
publisher ={Artificial Intelligence},
volume ={195},
pages ={496--527},
year ={2013}
}
@inproceedings{DBLP:conf/emnlp/TrombleKOM08,
author = {Roy Tromble and
Shankar Kumar and
pages = {3302--3308},
year = {2017}
}
@inproceedings{Shaw2018SelfAttentionWR,
author = {Peter Shaw and
Jakob Uszkoreit and
Ashish Vaswani},
title = {Self-Attention with Relative Position Representations},
publisher = {Proceedings of the Human Language Technology Conference of
the North American Chapter of the Association for Computational Linguistics},
pages = {464--468},
year = {2018}
}
@inproceedings{WangLearning,
author = {Qiang Wang and
Bei Li and
Tong Xiao and
Jingbo Zhu and
Changliang Li and
Derek F. Wong and
Lidia S. Chao},
title = {Learning Deep Transformer Models for Machine Translation},
pages = {1810--1822},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/iclr/FanGJ20,
author = {Angela Fan and
Edouard Grave and
volume = {abs/2010.02416},
year = {2020}
}
@inproceedings{Vaswani2018Tensor2TensorFN,
author = {Ashish Vaswani and
Samy Bengio and
Eugene Brevdo and
Fran{\c{c}}ois Chollet and
Aidan N. Gomez and
Stephan Gouws and
Llion Jones and
Lukasz Kaiser and
Nal Kalchbrenner and
Niki Parmar and
Ryan Sepassi and
Noam Shazeer and
Jakob Uszkoreit},
title = {Tensor2Tensor for Neural Machine Translation},
pages = {193--199},
publisher = {Association for Machine Translation in the Americas},
year = {2018}
}
@inproceedings{Sun2019BaiduNM,
title={Baidu Neural Machine Translation Systems for WMT19},
author = {Meng Sun and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{KleinOpenNMT,
author = {Guillaume Klein and
Yoon Kim and
Yuntian Deng and
Jean Senellart and
Alexander M. Rush},
title = {OpenNMT: Open-Source Toolkit for Neural Machine Translation},
pages = {67--72},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:conf/acl/WuWXTGQLL19,
author = {Lijun Wu and
Yiren Wang and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:journals/corr/GreffSS16,
author = {Klaus Greff and
Rupesh Kumar Srivastava and
publisher = {International Conference on Learning Representations},
year = {2017}
}
@inproceedings{Bapna2018TrainingDN,
author = {Ankur Bapna and
Mia Xu Chen and
Orhan Firat and
Yuan Cao and
Yonghui Wu},
title = {Training Deeper Neural Machine Translation Models with Transparent
Attention},
pages = {3028--3033},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2018}
}
@inproceedings{DBLP:journals/corr/abs-2002-04745,
author = {Ruibin Xiong and
Yunchang Yang and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
@inproceedings{Ba2016LayerN,
author = {Lei Jimmy Ba and
Jamie Ryan Kiros and
Geoffrey E. Hinton},
title = {Layer Normalization},
publisher = {CoRR},
volume = {abs/1607.06450},
year = {2016}
}
@inproceedings{Vaswani2018Tensor2TensorFN,
author = {Ashish Vaswani and
Samy Bengio and
Eugene Brevdo and
Fran{\c{c}}ois Chollet and
Aidan N. Gomez and
Stephan Gouws and
Llion Jones and
Lukasz Kaiser and
Nal Kalchbrenner and
Niki Parmar and
Ryan Sepassi and
Noam Shazeer and
Jakob Uszkoreit},
title = {Tensor2Tensor for Neural Machine Translation},
pages = {193--199},
publisher = {Association for Machine Translation in the Americas},
year = {2018}
}
@inproceedings{Dou2019DynamicLA,
author = {Zi-Yi Dou and
Zhaopeng Tu and
Xing Wang and
Longyue Wang and
Shuming Shi and
Tong Zhang},
title = {Dynamic Layer Aggregation for Neural Machine Translation with Routing-by-Agreement},
pages = {86--93},
publisher = {AAAI Conference on Artificial Intelligence},
year = {2019}
}
@inproceedings{Wang2018MultilayerRF,
title={Multi-layer Representation Fusion for Neural Machine Translation},
author={Qiang Wang and Fuxue Li and Tong Xiao and Yanyang Li and Yinqiao Li and Jingbo Zhu},
publisher={International Conference on Computational Linguistics},
year={2018}
}
@inproceedings{Dou2018ExploitingDR,
author = {Zi-Yi Dou and
Zhaopeng Tu and
Xing Wang and
Shuming Shi and
Tong Zhang},
title = {Exploiting Deep Representations for Neural Machine Translation},
pages = {4253--4262},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:journals/corr/LinFSYXZB17,
author = {Zhouhan Lin and
Minwei Feng and
C{\'{\i}}cero Nogueira dos Santos and
Mo Yu and
Bing Xiang and
Bowen Zhou and
Yoshua Bengio},
title = {A Structured Self-Attentive Sentence Embedding},
publisher = {International Conference on Learning Representations},
year = {2017},
}
@inproceedings{DBLP:conf/nips/SrivastavaGS15,
author = {Rupesh Kumar Srivastava and
Klaus Greff and
......@@ -8830,15 +7960,6 @@ author = {Zhuang Liu and
pages = {1675--1685},
year = {2019}
}
@inproceedings{pmlr-v9-glorot10a,
author = {Xavier Glorot and
Yoshua Bengio},
title = {Understanding the difficulty of training deep feedforward neural networks},
publisher = {International Conference on Artificial Intelligence and Statistics},
volume = {9},
pages = {249--256},
year = {2010}
}
@inproceedings{DBLP:conf/iccv/HeZRS15,
author = {Kaiming He and
Xiangyu Zhang and
......@@ -9269,13 +8390,6 @@ author = {Zhuang Liu and
volume = {abs/2003.03384},
year = {2020}
}
@inproceedings{Chollet2017XceptionDL,
title={Xception: Deep Learning with Depthwise Separable Convolutions},
author = {Fran{\c{c}}ois Chollet},
publisher={IEEE Conference on Computer Vision and Pattern Recognition},
year={2017},
pages={1800--1807}
}
@inproceedings{DBLP:journals/tnn/AngelineSP94,
author = {Peter J. Angeline and
Gregory M. Saunders and
......@@ -9523,20 +8637,6 @@ author = {Zhuang Liu and
volume = {abs/2009.02070},
year = {2020}
}
@inproceedings{DBLP:conf/acl/WangWLCZGH20,
author = {Hanrui Wang and
Zhanghao Wu and
Zhijian Liu and
Han Cai and
Ligeng Zhu and
Chuang Gan and
Song Han},
title = {{HAT:} Hardware-Aware Transformers for Efficient Natural Language
Processing},
pages = {7675--7688},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
@inproceedings{DBLP:journals/corr/abs-2008-06808,
author = {Henry Tsai and
Jayden Ooi and
......@@ -9549,24 +8649,6 @@ author = {Zhuang Liu and
volume = {abs/2008.06808},
year = {2020}
}
@inproceedings{Wang2019ExploitingSC,
title={Exploiting Sentential Context for Neural Machine Translation},
author={Xing Wang and Zhaopeng Tu and Longyue Wang and Shuming Shi},
publisher={Annual Meeting of the Association for Computational Linguistics},
year={2019}
}
@inproceedings{Wei2020MultiscaleCD,
title={Multiscale Collaborative Deep Models for Neural Machine Translation},
author={Xiangpeng Wei and Heng Yu and Yue Hu and Yue Zhang and Rongxiang Weng and Weihua Luo},
publisher={Annual Meeting of the Association for Computational Linguistics},
year={2020}
}
@inproceedings{li2020shallow,
title={Shallow-to-Deep Training for Neural Machine Translation},
author={Li, Bei and Wang, Ziyang and Liu, Hui and Jiang, Yufan and Du, Quan and Xiao, Tong and Wang, Huizhen and Zhu, Jingbo},
publisher={Conference on Empirical Methods in Natural Language Processing},
year={2020}
}
@inproceedings{DBLP:journals/corr/abs-2007-06257,
author = {Hongfei Xu and
Qiuhui Liu and
......@@ -9588,18 +8670,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
@inproceedings{DBLP:journals/corr/abs-2006-10369,
author = {Jungo Kasai and
Nikolaos Pappas and
Hao Peng and
James Cross and
Noah A. Smith},
title = {Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff
in Machine Translation},
publisher = {CoRR},
volume = {abs/2006.10369},
year = {2020}
}
@inproceedings{DBLP:journals/corr/abs-1806-01261,
author = {Peter W. Battaglia and
Jessica B. Hamrick and
......@@ -9633,34 +8703,6 @@ author = {Zhuang Liu and
volume = {abs/1806.01261},
year = {2018}
}
@inproceedings{Shaw2018SelfAttentionWR,
author = {Peter Shaw and
Jakob Uszkoreit and
Ashish Vaswani},
title = {Self-Attention with Relative Position Representations},
publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
pages = {464--468},
year = {2018},
}
@inproceedings{Dai2019TransformerXLAL,
author = {Zihang Dai and
Zhilin Yang and
Yiming Yang and
Jaime G. Carbonell and
Quoc V. Le and
Ruslan Salakhutdinov},
title = {Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context},
publisher = {Annual Meeting of the Association for Computational Linguistics},
pages = {2978--2988},
year = {2019}
}
@inproceedings{vaswani2017attention,
title={Attention is All You Need},
author={Ashish {Vaswani} and Noam {Shazeer} and Niki {Parmar} and Jakob {Uszkoreit} and Llion {Jones} and Aidan N. {Gomez} and Lukasz {Kaiser} and Illia {Polosukhin}},
publisher={International Conference on Neural Information Processing},
pages={5998--6008},
year={2017}
}
@inproceedings{DBLP:conf/acl/LiXTZZZ17,
author = {Junhui Li and
Deyi Xiong and
......@@ -9681,18 +8723,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{Yang2017TowardsBH,
author = {Baosong Yang and
Derek F. Wong and
Tong Xiao and
Lidia S. Chao and
Jingbo Zhu},
title = {Towards Bidirectional Hierarchical Representations for Attention-based
Neural Machine Translation},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {1432--1441},
year = {2017}
}
@inproceedings{DBLP:conf/acl/ChenHCC17,
author = {Huadong Chen and
Shujian Huang and
......@@ -9704,16 +8734,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{TuModeling,
author = {Zhaopeng Tu and
Zhengdong Lu and
Yang Liu and
Xiaohua Liu and
Hang Li},
title = {Modeling Coverage for Neural Machine Translation},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{DBLP:conf/wmt/SennrichH16,
author = {Rico Sennrich and
Barry Haddow},
......@@ -9739,13 +8759,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
@inproceedings{Aharoni2017TowardsSN,
title={Towards String-To-Tree Neural Machine Translation},
author={Roee Aharoni and
Yoav Goldberg},
publisher={Annual Meeting of the Association for Computational Linguistics},
year={2017}
}
@inproceedings{DBLP:conf/iclr/Alvarez-MelisJ17,
author = {David Alvarez-Melis and
Tommi S. Jaakkola},
......@@ -9763,13 +8776,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@book{aho1972theory,
author = {Aho, Alfred V and
Ullman, Jeffrey D},
title = {The Theory of Parsing, Translation, and Compiling},
publisher = {Prentice-Hall Englewood Cliffs, NJ},
year = {1972},
}
@inproceedings{DBLP:journals/corr/LuongLSVK15,
author = {Minh-Thang Luong and
Quoc V. Le and
......@@ -9805,26 +8811,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:journals/corr/abs-1808-09374,
author = {Xinyi Wang and
Hieu Pham and
Pengcheng Yin and
Graham Neubig},
title = {A Tree-based Decoder for Neural Machine Translation},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {4772--4777},
year = {2018}
}
@inproceedings{Tong2016Syntactic,
author = {Tong Xiao and
Jingbo Zhu and
Chunliang Zhang and
Tongran Liu},
title = {Syntactic Skeleton-Based Translation},
pages = {2856--2862},
publisher = {AAAI Conference on Artificial Intelligence},
year = {2016},
}
@inproceedings{DBLP:conf/emnlp/WangTWS19a,
author = {Xing Wang and
Zhaopeng Tu and
......@@ -9835,13 +8821,6 @@ author = {Zhuang Liu and
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2019}
}
@inproceedings{Liu2020LearningTE,
title={Learning to Encode Position for Transformer with Continuous Dynamical Model},
author={Xuanqing Liu and Hsiang-Fu Yu and Inderjit Dhillon and Cho-Jui Hsieh},
publisher={CoRR},
year={2020},
volume={abs/2003.09229}
}
@inproceedings{DBLP:conf/nips/ChenRBD18,
author = {Tian Qi Chen and
Yulia Rubanova and
......@@ -9852,27 +8831,6 @@ author = {Zhuang Liu and
pages = {6572--6583},
year = {2018}
}
@inproceedings{DBLP:journals/corr/LuongPM15,
author = {Thang Luong and
Hieu Pham and
Christopher D. Manning},
title = {Effective Approaches to Attention-based Neural Machine Translation},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {1412--1421},
year = {2015}
}
@inproceedings{Yang2018ModelingLF,
author = {Baosong Yang and
Zhaopeng Tu and
Derek F. Wong and
Fandong Meng and
Lidia S. Chao and
Tong Zhang},
title = {Modeling Localness for Self-Attention Networks},
pages = {4449--4458},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:conf/aaai/GuoQLXZ20,
author = {Qipeng Guo and
Xipeng Qiu and
......@@ -9884,33 +8842,6 @@ author = {Zhuang Liu and
publisher = {AAAI Conference on Artificial Intelligence},
year = {2020}
}
@inproceedings{Wu2019PayLA,
author = {Felix Wu and
Angela Fan and
Alexei Baevski and
Yann N. Dauphin and
Michael Auli},
title = {Pay Less Attention with Lightweight and Dynamic Convolutions},
publisher = {International Conference on Learning Representations},
year = {2019},
}
@inproceedings{DBLP:conf/interspeech/GulatiQCPZYHWZW20,
author = {Anmol Gulati and
James Qin and
Chung-Cheng Chiu and
Niki Parmar and
Yu Zhang and
Jiahui Yu and
Wei Han and
Shibo Wang and
Zhengdong Zhang and
Yonghui Wu and
Ruoming Pang},
title = {Conformer: Convolution-augmented Transformer for Speech Recognition},
pages = {5036--5040},
publisher = {International Speech Communication Association},
year = {2020}
}
@inproceedings{DBLP:conf/cvpr/XieGDTH17,
author = {Saining Xie and
Ross B. Girshick and
......@@ -9961,16 +8892,6 @@ author = {Zhuang Liu and
number={3},
year={2019},
}
@inproceedings{DBLP:conf/iclr/WuLLLH20,
author = {Zhanghao Wu and
Zhijian Liu and
Ji Lin and
Yujun Lin and
Song Han},
title = {Lite Transformer with Long-Short Range Attention},
publisher = {International Conference on Learning Representations},
year = {2020}
}
@inproceedings{DBLP:conf/iclr/DehghaniGVUK19,
author = {Mostafa Dehghani and
Stephan Gouws and
......@@ -9981,12 +8902,6 @@ author = {Zhuang Liu and
publisher = {International Conference on Learning Representations},
year = {2019}
}
@inproceedings{Lan2020ALBERTAL,
title={ALBERT: A Lite BERT for Self-supervised Learning of Language Representations},
author={Zhenzhong Lan and Mingda Chen and Sebastian Goodman and Kevin Gimpel and Piyush Sharma and Radu Soricut},
publisher={International Conference on Learning Representations},
year={2020}
}
@inproceedings{DBLP:conf/naacl/HaoWYWZT19,
author = {Jie Hao and
Xing Wang and
......@@ -10032,14 +8947,6 @@ author = {Zhuang Liu and
volume = {abs/2004.05150},
year = {2020}
}
@inproceedings{Kitaev2020ReformerTE,
author = {Nikita Kitaev and
Lukasz Kaiser and
Anselm Levskaya},
title = {Reformer: The Efficient Transformer},
publisher = {International Conference on Learning Representations},
year = {2020}
}
@inproceedings{DBLP:journals/corr/abs-2003-05997,
author = {Aurko Roy and
Mohammad Saffar and
......@@ -10050,13 +8957,6 @@ author = {Zhuang Liu and
volume = {abs/2003.05997},
year = {2020}
}
@inproceedings{Katharopoulos2020TransformersAR,
title={Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention},
author={Angelos Katharopoulos and Apoorv Vyas and Nikolaos Pappas and Fran{\c{c}}ois Fleuret},
publisher={CoRR},
year={2020},
volume={abs/2006.16236}
}
@inproceedings{DBLP:journals/corr/abs-2009-14794,
author = {Krzysztof Choromanski and
Valerii Likhosherstov and
......@@ -10099,17 +8999,6 @@ author = {Zhuang Liu and
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2018}
}
@inproceedings{DBLP:journals/corr/abs-2006-04768,
author = {Sinong Wang and
Belinda Z. Li and
Madian Khabsa and
Han Fang and
Hao Ma},
title = {Linformer: Self-Attention with Linear Complexity},
publisher = {CoRR},
volume = {abs/2006.04768},
year = {2020}
}
@inproceedings{DBLP:conf/nips/BergstraBBK11,
author = {James Bergstra and
R{\'{e}}mi Bardenet and
......@@ -10131,18 +9020,6 @@ author = {Zhuang Liu and
publisher = {Learning and Intelligent Optimization},
year = {2011}
}
@inproceedings{DBLP:conf/icml/BergstraYC13,
author = {James Bergstra and
Daniel Yamins and
David D. Cox},
title = {Making a Science of Model Search: Hyperparameter Optimization in Hundreds
of Dimensions for Vision Architectures},
series = {{JMLR} Workshop and Conference Proceedings},
volume = {28},
pages = {115--123},
publisher = {International Conference on Machine Learning},
year = {2013}
}
@inproceedings{DBLP:conf/iccv/ChenXW019,
author = {Xin Chen and
Lingxi Xie and
......@@ -10165,122 +9042,34 @@ author = {Zhuang Liu and
publisher = {International Conference on Machine Learning},
year = {2020}
}
@inproceedings{Jawahar2019WhatDB,
title={What Does BERT Learn about the Structure of Language?},
author={Ganesh Jawahar and Beno{\^{\i}}t Sagot and Djam{\'e} Seddah},
publisher={Annual Meeting of the Association for Computational Linguistics},
year={2019}
}
@inproceedings{DBLP:conf/emnlp/Ethayarajh19,
author = {Kawin Ethayarajh},
title = {How Contextual are Contextualized Word Representations? Comparing
the Geometry of BERT, ELMo, and {GPT-2} Embeddings},
pages = {55--65},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2019}
}
@inproceedings{DBLP:journals/corr/abs-1905-09418,
author = {Elena Voita and
David Talbot and
Fedor Moiseev and
Rico Sennrich and
Ivan Titov},
title = {Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy
Lifting, the Rest Can Be Pruned},
pages = {5797--5808},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019},
}
@inproceedings{Michel2019AreSH,
author = {Paul Michel and
Omer Levy and
Graham Neubig},
title = {Are Sixteen Heads Really Better than One?},
publisher = {Annual Conference on Neural Information Processing Systems},
pages = {14014--14024},
year = {2019}
}
@inproceedings{DBLP:conf/emnlp/LiTYLZ18,
author = {Jian Li and
Zhaopeng Tu and
Baosong Yang and
Michael R. Lyu and
Tong Zhang},
title = {Multi-Head Attention with Disagreement Regularization},
pages = {2897--2903},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2018}
}
@inproceedings{Su2018VariationalRN,
title={Variational Recurrent Neural Machine Translation},
author={Jinsong Su and Shan Wu and Deyi Xiong and Yaojie Lu and Xianpei Han and Biao Zhang},
publisher={AAAI Conference on Artificial Intelligence},
pages={5488--5495},
year={2018}
}
@inproceedings{DBLP:conf/acl/SetiawanSNP20,
author = {Hendra Setiawan and
Matthias Sperber and
Udhyakumar Nallasamy and
Matthias Paulik},
title = {Variational Neural Machine Translation with Normalizing Flows},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
@inproceedings{Li2020NeuralMT,
author = {Yanyang Li and
Qiang Wang and
Tong Xiao and
Tongran Liu and
Jingbo Zhu},
title = {Neural Machine Translation with Joint Representation},
pages = {8285--8292},
publisher = {AAAI Conference on Artificial Intelligence},
year = {2020}
}
@inproceedings{JMLR:v15:srivastava14a,
author = {Nitish Srivastava and Geoffrey Hinton and Alex Krizhevsky and Ilya Sutskever and Ruslan Salakhutdinov},
title = {Dropout: A Simple Way to Prevent Neural Networks from Overfitting},
publisher = {Journal of Machine Learning Research},
year = {2014},
volume = {15},
pages = {1929--1958},
}
@inproceedings{Szegedy_2016_CVPR,
author = {Christian Szegedy and
Vincent Vanhoucke and
Sergey Ioffe and
Jonathon Shlens and
Zbigniew Wojna},
title = {Rethinking the Inception Architecture for Computer Vision},
publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {2818--2826},
year = {2016},
}
@inproceedings{Chen2018TheBO,
author = {Mia Xu Chen and
Orhan Firat and
Ankur Bapna and
Melvin Johnson and
Wolfgang Macherey and
George F. Foster and
Llion Jones and
Mike Schuster and
Noam Shazeer and
Niki Parmar and
Ashish Vaswani and
Jakob Uszkoreit and
Lukasz Kaiser and
Zhifeng Chen and
Yonghui Wu and
Macduff Hughes},
title = {The Best of Both Worlds: Combining Recent Advances in Neural Machine
Translation},
pages = {76--86},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/naacl/GuoQLSXZ19,
author = {Qipeng Guo and
Xipeng Qiu and
......@@ -10312,6 +9101,16 @@ author = {Zhuang Liu and
publisher = {International Conference on Learning Representations},
year = {2018}
}
%%%%% chapter 15------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
......@@ -10575,20 +9374,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{2015OnGulcehre,
title = {On Using Monolingual Corpora in Neural Machine Translation},
author = {Caglar Gulcehre and
Orhan Firat and
Kelvin Xu and
Kyunghyun Cho and
Loic Barrault and
Huei-Chi Lin and
Fethi Bougares and
Holger Schwenk and
Yoshua Bengio},
publisher = {CoRR},
year = {2015},
}
@inproceedings{黄书剑0统计机器翻译中的词对齐研究,
title={统计机器翻译中的词对齐研究},
author={黄书剑},
......@@ -11087,18 +9872,6 @@ author = {Zhuang Liu and
publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{DBLP:conf/emnlp/KimPPKN19,
author = {Yunsu Kim and
Petre Petrov and
Pavel Petrushkov and
Shahram Khadivi and
Hermann Ney},
title = {Pivot-based Transfer Learning for Neural Machine Translation between
Non-English Languages},
pages = {866--876},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:journals/mt/WuW07,
author = {Hua Wu and
Haifeng Wang},
......@@ -11211,15 +9984,6 @@ author = {Zhuang Liu and
publisher = {International Joint Conference on Natural Language Processing},
year = {2011}
}
@inproceedings{DBLP:journals/corr/HintonVD15,
author = {Geoffrey E. Hinton and
Oriol Vinyals and
Jeffrey Dean},
title = {Distilling the Knowledge in a Neural Network},
publisher = {CoRR},
volume = {abs/1503.02531},
year = {2015}
}
@inproceedings{gu2018meta,
author = {Jiatao Gu and
Yong Wang and
......@@ -11541,25 +10305,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/emnlp/FiratSAYC16,
author = {Orhan Firat and
Baskaran Sankaran and
Yaser Al-Onaizan and
Fatos T. Yarman-Vural and
Kyunghyun Cho},
title = {Zero-Resource Translation with Multi-Lingual Neural Machine Translation},
pages = {268--277},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2016}
}
@inproceedings{DBLP:conf/emnlp/CurreyH19,
author = {Anna Currey and
Kenneth Heafield},
title = {Zero-Resource Neural Machine Translation with Monolingual Pivot Data},
pages = {99--107},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2019}
}
@inproceedings{DBLP:conf/acl/FadaeeBM17a,
author = {Marzieh Fadaee and
Arianna Bisazza and
......@@ -11609,15 +10354,6 @@ author = {Zhuang Liu and
year = {2008},
publisher = {International Conference on Machine Learning}
}
@inproceedings{DBLP:conf/iclr/LampleCDR18,
author = {Guillaume Lample and
Alexis Conneau and
Ludovic Denoyer and
Marc'Aurelio Ranzato},
title = {Unsupervised Machine Translation Using Monolingual Corpora Only},
publisher = {International Conference on Learning Representations},
year = {2018}
}
@inproceedings{DBLP:journals/coling/BhagatH13,
author = {Rahul Bhagat and
Eduard Hovy},
......@@ -11684,16 +10420,6 @@ author = {Zhuang Liu and
pages = {569--631},
year = {2019}
}
@inproceedings{DBLP:conf/acl/TuLLLL16,
author = {Zhaopeng Tu and
Zhengdong Lu and
Yang Liu and
Xiaohua Liu and
Hang Li},
title = {Modeling Coverage for Neural Machine Translation},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{DBLP:journals/tacl/TuLLLL17,
author = {Zhaopeng Tu and
Yang Liu and
......@@ -11748,29 +10474,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:conf/wmt/LiLXLLLWZXWFCLL19,
author = {Bei Li and
Yinqiao Li and
Chen Xu and
Ye Lin and
Jiqiang Liu and
Hui Liu and
Ziyang Wang and
Yuhao Zhang and
Nuo Xu and
Zeyang Wang and
Kai Feng and
Hexuan Chen and
Tengbo Liu and
Yanyang Li and
Qiang Wang and
Tong Xiao and
Jingbo Zhu},
title = {The NiuTrans Machine Translation Systems for {WMT19}},
pages = {257--266},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/nips/DaiL15,
author = {Andrew Dai and
Quoc Le},
......@@ -11779,19 +10482,6 @@ author = {Zhuang Liu and
publisher = {Annual Conference on Neural Information Processing Systems},
year = {2015}
}
@inproceedings{DBLP:journals/corr/abs-1802-05365,
author = {Matthew Peters and
Mark Neumann and
Mohit Iyyer and
Matt Gardner and
Christopher Clark and
Kenton Lee and
Luke Zettlemoyer},
title = {Deep Contextualized Word Representations},
pages = {2227--2237},
publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:conf/icml/CollobertW08,
author = {Ronan Collobert and
Jason Weston},
......@@ -11889,16 +10579,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/emnlp/ZhangZ16,
author = {Jiajun Zhang and
Chengqing Zong},
......@@ -12094,13 +10774,6 @@ author = {Zhuang Liu and
pages={117},
year={2015}
}
@inproceedings{chen2016bilingual,
title={Bilingual methods for adaptive training data selection for machine translation},
author={Chen, Boxing and Kuhn, Roland and Foster, George and Cherry, Colin and Huang, Fei},
publisher={Association for Machine Translation in the Americas},
pages={93--103},
year={2016}
}
@inproceedings{DBLP:conf/iwslt/Ueffing06,
author = {Nicola Ueffing},
title = {Using monolingual source-language data to improve {MT} performance},
......@@ -12272,15 +10945,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
@inproceedings{DBLP:conf/emnlp/AxelrodHG11,
author = {Amittai Axelrod and
Xiaodong He and
Jianfeng Gao},
title = {Domain Adaptation via Pseudo In-Domain Data Selection},
pages = {355--362},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2011}
}
@inproceedings{DBLP:conf/icdm/Remus12,
author = {Robert Remus},
title = {Domain Adaptation Using Domain Similarity- and Domain Complexity-Based
......@@ -12309,13 +10973,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{britz2017effective,
title={Effective domain mixing for neural machine translation},
author={Britz, Denny and Le, Quoc and Pryzant, Reid},
publisher={Proceedings of the Second Conference on Machine Translation},
pages={118--126},
year={2017}
}
@inproceedings{DBLP:conf/ranlp/KobusCS17,
author = {Catherine Kobus and
Josep Maria Crego and
......@@ -12326,27 +10983,6 @@ author = {Zhuang Liu and
Language Processing},
year = {2017}
}
@inproceedings{DBLP:conf/emnlp/WangULCS17,
author = {Rui Wang and
Masao Utiyama and
Lemao Liu and
Kehai Chen and
Eiichiro Sumita},
title = {Instance Weighting for Neural Machine Translation Domain Adaptation},
pages = {1482--1488},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2017}
}
@inproceedings{DBLP:conf/aclnmt/ChenCFL17,
author = {Boxing Chen and
Colin Cherry and
George F. Foster and
Samuel Larkin},
title = {Cost Weighting for Neural Machine Translation Domain Adaptation},
pages = {40--46},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:journals/corr/abs-1906-03129,
author = {Shen Yan and
Leonard Dahlmann and
......@@ -12432,15 +11068,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/wmt/BritzLP17,
author = {Denny Britz and
Quoc V. Le and
Reid Pryzant},
title = {Effective Domain Mixing for Neural Machine Translation},
pages = {118--126},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:journals/ibmrd/Luhn58,
author = {Hans Peter Luhn},
title = {The Automatic Creation of Literature Abstracts},
......@@ -12468,27 +11095,6 @@ author = {Zhuang Liu and
publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/emnlp/WeesBM17,
author = {Marlies van der Wees and
Arianna Bisazza and
Christof Monz},
title = {Dynamic Data Selection for Neural Machine Translation},
pages = {1400--1410},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2017}
}
@inproceedings{DBLP:conf/naacl/ZhangSKMCD19,
author = {Xuan Zhang and
Pamela Shapiro and
Gaurav Kumar and
Paul McNamee and
Marine Carpuat and
Kevin Duh},
title = {Curriculum Learning for Domain Adaptation in Neural Machine Translation},
pages = {1903--1915},
publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/acl/ChuDK17,
author = {Chenhui Chu and
Raj Dabre and
......@@ -12564,18 +11170,6 @@ author = {Zhuang Liu and
publisher = {Annual Conference of the European Association for Machine Translation},
year = {2017}
}
@inproceedings{DBLP:conf/aaai/Zhang0LZC18,
author = {Zhirui Zhang and
Shujie Liu and
Mu Li and
Ming Zhou and
Enhong Chen},
title = {Joint Training for Neural Machine Translation Models with Monolingual
Data},
pages = {555--562},
publisher = {AAAI Conference on Artificial Intelligence},
year = {2018}
}
@inproceedings{DBLP:conf/wmt/SunJXHWW19,
author = {Meng Sun and
Bojian Jiang and
......@@ -12794,19 +11388,6 @@ author = {Zhuang Liu and
publisher = {International Conference on Machine Learning},
year = {2018}
}
@inproceedings{DBLP:conf/nips/HeXQWYLM16,
author = {Di He and
Yingce Xia and
Tao Qin and
Liwei Wang and
Nenghai Yu and
Tie-Yan Liu and
Wei-Ying Ma},
title = {Dual Learning for Machine Translation},
publisher = {Annual Conference on Neural Information Processing Systems},
pages = {820--828},
year = {2016}
}
@article{zhao2020dual,
title={Dual Learning: Theoretical Study and an Algorithmic Extension},
author={Zhao, Zhibing and Xia, Yingce and Qin, Tao and Xia, Lirong and Liu, Tie-Yan},
......@@ -12832,12 +11413,6 @@ author = {Zhuang Liu and
volume = {abs/1901.09115},
year = {2019}
}
@book{jurafsky2000speech,
title={Speech \& Language Processing},
author={Jurafsky, Dan and Martin, James H.},
year={2000},
publisher={Pearson Education India}
}
@inproceedings{DBLP:conf/anlp/MarcuCW00,
author = {Daniel Marcu and
Lynn Carlson and
......@@ -12936,19 +11511,10 @@ author = {Zhuang Liu and
author = {Thomas Meyer and
Andrei Popescu-Belis},
title = {Using Sense-labeled Discourse Connectives for Statistical Machine
Translation},
pages = {129--138},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2012}
}
@inproceedings{DBLP:conf/nips/SutskeverVL14,
author = {Ilya Sutskever and
Oriol Vinyals and
Quoc V. Le},
title = {Sequence to Sequence Learning with Neural Networks},
pages = {3104--3112},
year = {2014},
publisher = {Annual Conference on Neural Information Processing Systems}
}
@inproceedings{DBLP:conf/emnlp/LaubliS018,
author = {Samuel L{\"{a}}ubli and
......@@ -12995,16 +11561,6 @@ author = {Zhuang Liu and
volume = {abs/1704.05135},
year = {2017}
}
@inproceedings{DBLP:conf/acl/TitovSSV18,
author = {Elena Voita and
Pavel Serdyukov and
Rico Sennrich and
Ivan Titov},
title = {Context-Aware Neural Machine Translation Learns Anaphora Resolution},
pages = {1264--1274},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:conf/acl/HaffariM18,
author = {Sameen Maruf and
Gholamreza Haffari},
......@@ -13129,14 +11685,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:conf/iclr/KitaevKL20,
author = {Nikita Kitaev and
Lukasz Kaiser and
Anselm Levskaya},
title = {Reformer: The Efficient Transformer},
publisher = {International Conference on Learning Representations},
year = {2020}
}
@inproceedings{agrawal2018contextual,
title={Contextual handling in neural machine translation: Look behind, ahead and on both sides},
author={Agrawal, Ruchit Rajeshkumar and Turchi, Marco and Negri, Matteo},
......@@ -13144,17 +11692,6 @@ author = {Zhuang Liu and
pages={11--20},
year={2018}
}
@inproceedings{DBLP:conf/emnlp/WerlenRPH18,
author = {Lesly Miculicich Werlen and
Dhananjay Ram and
Nikolaos Pappas and
James Henderson},
title = {Document-Level Neural Machine Translation with Hierarchical Attention
Networks},
pages = {2947--2954},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2018}
}
@inproceedings{DBLP:conf/naacl/MarufMH19,
author = {Sameen Maruf and
Andr{\'{e}} F. T. Martins and
......@@ -13230,21 +11767,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:conf/acl/LiLWJXZLL20,
author = {Bei Li and
Hui Liu and
Ziyang Wang and
Yufan Jiang and
Tong Xiao and
Jingbo Zhu and
Tongran Liu and
Changliang Li},
title = {Does Multi-Encoder Help? {A} Case Study on Context-Aware Neural Machine
Translation},
pages = {3512--3518},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
@inproceedings{DBLP:conf/discomt/KimTN19,
author = {Yunsu Kim and
Duc Thanh Tran and
......@@ -13364,21 +11886,6 @@ author = {Zhuang Liu and
volume = {abs/1911.03110},
year = {2019}
}
@article{DBLP:journals/tacl/LiuGGLEGLZ20,
author = {Yinhan Liu and
Jiatao Gu and
Naman Goyal and
Xian Li and
Sergey Edunov and
Marjan Ghazvininejad and
Mike Lewis and
Luke Zettlemoyer},
title = {Multilingual Denoising Pre-training for Neural Machine Translation},
journal = {Transactions of the Association for Computational Linguistics},
volume = {8},
pages = {726--742},
year = {2020}
}
@inproceedings{DBLP:conf/wmt/MarufMH18,
author = {Sameen Maruf and
Andr{\'{e}} F. T. Martins and
......@@ -13480,17 +11987,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/acl/LiuTMCZ18,
author = {Yong Cheng and
Zhaopeng Tu and
Fandong Meng and
Junjie Zhai and
Yang Liu},
title = {Towards Robust Neural Machine Translation},
pages = {1756--1766},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:conf/naacl/DuongACBC16,
author = {Long Duong and
Antonios Anastasopoulos and
......@@ -14262,20 +12758,6 @@ author = {Zhuang Liu and
publisher = {International Conference on Learning Representations},
year = {2020}
}
@inproceedings{DBLP:conf/nips/GoodfellowPMXWOCB14,
author = {Ian J. Goodfellow and
Jean Pouget-Abadie and
Mehdi Mirza and
Bing Xu and
David Warde-Farley and
Sherjil Ozair and
Aaron C. Courville and
Yoshua Bengio},
title = {Generative Adversarial Nets},
publisher = {Annual Conference on Neural Information Processing Systems},
pages = {2672--2680},
year = {2014}
}
@inproceedings{DBLP:conf/nips/ZhuZPDEWS17,
author = {Jun-Yan Zhu and
Richard Zhang and
......@@ -14320,16 +12802,6 @@ author = {Zhuang Liu and
publisher = {International Conference on Computer Vision},
year = {2017}
}
@inproceedings{DBLP:conf/iccv/YiZTG17,
author = {Zili Yi and
Hao (Richard) Zhang and
Ping Tan and
Minglun Gong},
title = {DualGAN: Unsupervised Dual Learning for Image-to-Image Translation},
pages = {2868--2876},
publisher = {International Conference on Computer Vision},
year = {2017}
}
@inproceedings{DBLP:conf/nips/LiuBK17,
author = {Ming-Yu Liu and
Thomas Breuel and
......@@ -14584,24 +13056,6 @@ author = {Zhuang Liu and
pages = {163--185},
year = {2017}
}
@article{Peris2017InteractiveNM,
title={Interactive neural machine translation},
author={{\'A}lvaro Peris and Miguel Domingo and Francisco Casacuberta},
journal={Computer Speech and Language},
year={2017},
volume={45},
pages={201--220}
}
@article{DBLP:journals/csl/PerisC19,
author = {{\'{A}}lvaro Peris and
Francisco Casacuberta},
title = {Online learning for effort reduction in interactive neural machine
translation},
journal = {Computer Speech and Language},
volume = {58},
pages = {98--126},
year = {2019}
}
@inproceedings{DBLP:journals/coling/BarrachinaBCCCKLNTVV09,
author = {Sergio Barrachina and
Oliver Bender and
......@@ -14670,16 +13124,6 @@ author = {Zhuang Liu and
volume = {abs/1702.07811},
year = {2017}
}
@inproceedings{DBLP:conf/emnlp/WangXZ20,
author = {Qiang Wang and
Tong Xiao and
Jingbo Zhu},
title = {Training Flexible Depth Model by Multi-Task Learning for Neural Machine
Translation},
pages = {4307--4312},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2020}
}
@inproceedings{DBLP:conf/ijcai/ChenCWL20,
author = {Guanhua Chen and
Yun Chen and
......@@ -14762,18 +13206,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:conf/naacl/ThompsonGKDK19,
author = {Brian Thompson and
Jeremy Gwinnup and
Huda Khayrallah and
Kevin Duh and
Philipp Koehn},
title = {Overcoming Catastrophic Forgetting During Domain Adaptation of Neural
Machine Translation},
pages = {2062--2068},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/aclnmt/KhayrallahTDK18,
author = {Huda Khayrallah and
Brian Thompson and
......@@ -14785,12 +13217,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@article{barone2017regularization,
title={Regularization techniques for fine-tuning in neural machine translation},
author={Barone, Antonio Valerio Miceli and Haddow, Barry and Germann, Ulrich and Sennrich, Rico},
journal={CoRR},
volume={abs/1707.09920},
year={2017}
}
@inproceedings{DBLP:journals/corr/ChuDK17,
author = {Chenhui Chu and
Raj Dabre and
......@@ -14801,15 +13227,6 @@ author = {Zhuang Liu and
volume = {abs/1701.03214},
year = {2017}
}
@inproceedings{DBLP:conf/coling/GuF20,
author = {Shuhao Gu and
Yang Feng},
title = {Investigating Catastrophic Forgetting During Continual Training for
Neural Machine Translation},
pages = {4315--4326},
publisher = {International Committee on Computational Linguistics},
year = {2020}
}
%%%%% chapter 18------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
......@@ -14885,15 +13302,6 @@ author = {Zhuang Liu and
pages={197--216},
year={2012}
}
@inproceedings{DBLP:conf/naacl/DyerCS13,
author = {Chris Dyer and
Victor Chahuneau and
Noah A. Smith},
title = {A Simple, Fast, and Effective Reparameterization of {IBM} Model 2},
pages = {644--648},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2013}
}
@inproceedings{al2016theano,
author = {Rami Al-Rfou and
Guillaume Alain and
......@@ -15013,63 +13421,6 @@ author = {Zhuang Liu and
volume = {abs/1605.02688},
year = {2016}
}
@inproceedings{DBLP:journals/corr/SennrichFCBHHJL17,
author = {Rico Sennrich and
Orhan Firat and
Kyunghyun Cho and
Barry Haddow and
Alexandra Birch and
Julian Hitschler and
Marcin Junczys-Dowmunt and
Samuel L{\"{a}}ubli and
Antonio Valerio Miceli Barone and
Jozef Mokry and
Maria Nadejde},
title = {Nematus: a Toolkit for Neural Machine Translation},
publisher = {Annual Conference of the European Association for Machine Translation},
pages = {65--68},
year = {2017}
}
@inproceedings{Koehn2007Moses,
author = {Philipp Koehn and
Hieu Hoang and
Alexandra Birch and
Chris Callison-Burch and
Marcello Federico and
Nicola Bertoldi and
Brooke Cowan and
Wade Shen and
Christine Moran and
Richard Zens and
Chris Dyer and
Ondrej Bojar and
Alexandra Constantin and
Evan Herbst},
title = {Moses: Open Source Toolkit for Statistical Machine Translation},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2007}
}
@inproceedings{zollmann2007the,
author = {Andreas Zollmann and
Ashish Venugopal and
Matthias Paulik and
Stephan Vogel},
title = {The Syntax Augmented {MT} {(SAMT)} System at the Shared Task for the
2007 {ACL} Workshop on Statistical Machine Translation},
pages = {216--219},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2007}
}
@article{och2003systematic,
author = {Franz Josef Och and
Hermann Ney},
title = {A Systematic Comparison of Various Statistical Alignment Models},
journal = {Computational Linguistics},
volume = {29},
number = {1},
pages = {19--51},
year = {2003}
}
@inproceedings{zoph2016simple,
author = {Barret Zoph and
Ashish Vaswani and
......@@ -15080,50 +13431,6 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{Ottfairseq,
author = {Myle Ott and
Sergey Edunov and
Alexei Baevski and
Angela Fan and
Sam Gross and
Nathan Ng and
David Grangier and
Michael Auli},
title = {fairseq: {A} Fast, Extensible Toolkit for Sequence Modeling},
pages = {48--53},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{Vaswani2018Tensor2TensorFN,
author = {Ashish Vaswani and
Samy Bengio and
Eugene Brevdo and
Fran{\c{c}}ois Chollet and
Aidan N. Gomez and
Stephan Gouws and
Llion Jones and
Lukasz Kaiser and
Nal Kalchbrenner and
Niki Parmar and
Ryan Sepassi and
Noam Shazeer and
Jakob Uszkoreit},
title = {Tensor2Tensor for Neural Machine Translation},
pages = {193--199},
publisher = {Association for Machine Translation in the Americas},
year = {2018}
}
@inproceedings{KleinOpenNMT,
author = {Guillaume Klein and
Yoon Kim and
Yuntian Deng and
Jean Senellart and
Alexander M. Rush},
title = {OpenNMT: Open-Source Toolkit for Neural Machine Translation},
pages = {67--72},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{luong2016acl_hybrid,
author = {Minh-Thang Luong and
Christopher D. Manning},
......