Commit 4bd516e9 by zengxin

chapter7 bib

parent 19f0d8e5
......@@ -134,7 +134,7 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Word Segmentation}
\parinterval Word segmentation is the first step of data processing; the underlying techniques were already discussed in Chapter 2. For languages without explicit word boundaries, such as Chinese, segmentation strategies are usually fairly involved. Commonly used Chinese segmentation tools include NLTK\cite{DBLP:conf/acl/Bird06} and jieba\footnote{\url{https://github.com/fxsjy/jieba}}. For languages with word boundaries, such as English, segmentation is much simpler; for instance, the Moses toolkit ships tokenization scripts that handle most Latin-script languages\cite{Koehn2007Moses}. Figure~\ref{fig:7-4} shows a Chinese-English sentence pair after segmentation.
%----------------------------------------------
% 图7.4
......@@ -288,9 +288,9 @@
%%%%%%%%%%%%%%%%%%
\subsubsection{Subwords}
\parinterval One way to address the open-vocabulary translation problem is to modify the structure of the output layer\cite{garciamartinez:hal-01433161}\cite{DBLP:conf/acl/JeanCMB15}, for example, by replacing the original Softmax layer with a more efficient neural network structure that can predict over a very large vocabulary. However, such methods require changes to the system, and the adjusted model structure and training procedure increase the cost of development and debugging. Moreover, they still do not solve the OOV problem, so they are rarely used in practical systems.

\parinterval Another idea is to leave the machine translation system unchanged and alleviate the OOV problem from the data-processing side instead. Since using words leads to data sparseness, it is natural to use smaller units, for example, to take characters as the smallest translation unit \footnote{For Chinese, a character here means a Chinese character (hanzi).} \ \dash \ that is, character-based translation models\cite{DBLP:journals/tacl/LeeCH17}. Taking English as an example, a character table containing the 26 letters, the digits, and a few special symbols is enough to represent every word.
\parinterval However, character-level translation raises a new problem \ \dash\ characters make it harder for the system to capture collocations between linguistic units. If a word consists of five characters on average, the sequence to be processed becomes five times longer, so linguistic units with independent meanings must interact across much longer distances. In addition, character-based methods break the natural word-formation regularities inside words, that is, the local dependencies among the characters of a word. For example, ``tele'' and ``phone'' in the English word ``telephone'' are affixes with concrete meanings, but these meanings are lost once the word is split into characters.
......@@ -323,7 +323,7 @@
%%%%%%%%%%%%%%%%%%
\subsubsection{Byte Pair Encoding (BPE)}
\parinterval {\small\bfnew{Byte Pair Encoding}}\index{Byte Pair Encoding}(BPE)\index{BPE} is a widely used method for building subword vocabularies\cite{DBLP:conf/acl/SennrichHB16a}. BPE was first used for data compression: frequent consecutive strings in the data are replaced by a symbol that does not occur in it, and a table of replacements is kept so that the compressed data can be restored. Machine translation borrows this idea and treats subword segmentation as the problem of learning a compressed encoding of natural language sentences\cite{philipAlgorithmfordataCompression}. The goal is for the encoded result (i.e., the subword segmentation) to occupy as few bytes as possible; in this way, subword units are reused across as many different words as possible, while the segmented sequences do not become overly long due to units that are too small. Building a subword vocabulary with BPE involves the following steps:
\begin{itemize}
\vspace{0.3em}
......@@ -371,16 +371,16 @@
\parinterval Since the model output is also a sequence of subwords, the final translation must undergo subword restoration, i.e., the units expressed as subwords are recombined into the original words. This step is straightforward: each subword is merged with the following one until the end-of-word symbol <e> is reached, at which point a complete word has been obtained.
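\parinterval The restoration step can be written in a few lines. The following Python sketch assumes, purely for illustration, that the segmenter attaches the end-of-word symbol <e> to the last subword of every word; the function name and the marker convention are assumptions, not the only possible choice.
\begin{verbatim}
# Minimal sketch of subword restoration: merge subwords until the
# end-of-word symbol "<e>" is met, then emit the assembled word.
def restore_words(subwords):
    words, current = [], ""
    for piece in subwords:
        if piece.endswith("<e>"):
            current += piece[:-len("<e>")]
            words.append(current)      # a full word has been assembled
            current = ""
        else:
            current += piece           # keep merging
    return words

print(restore_words(["hel", "lo<e>", "wor", "ld<e>"]))  # ['hello', 'world']
\end{verbatim}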
\parinterval There are many ways to apply BPE. Subword segmentation can be performed separately for the source and target languages, or jointly over both, which is known as Joint-BPE\cite{DBLP:conf/acl/SennrichHB16a}. Monolingual BPE is simple and direct, whereas Joint-BPE makes the subword segmentations of the two languages more consistent. For languages from related families, such as English and German, Joint-BPE is often used to build a shared vocabulary, while for very different pairs such as Chinese-English, subword segmentation is performed independently.

\parinterval BPE also has many variants. During segmentation, BPE splits starting from the longest subword, a heuristic that guarantees a unique segmentation. In fact, when a word is segmented with the same subword vocabulary, several segmentations may exist; for example, ``hello'' can be split into ``hell'' and ``o'', or into ``h'' and ``ello''. This segmentation diversity can be exploited to improve the robustness of neural machine translation systems\cite{DBLP:conf/acl/Kudo18}. Pre-trained models such as T5\cite{DBLP:journals/corr/abs-1910-10683} use character-level BPE. Moreover, although BPE is called byte pair encoding, it usually operates on Unicode characters rather than bytes; the pre-trained model GPT-2 explored byte-level BPE and obtained good results on tasks such as machine translation and question answering\cite{radford2019language}.
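\parinterval The merge loop at the core of BPE vocabulary learning is compact enough to show directly. The following Python fragment is a simplified illustration on a toy word-frequency table, in the spirit of the procedure described above but without the efficiency tricks of real toolkits.
\begin{verbatim}
import re, collections

# Count adjacent symbol pairs over a vocabulary that maps
# space-separated symbol sequences (one word each) to frequencies.
def get_pair_stats(vocab):
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

# Replace every occurrence of the chosen pair by the merged symbol.
def merge_pair(pair, vocab):
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in vocab.items()}

vocab = {"l o w <e>": 5, "l o w e r <e>": 2,
         "n e w e s t <e>": 6, "w i d e s t <e>": 3}
for _ in range(10):                  # number of merges = vocabulary budget
    stats = get_pair_stats(vocab)
    best = max(stats, key=stats.get) # most frequent pair becomes a subword
    vocab = merge_pair(best, vocab)
\end{verbatim}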
%%%%%%%%%%%%%%%%%%
\subsubsection{Other Methods}
\parinterval Unlike the statistics-based BPE algorithm, methods based on WordPiece and on the 1-gram language model (ULM) use a language model to construct the subword vocabulary\cite{DBLP:conf/acl/Kudo18}. In essence, language-model-based methods follow the same idea as BPE: new subwords are generated by repeatedly merging characters and subwords; they differ only in how the merges are chosen. BPE merges the most frequent adjacent character 2-gram into a new subword, whereas language-model-based methods choose which subwords to merge according to language model probabilities.

\parinterval Concretely, the WordPiece-based method first splits sentences into character sequences\cite{6289079} and trains a 1-gram language model on this data, denoted $\textrm{logP}(\cdot)$. If two adjacent subword units $a$ and $b$ are merged into a new subword $c$, the language model score of the whole sentence changes by $\triangle=\textrm{logP}(c)-\textrm{logP}(a)-\textrm{logP}(b)$. The two subword units that maximize $\triangle$ are therefore merged repeatedly, until the preset vocabulary size is reached or the gain in sentence probability falls below a threshold. The ULM method instead builds the vocabulary by maximizing the probability of whole sentences\cite{DBLP:conf/acl/Kudo18}; its implementation differs from the WordPiece-based method and is not detailed here.
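\parinterval The merge criterion above can be made concrete with a small sketch. The unigram counts and candidate pairs below are toy assumptions used only to show how $\triangle$ is computed and compared; a real system would estimate the 1-gram model on the training corpus.
\begin{verbatim}
import math

# Score a candidate merge (a, b) -> a+b by the change in 1-gram
# log-probability: delta = logP(ab) - logP(a) - logP(b).
def merge_gain(counts, a, b):
    total = sum(counts.values())
    logp = lambda u: math.log(counts.get(u, 1) / total)  # crude estimate
    return logp(a + b) - logp(a) - logp(b)

counts = {"th": 50, "e": 400, "the": 30, "re": 60, "he": 70}
candidates = [("th", "e"), ("he", "re")]
best = max(candidates, key=lambda p: merge_gain(counts, *p))
print(best)   # the pair with the largest score gain is merged first
\end{verbatim}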
\parinterval Representing sentences with subwords effectively balances the vocabulary size and increases the coverage of unseen words. For tasks such as English-German or Chinese-English translation, a subword vocabulary of 16k or 32k entries is already sufficient for good results.
......@@ -541,7 +541,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
\parinterval Neural machine translation is a typical multi-layer neural network. On the one hand, suitable connection patterns and activation functions can be designed to capture complex translation phenomena; on the other hand, ever larger amounts of available data allow models to be trained more effectively. When training data are plentiful, designing more ``complex'' models has become an effective way to improve system performance. For example, the Transformer model has two common configurations, Transformer-Base and Transformer-Big; Transformer-Big uses more neurons than Transformer-Base and accordingly delivers better translation quality\cite{vaswani2017attention}.
\parinterval Are there other methods of this kind that improve system performance? The answer is clearly yes. Here, such methods are collectively called large-capacity-model methods. From the traditional machine learning point of view, the performance of a neural network depends not only on its architecture but also on its capacity. What, then, is the {\small\bfnew{capacity}}\index{Capacity}(Capacity) of a model? Intuitively, capacity is the number of parameters of the network, i.e., the number of connection weights between neurons. Another definition views capacity as the size of the hypothesis space that the network can represent\cite{DBLP:journals/nature/LeCunBH15}, that is, the space formed by all the different functions the network can express.
\parinterval Learning a neural network then means finding an ``optimal'' function that fits the data accurately. When the hypothesis space grows, the training system has a chance of finding a better function, but it also needs more training samples to complete the search. Conversely, when the hypothesis space shrinks, the search becomes easier, but many good functions may not be contained in the space at all. This reflects a simple trade-off: paying a higher training (search) cost gives a better chance of finding a better solution, whereas spending less effort on training (search) means using a smaller hypothesis space and smaller training sets, at the price of a possibly worse solution.
......@@ -562,7 +562,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
\parinterval Wide networks usually refer to networks with larger hidden-layer dimensions and are widely used in both image processing and natural language processing. Chapter 5 already showed that a multi-layer feed-forward network with enough neurons can approximate arbitrarily complex continuous functions\cite{Hornic1989Multilayer}, which to some extent illustrates how important the number of neurons is in neural network modeling.
\parinterval Increasing the number of hidden-layer neurons is one of the basic ways to widen a network. For example, the {\small\bfnew{Wide Residual Network}}\index{Wide Residual Network} proposed in image processing uses larger convolution kernels to improve the accuracy of each convolution\cite{DBLP:conf/bmvc/ZagoruykoK16}; in neural machine translation, the widely adopted Transformer-Big model\cite{NIPS2017_7181} is likewise a typical wide network. Compared with the Transformer-Base baseline, Transformer-Big enlarges the hidden-layer and filter (Filter) dimensions and achieves significant improvements in translation performance. Table~\ref{tab:Parameter-setting} lists the corresponding parameter settings.
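\parinterval For quick reference, the commonly cited settings of the two configurations can be written as a small configuration sketch. These are indicative values reported for the original Transformer models and should be checked against Table~\ref{tab:Parameter-setting}; a particular system may use different numbers.
\begin{verbatim}
# Commonly cited hyper-parameters of the two configurations
# (indicative values, not the settings of any particular system).
transformer_base = {"hidden_size": 512,  "filter_size": 2048,
                    "attention_heads": 8,  "encoder_layers": 6}
transformer_big  = {"hidden_size": 1024, "filter_size": 4096,
                    "attention_heads": 16, "encoder_layers": 6}
\end{verbatim}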
%----------------------------------------------
% 表
......@@ -626,12 +626,12 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
\parinterval As mentioned above, the raw input of neural machine translation is a sequence of words, on both the source and the target side. The input layer of the model converts this discrete word representation into real-valued vectors, the well-known {\small\bfnew{word embeddings}}\index{Embedding}(Embedding). From an implementation perspective, the input layer simply looks up the corresponding word vector in an embedding matrix whose two dimensions correspond to the vocabulary size and the embedding dimension. The embedding dimension also reflects how finely the model can characterize words, so increasing it appropriately is another way to increase model capacity. Usually the embedding dimension is kept equal to the hidden-layer dimension, a design that also simplifies implementation.
\parinterval Of course, a larger embedding dimension is not always better. Essentially, word embeddings must distinguish words with different meanings in a multi-dimensional space. With a large vocabulary, a larger embedding dimension is more meaningful, because more ``features'' are needed to describe more meanings; with a small vocabulary, enlarging the embeddings may bring no gain while increasing the computational burden. Another strategy is to choose embedding dimensions dynamically, for example, larger embeddings for frequent words and smaller ones for rare words\cite{DBLP:conf/iclr/BaevskiA19}. In this way a larger vocabulary can be handled with the same number of parameters.
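\parinterval A minimal sketch of this frequency-dependent strategy is given below. The band boundaries and dimensions are invented for illustration; published adaptive-input models choose them per task and tie the bands together with projection layers, which is omitted here.
\begin{verbatim}
# Toy sketch: frequent words (small rank) get large embeddings,
# rare words get small ones. Bands and sizes are illustrative only.
def embedding_dim(rank, bands=((20000, 1024),
                               (60000, 256),
                               (float("inf"), 64))):
    for cutoff, dim in bands:
        if rank < cutoff:
            return dim

print(embedding_dim(10))       # very frequent word -> 1024
print(embedding_dim(250000))   # rare word -> 64
\end{verbatim}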
%%%%%%%%%%%%%%%%%%
\subsubsection{Distributed Computation for Large Models}
\parinterval As model capacity grows, complex models may no longer be trainable on a single GPU. Even the relatively modest Transformer-Base model needs to be trained on 8 GPUs for many tasks, so training large models in parallel on multiple devices is a very practical problem. A simple strategy is {\small\bfnew{data parallelism}}\index{Data Parallelism}(Data Parallelism): a batch is split across several GPUs, and the gradients computed on the GPUs are aggregated before the parameters are updated. When the model grows beyond a certain size, however, a single GPU may still be unable to hold it, which is especially problematic when GPU memory is small. In that case {\small\bfnew{model parallelism}}\index{Model Parallelism}(Model Parallelism) must be considered: the model is partitioned into parts that run on different GPUs. For example, when training deep LSTM models, different layers can be placed on different GPUs, which speeds up training to some extent. The same strategy can be applied to larger models such as the BERT-Large model with on the order of a billion parameters\cite{DBLP:conf/naacl/DevlinCLT19}. Note, however, that the communication latency between devices in model parallelism greatly reduces efficiency, so a balance between training efficiency and model performance often has to be struck.
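\parinterval The core of data parallelism can be illustrated without committing to any deep learning framework. In the sketch below, the gradient function is a stand-in for a real backward pass, and the explicit averaging plays the role that an all-reduce operation plays in a real multi-GPU system.
\begin{verbatim}
import numpy as np

def toy_gradient(params, batch):        # placeholder for a backward pass
    return np.mean(batch, axis=0) - params

def data_parallel_step(params, batch, num_devices=4, lr=0.1):
    shards = np.array_split(batch, num_devices)   # one shard per device
    grads = [toy_gradient(params, shard) for shard in shards]
    avg_grad = np.mean(grads, axis=0)             # all-reduce in practice
    return params - lr * avg_grad

params = np.zeros(8)
batch = np.random.randn(64, 8)
params = data_parallel_step(params, batch)
\end{verbatim}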
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Large-Batch Training}
......@@ -657,7 +657,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
\end{figure}
%----------------------------------------------
\parinterval In addition, previous work has shown that training complex networks with large batches should be paired with a slightly larger learning rate, which speeds up updates along the gradient direction and leads to better translation performance\cite{DBLP:conf/wmt/OttEGA18}. Deep networks, for example, also require an appropriately adjusted learning rate to perform well. Table~\ref{tab:BLEU-under-different-batches-and-peak-learning-rate} shows the BLEU scores of a 30-layer network under different batch sizes and peak learning rates (WMT14 En-De)\footnote{The peak learning rate is the maximum value that the learning rate reaches during the warmup phase of Transformer training.}. With the peak learning rate fixed, increasing the batch size brings no gain; the peak must be adjusted at the same time. Other work has also verified that when Transformer-Big is trained on 128 GPUs in a distributed setting, appropriately increasing the learning rate yields clear BLEU improvements\cite{DBLP:conf/wmt/OttEGA18}.
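\parinterval To make the notion of a peak explicit, the widely used Transformer learning-rate schedule can be sketched as follows. The schedule warms up linearly and then decays with the inverse square root of the step; the peak mentioned in the footnote is reached exactly at the end of warmup, and scaling this schedule is one way to realize the larger learning rates discussed above.
\begin{verbatim}
# Inverse-square-root schedule with linear warmup, as commonly used
# for Transformer training. The peak is reached at step == warmup_steps.
def transformer_lr(step, d_model=512, warmup_steps=4000, scale=1.0):
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5,
                                         step * warmup_steps ** -1.5)

peak = transformer_lr(4000)    # about 7.0e-4 for these settings
\end{verbatim}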
%----------------------------------------------
% 表
......@@ -699,7 +699,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
\item Batching by number of words: compared with batching by sentence length, batching by word count avoids batches whose sentences are all particularly long or particularly short, and keeps the total number of words per batch within roughly the same range, so that the resulting gradients are comparable. A common practice is to build batches according to the number of source-language words, the number of target-language words, or the maximum of the two.
\vspace{0.3em}
\item Batching in a curriculum-learning fashion: taking the ``difficulty'' of samples into account is another batching strategy. For example, following the idea of {\small\bfnew{curriculum learning}}\index{Curriculum Learning}(Curriculum Learning)\cite{DBLP:conf/icml/BengioLCW09}, the system first learns from ``easy'' samples and then gradually moves to harder ones, learning step by step. Concretely, indicators such as sentence length and word frequency can be used to compute a ``difficulty'' $d$ for each sample; a batch is then built from samples satisfying $d \leq c$, where $c$ is a difficulty threshold that is gradually increased as training proceeds. A small sketch of these batching strategies is given after this list.
\vspace{0.3em}
\end{itemize}
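\parinterval The following Python sketch illustrates two of the strategies above: batching by token count and the curriculum-style filter. Sentence length is used as the difficulty measure purely for illustration, and the token budget and threshold are arbitrary choices.
\begin{verbatim}
# Batching by token count: fill each batch until a token budget is hit.
def make_batches_by_tokens(pairs, max_tokens=4096):
    batches, batch, tokens = [], [], 0
    for src, tgt in sorted(pairs, key=lambda p: max(len(p[0]), len(p[1]))):
        cost = max(len(src), len(tgt))        # count by the longer side
        if batch and tokens + cost > max_tokens:
            batches.append(batch)
            batch, tokens = [], 0
        batch.append((src, tgt))
        tokens += cost
    if batch:
        batches.append(batch)
    return batches

# Curriculum-style filter: only samples with difficulty d <= c are used,
# and the threshold c is raised as training proceeds.
def curriculum_filter(pairs, threshold_c):
    return [(s, t) for s, t in pairs if max(len(s), len(t)) <= threshold_c]
\end{verbatim}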
......@@ -725,7 +725,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
\parinterval Accuracy is usually what researchers care about most. As a search process, inference must find one or several samples in the sample space according to some criterion (such as the model score). Machine translation, however, is an NP-hard problem, and exhaustively searching the whole sample space is clearly infeasible\cite{Knight1999Decoding}. The search algorithm therefore has to be optimized and combined with strategies such as pruning, so that the final result approaches the global optimum as closely as possible.
If the search algorithm fails to find the global optimum, the system is said to make a {\small\bfnew{search error}}\index{Search Error}(Search Error). If inaccurate model scores prevent the best translation from being ranked first, the system is said to make a {\small\bfnew{model error}}\index{Modeling Error}(Modeling Error). Model errors are determined by modeling and training, whereas search errors are generally determined by the search algorithm. In early machine translation research, search errors were one of the main sources of translation errors, but as the technology matured, researchers found that the errors of machine translation systems are concentrated more on the modeling side\cite{DBLP:conf/emnlp/StahlbergB19}. In the neural machine translation era in particular, the vast majority of research addresses model errors.
\parinterval Of course, this does not mean search is unimportant. On the contrary, in many application scenarios the efficiency of the search algorithm is decisive, for example in simultaneous machine translation\cite{DBLP:journals/corr/abs-1810-08398}, and a trade-off between translation accuracy and speed often has to be found. The accuracy and efficiency of neural machine translation inference are discussed below.
......@@ -746,7 +746,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
\vspace{0.3em}
\end{itemize}
\parinterval The algorithm above is essentially the same as the left-to-right translation algorithm of statistical machine translation (see Chapter 4): at each target-language position, the next target word is generated from the translation produced so far and the source-language information. The process can be implemented with two modules\cite{DBLP:conf/emnlp/StahlbergHSB17}:
\begin{itemize}
\vspace{0.3em}
......@@ -780,12 +780,12 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
\parinterval However, simply translating right-to-left does not give better results; in most cases right-to-left inference produces worse translations than left-to-right inference. The two are therefore usually combined rather than one replacing the other, following one of two approaches:
\begin{itemize}
\item {\small\bfnew{Re-ranking}}\index{Re-ranking}(Re-ranking). A base model (e.g., a left-to-right model) produces an $n$-best list for each source sentence, and the list is then re-ranked using both the base model score and a right-to-left model\cite{DBLP:conf/wmt/SennrichHB16,DBLP:conf/wmt/LiLXLLLWZXWFCLL19}. Since this does not change the translation process of the base model, it is relatively ``safe'' and has no side effects on system performance. For RNN-based translation systems in particular, re-ranking with a right-to-left model often works well; a small rescoring sketch is given after this list.
\item {\small\bfnew{Bidirectional inference}}\index{Bidirectional Inference}(Bidirectional Inference). Alternatively, the left-to-right and right-to-left models can run synchronously, so that the context on both sides of the current target position is considered\cite{DBLP:conf/aaai/ZhangSQLJW18}. For example, attention can be computed over the translations generated on the left and on the right to predict the word at the current position. This exploits the advantages of bidirectional translation more fully, but it requires changes to both training and inference and therefore introduces extra development and debugging work.
\end{itemize}
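\parinterval The re-ranking variant is easy to sketch. In the fragment below, the right-to-left scoring function and the interpolation weight are assumptions; in practice the weight is tuned on a development set.
\begin{verbatim}
# Re-rank an n-best list [(hypothesis, l2r_score), ...] with a
# right-to-left model. score_r2l is assumed to return a sentence-level
# log-probability for the reversed hypothesis.
def rerank(nbest, score_r2l, weight=0.5):
    rescored = []
    for hyp, l2r_score in nbest:
        r2l_score = score_r2l(list(reversed(hyp)))
        rescored.append((hyp, (1 - weight) * l2r_score + weight * r2l_score))
    return max(rescored, key=lambda x: x[1])[0]
\end{verbatim}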
\parinterval Whether translation proceeds left-to-right or right-to-left, it is essentially modeling contextual information. Recently, pre-trained language models such as BERT have shown that both the ``history'' and the ``future'' of a word help generate the current word\cite{DBLP:conf/naacl/DevlinCLT19}. Similar observations have been confirmed in the design of neural machine translation encoders: RNN-based models often encode the source sentence in both directions, and the Transformer encoder uses the whole sentence to represent every source position. Adopting a similar strategy on the decoder side of neural machine translation is therefore well motivated.
%%%%%%%%%%%%%%%%%%
\subsubsection{Speeding up Inference}
......@@ -899,7 +899,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
\item Half-precision computation. Half-precision arithmetic has become popular with the development of GPUs in recent years. In short, half-precision values need less storage than single-precision ones and cover a smaller range of floating-point numbers. Practice has shown, however, that many operations in neural machine translation meet their accuracy requirements with half precision, so using it directly can greatly accelerate training and inference with very little impact on translation quality. Note that in distributed training, since the parameter server accumulates gradients from multiple compute nodes, the parameters themselves are still stored in single precision so that repeated accumulation does not lose precision.
\vspace{0.3em}
\item Integer computation. Integer arithmetic is much ``lighter'' than floating-point arithmetic: in chip area, energy consumption, and clock cycles per operation, integers have clear advantages, so integer computation is a promising means of acceleration. However, integer representations differ greatly from floating-point ones; a basic problem is that integers are discrete and cannot accurately represent the very small fractional values of floating-point numbers. One solution is a ``quantization + dequantization + scaling'' strategy that makes integer arithmetic approximate floating-point arithmetic \cite{DBLP:journals/corr/abs-1906-00532}\cite{DBLP:conf/cvpr/JacobKCZTHAK18}\cite{DBLP:journals/corr/abs-1910-10485}. ``Quantization'' discretizes a floating-point number into an integer, and ``dequantization'' is the inverse process. Since floating-point values may exceed the integer range, a scaling factor is introduced: before quantization, the floating-point values are scaled into the range the integers can represent, and before dequantization they are scaled back to the original floating-point range. In theory this brings good speedups, but quantization and dequantization themselves take time and behave very differently on different processors, so the actual speedup of a given implementation has to be measured experimentally; a small numeric sketch of the idea is given after this list.
\vspace{0.3em}
\item Low-precision integer computation. Using even lower-precision integers is a further means of acceleration. For example, 16-bit, 8-bit, or even 4-bit integers can in theory all bring speedups (Table~\ref{tab:Comparison-of-occupied-area-and-computing-speed}). However, not all processors support low-precision integer operations; developing such systems generally requires support from the hardware and from special low-precision integer libraries, and most of the related computation is implemented on CPUs, which limits its applicability.
......@@ -921,7 +921,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
\vspace{0.3em}
\end{itemize}
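\parinterval The numerics of the ``quantization + dequantization + scaling'' strategy can be illustrated with 8-bit integers as follows. This sketch only shows the round trip for a small vector; real systems fold the scaling factors into the matrix-multiplication kernels rather than dequantizing explicitly.
\begin{verbatim}
import numpy as np

# Quantize floats to int8 with a per-tensor scale, then recover them.
def quantize_int8(x):
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.array([0.02, -1.3, 0.7], dtype=np.float32)
q, s = quantize_int8(x)
print(dequantize(q, s))   # close to x, up to quantization error
\end{verbatim}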
\parinterval In fact, another benefit of low-precision representations is a smaller model storage footprint. For example, if a machine translation model is to be packaged and shipped as part of a piece of software, its parameters can be stored in low precision and restored to the original precision when the model is loaded. Notably, an extreme case of discretized (e.g., integer) parameter representations is the {\small\bfnew{binarized neural network}}\index{Binarized Neural Networks}(Binarized Neural Networks)\cite{DBLP:conf/nips/HubaraCSEB16}, in which every parameter is represented only by $-1$ or $+1$. Binarization can be seen as an extreme form of quantization, but such methods have not yet been validated at scale in machine translation.
%%%%%%%%%%%%%%%%%%
\vspace{0.5em}
......@@ -959,7 +959,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
\vspace{0.3em}
\item More pruning. A smaller beam width can be used for the search, or a greedy search can be used directly, i.e., only the single best result is kept at each step. This requires essentially no code changes and is well suited to quickly validating and debugging a system.
\vspace{0.3em}
\item New search termination conditions. One factor affecting inference time is the termination condition of the search. To make sure a high-quality translation is found, inference systems often ``look at'' more candidates than necessary, for example by generating translations within some target-length range; this range is set empirically and may not even contain the length of the best translation. Ideally, the search should stop as soon as the best result has appeared, so more reasonable termination conditions can be designed to avoid unnecessary computation\cite{DBLP:conf/emnlp/HuangZM17}. For instance, the gap between the score of the current best hypothesis and the scores of the other hypotheses can be monitored, and the search terminated early once the gap becomes large enough; a sketch of such a stopping check is given after this list.
\vspace{0.3em}
\end{itemize}
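\parinterval A stopping check in this spirit is sketched below. It assumes that hypothesis scores are sums of log-probabilities (and therefore can only decrease as more tokens are appended) and that no length normalization is applied; the margin parameter is an assumption for illustration.
\begin{verbatim}
# Stop the beam search once the best finished hypothesis can no longer
# be beaten by any unfinished hypothesis (scores are log-probabilities).
def should_stop(best_finished_score, unfinished_scores, margin=0.0):
    if not unfinished_scores:
        return True
    if best_finished_score is None:
        return False
    return best_finished_score >= max(unfinished_scores) + margin
\end{verbatim}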
......@@ -969,7 +969,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
\vspace{0.3em}
\item Scenarios that require large amounts of self-generated data, for example when machine translation is used to produce large quantities of pseudo data. In unsupervised neural machine translation training, data generation is also very frequent.
\vspace{0.3em}
\item Interactive translation. One application scenario of machine translation is interactive machine translation\cite{Domingo2017Segment}\cite{Alvaro2017Interactive}\cite{DBLP:conf/emnlp/NepveuLLF04}, where the system adapts in real time to the user's behavior; here the latency of machine translation directly affects the user experience.
\vspace{0.3em}
\item Internet machine translation services and products. Keeping translation latency low under heavy concurrent load must also be considered when developing such applications.
\vspace{0.3em}
......@@ -1310,7 +1310,7 @@ $g_l$会作为输入的一部分送入第$l+1$层。其网络的结构图\ref{fi
%%%%%%%%%%%%%%%%%%
\subsubsection{Grouped Dense Connections}
\parinterval Many researchers have observed that dense connections between the layers of a deep network markedly improve the efficiency of information flow\cite{WangLearning}\cite{DBLP:conf/cvpr/HuangLMW17}\cite{DBLP:conf/emnlp/DouTWSZ18}\cite{DBLP:conf/acl/WuWXTGQLL19}. Repeatedly reusing the information of earlier layers also helps obtain better representations, but it makes the computation considerably more expensive. Because the dynamic linear combination of layers (DLCL) method recomputes, at every aggregation step, the contribution of every previous layer's representation to the current layer's input, this cost becomes non-negligible as the encoder grows deeper. For example, a 48-layer Transformer with dynamic layer aggregation trains almost 1.9 times slower than one without it. Caching the intermediate results also increases memory usage: even with FP16 computation, no more than 2048 tokens can be processed on a 12GB GPU, which sharply increases the training cost.
%----------------------------------------------
% 图7.5.4
......@@ -1657,7 +1657,7 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\textbf{y}} | \textbf{x})
%%%%%%%%%%%%%%%%%%
\subsubsection{Supervised Dual Learning}
\parinterval Besides modeling translation with the conditional probability $\textrm{P}(\mathbf t|\mathbf s)$, the joint distribution $\textrm{P}(\mathbf s,\mathbf t)$ can also be used\cite{DBLP:conf/icml/XiaQCBYL17}. By the definition of conditional probability:
\begin{eqnarray}
\textrm{P}(\mathbf s,\mathbf t) &=& \textrm{P}(\mathbf s)\textrm{P}(\mathbf t|\mathbf s) \nonumber \\
&=& \textrm{P}(\mathbf t)\textrm{P}(\mathbf s|\mathbf t)
......@@ -1722,21 +1722,21 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\textbf{y}} | \textbf{x})
\begin{itemize}
\item Unsupervised machine translation. Because it can train translation models without any bilingual corpus, unsupervised machine translation has great potential for low-resource scenarios and has attracted wide attention. Two main paradigms exist at present: the first learns word translations (a dictionary) first, then a phrase table and a corresponding statistical machine translation system, and finally uses that system to generate pseudo parallel data for training a neural system\cite{DBLP:conf/acl/ArtetxeLA19}; the second pre-trains language models to initialize the encoder and decoder of the neural system and then trains it with back-translation and denoising autoencoders\cite{lample2019cross}. Although unsupervised machine translation has made great progress on resource-rich languages, it is still far from practical use. Current unsupervised systems rely on large amounts of monolingual data, while genuinely low-resource languages lack not only bilingual but also monolingual data; these systems also fail to reach acceptable quality on distant pairs such as Chinese-English, whose scripts barely overlap and which require long-range reordering; and training on large amounts of monolingual data raises the problem of domain mismatch\cite{DBLP:journals/corr/abs-2004-05516}. Designing unsupervised methods, or even new paradigms, that are more robust and use monolingual data more efficiently will be a future trend.
\vspace{0.5em}
\item Modeling richer context. Because of the inherent ambiguity of human language, conventional sentence-level neural machine translation may produce ambiguous translations. Some work therefore tries to bring more context into the translation process, such as multimodal translation, tree-based translation, and document-level translation. The goal of multimodal translation is to generate a target-language description given an image and its source-language description; a common approach extracts image features with an additional encoder\cite{DBLP:journals/corr/ElliottFH15,DBLP:conf/acl/HitschlerSR16} and fuses them into the system through gating, attention networks, and similar mechanisms\cite{DBLP:conf/wmt/HuangLSOD16}.
\parinterval Tree-based translation introduces syntactic constituency or dependency trees into the translation model, thereby bringing in more syntactic information. A common approach is to serialize the syntax tree so that the sequence-to-sequence model structure is preserved\cite{DBLP:conf/emnlp/CurreyH18,DBLP:conf/acl/SaundersSGB18,DBLP:conf/wmt/NadejdeRSDJKB17}; on this basis, some work incorporates richer parsing results\cite{DBLP:conf/acl/SumitaUZTM18,DBLP:conf/coling/ZaremoodiH18}. Other work represents tree structures directly with networks such as Tree-LSTMs\cite{DBLP:conf/acl/TaiSM15,DBLP:conf/iclr/ShenTSC19} and applies them to neural machine translation models\cite{DBLP:conf/acl/EriguchiHT16,Yang2017TowardsBH,DBLP:conf/acl/ChenHCC17}.
\parinterval Document-level translation introduces document-level context to handle ambiguities that arise when translating documents, such as incoherent translations and subject-verb disagreement. Existing improvements fall into two categories: one concatenates the current sentence with its context at the sentence level without changing the model structure\cite{DBLP:conf/discomt/TiedemannS17}; the other uses an additional encoder to capture document information\cite{DBLP:journals/corr/JeanLFC17,DBLP:journals/corr/abs-1805-10163,DBLP:conf/emnlp/ZhangLSZXZL18}. Besides conventional RNNs and self-attention networks, the encoder may use hierarchical attention to encode several preceding sentences\cite{Werlen2018DocumentLevelNM,tan-etal-2019-hierarchical}, selective sparse attention to model the whole document\cite{DBLP:conf/naacl/MarufMH19}, or memory networks and caches to extract key words from the document\cite{DBLP:conf/coling/KuangXLZ18,DBLP:journals/tacl/TuLSZ18}; two-pass decoding has also been used\cite{DBLP:conf/aaai/XiongH0W19,DBLP:conf/acl/VoitaST19}. In addition to modeling context directly, some work corrects the output of sentence-level models with document-level repair models\cite{DBLP:conf/emnlp/VoitaST19} or language models\cite{DBLP:journals/corr/abs-1910-00553}, or maintains translation coherence during decoding through self-learning\cite{DBLP:journals/corr/abs-2003-05259}.
\vspace{0.5em}
\item Speech translation. Speech translation is also in great demand in daily life. For speech-to-text translation, the simplest approach is to convert the speech into text with automatic speech recognition (ASR) and then feed the text into a translation model\cite{DBLP:conf/icassp/Ney99,DBLP:conf/interspeech/MatusovKN05}. To avoid error propagation and the high latency of this pipeline, end-to-end modeling is now commonly adopted\cite{DBLP:conf/naacl/DuongACBC16,DBLP:journals/corr/BerardPSB16}. To cope with the scarcity of speech translation data, various mitigation techniques have been explored, including pre-training\cite{DBLP:conf/naacl/BansalKLLG19}, multi-task learning\cite{DBLP:conf/naacl/DuongACBC16,DBLP:conf/icassp/BerardBKP18}, curriculum learning\cite{DBLP:conf/interspeech/KanoS017}, attention transfer\cite{DBLP:journals/tacl/SperberNNW19}, and knowledge distillation\cite{DBLP:conf/interspeech/LiuXZHWWZ19,DBLP:conf/icassp/JiaJMWCCALW19}.
\vspace{0.5em}
\item Multilingual translation. A trained neural machine translation model usually translates one fixed source language into one fixed target language, but given the thousands of languages in the world, training a separate model for every language pair is extremely resource-intensive. Compared with single-pair systems, multilingual neural machine translation has the potential to exploit similarities across language pairs and can save a great deal of training cost\cite{DBLP:journals/tacl/JohnsonSLKWCTVW17}.
\parinterval Multilingual neural machine translation aims to train a single model that covers translation between multiple languages. Such systems can be categorized by which components they share across language pairs. A common approach shares the whole network (encoder and decoder) while specifying the source and target languages with language tags\cite{DBLP:journals/corr/HaNW16,DBLP:journals/corr/abs-1711-07893}. Alternatively, a shared encoder can be combined with a separate decoder per target language for one-to-many translation\cite{DBLP:conf/naacl/FiratCB16}. Other methods use separate encoders and decoders for every source and target language but share some components\cite{DBLP:journals/corr/LuongLSVK15,DBLP:conf/naacl/FiratCB16}, for example, the attention mechanism\cite{DBLP:journals/corr/LuongLSVK15,DBLP:conf/naacl/FiratCB16}. Multilingual neural machine translation not only reduces the cost of training a separate model for every language pair, but also effectively helps with low-resource neural machine translation\cite{DBLP:journals/tacl/JohnsonSLKWCTVW17} and multi-source neural machine translation\cite{Och01statisticalmulti-source}.
\vspace{0.5em}
\item Architecture search. Besides network architectures designed by hand, {\small\bfnew{neural architecture search}}\index{Neural Architecture Search; NAS}(Neural Architecture Search; NAS) has in recent years attracted growing attention in natural language processing tasks, including machine translation\cite{elsken2019neural}. Unlike the RNN- and Transformer-based translation models discussed earlier, architecture search aims to learn automatically, from the training data provided, the network structure best suited to the task at hand. This effectively ``frees'' researchers from the role of architecture designer and lets computers learn model structures the way they learn model parameters. Architecture search has already shown promise in NLP, achieving excellent results in language modeling, named entity recognition, and other tasks\cite{DBLP:conf/iclr/ZophL17,DBLP:conf/emnlp/JiangHXZZ19,liyinqiaoESS}. For machine translation, however, the complexity of the task makes the search space very large and hard to search directly, so researchers tend instead to refine architectures designed from existing experience. The Google Brain team proposed, in The Evolved Transformer, to evolve the Transformer architecture with an evolutionary algorithm, obtaining a more efficient model with stronger modeling power. The Microsoft team proposed the NAO method in the Neural Architecture Optimization paper\cite{DBLP:conf/nips/LuoTQCL18}, which maps network structures into a continuous space and optimizes there to obtain models better than the initial structure; NAO was also used in the WMT19 machine translation evaluation and achieved excellent results on the English-Finnish and Finnish-English tasks.
\vspace{0.5em}
\item Combination with statistical machine translation. Although neural machine translation outperforms statistical machine translation in both automatic and human evaluation, it still faces problems that statistical machine translation does not have\cite{DBLP:conf/aclnmt/KoehnK17}. For instance, neural systems tend to omit content: some phrases or even clauses of the source sentence are simply not translated, whereas statistical systems translate every source phrase and then assemble the pieces, so their output does not suffer from this kind of low fidelity to the source. One remedy is to combine the two kinds of systems. Existing methods fall into two categories. The first modifies the model, e.g., by modeling statistical machine translation concepts or reusing its modules, such as word alignment and coverage, inside the neural system\cite{DBLP:conf/aaai/HeHWW16}, or by integrating the neural system into the statistical one, e.g., as an additional feature\cite{DBLP:journals/corr/GulcehreFXCBLBS15}. The second is system combination: without changing either model, the outputs of the neural and statistical systems are fused to obtain a better result, e.g., by re-ranking\cite{DBLP:conf/ijcnlp/KhayrallahKDPK17,DBLP:conf/acl/StahlbergHWB16,DBLP:conf/aclwat/NeubigMN15,DBLP:conf/naacl/GrundkiewiczJ18}, post-processing\cite{niehues-etal-2016-pre}, or using the statistical system's output as a constraint for neural decoding\cite{DBLP:conf/eacl/GispertBHS17}. Beyond this, combining neural machine translation with translation memories\cite{DBLP:conf/aaai/XiaHLS19,DBLP:conf/nlpcc/HeHLL19} is also a very interesting direction for machine translation applications.
......
......@@ -2281,7 +2281,7 @@ year ={2008},
Liang Huang and
Daniel Gildea and
Kevin Knight},
//editor = {Robert C. Moore and
Jeff A. Bilmes and
Jennifer Chu{-}Carroll and
Mark Sanderson},
......@@ -2396,23 +2396,27 @@ year ={2008},
//biburl = {https://dblp.org/rec/journals/jmlr/CollobertWBKKK11.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/naacl/DevlinCLT19,
author = {Jacob Devlin and
Ming{-}Wei Chang and
Kenton Lee and
Kristina Toutanova},
title = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
Understanding},
booktitle = {Proceedings of the 2019 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies,
{NAACL-HLT} 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long
and Short Papers)},
pages = {4171--4186},
publisher = {Association for Computational Linguistics},
year = {2019},
//url = {https://doi.org/10.18653/v1/n19-1423},
//doi = {10.18653/v1/n19-1423},
//timestamp = {Tue, 28 Jan 2020 10:30:29 +0100},
//biburl = {https://dblp.org/rec/conf/naacl/DevlinCLT19.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{duchi2011adaptive,
author = {John C. Duchi and
Elad Hazan and
......@@ -2731,8 +2735,6 @@ year ={2008},
Tao Qin and
Jianfeng Lu and
Tie{-}Yan Liu},
editor = {Kamalika Chaudhuri and
Ruslan Salakhutdinov},
title = {{MASS:} Masked Sequence to Sequence Pre-training for Language Generation},
booktitle = {Proceedings of the 36th International Conference on Machine Learning,
{ICML} 2019, 9-15 June 2019, Long Beach, California, {USA}},
......@@ -4171,29 +4173,21 @@ pages ={157-166},
pages={72-78},
year={2011},
}
%%%%%%%%%%%%%%%
@misc{provilkov2019bpedropout,
title={BPE-Dropout: Simple and Effective Subword Regularization},
author={Ivan Provilkov and Dmitrii Emelianenko and Elena Voita},
year={2019},
//eprint={1910.13267},
//archivePrefix={arXiv},
//primaryClass={cs.CL}
}
%%%%%%%%%%%%%%%%%%%
@inproceedings{DBLP:conf/acl/SennrichHB16a,
author = {Rico Sennrich and
Barry Haddow and
Alexandra Birch},
title = {Neural Machine Translation of Rare Words with Subword Units},
booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics, {ACL} 2016, August 7-12, 2016, Berlin, Germany, Volume
1: Long Papers},
publisher = {The Association for Computer Linguistics},
year = {2016},
//url = {https://doi.org/10.18653/v1/p16-1162},
//doi = {10.18653/v1/p16-1162},
//timestamp = {Tue, 28 Jan 2020 10:28:06 +0100},
//biburl = {https://dblp.org/rec/conf/acl/SennrichHB16a.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
......@@ -4225,26 +4219,36 @@ pages ={157-166},
year={1989},
}
@inproceedings{DBLP:conf/iclr/BaevskiA19,
author = {Alexei Baevski and
Michael Auli},
title = {Adaptive Input Representations for Neural Language Modeling},
booktitle = {7th International Conference on Learning Representations, {ICLR} 2019,
New Orleans, LA, USA, May 6-9, 2019},
publisher = {OpenReview.net},
year = {2019},
//url = {https://openreview.net/forum?id=ByxZX20qFQ},
//timestamp = {Thu, 25 Jul 2019 14:26:00 +0200},
//biburl = {https://dblp.org/rec/conf/iclr/BaevskiA19.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/emnlp/StahlbergB19,
author = {Felix Stahlberg and
Bill Byrne},
title = {On {NMT} Search Errors and Model Errors: Cat Got Your Tongue?},
booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing and the 9th International Joint Conference on
Natural Language Processing, {EMNLP-IJCNLP} 2019, Hong Kong, China,
November 3-7, 2019},
pages = {3354--3360},
publisher = {Association for Computational Linguistics},
year = {2019},
//url = {https://doi.org/10.18653/v1/D19-1331},
//doi = {10.18653/v1/D19-1331},
//timestamp = {Thu, 12 Dec 2019 13:23:43 +0100},
//biburl = {https://dblp.org/rec/conf/emnlp/StahlbergB19.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DBLP:journals/corr/abs-1810-08398,
......@@ -4270,37 +4274,40 @@ pages ={157-166},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/emnlp/StahlbergHSB17,
author = {Felix Stahlberg and
Eva Hasler and
Danielle Saunders and
Bill Byrne},
title = {{SGNMT} - {A} Flexible {NMT} Decoding Platform for Quick Prototyping
of New Models and Search Strategies},
booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural
Language Processing, {EMNLP} 2017, Copenhagen, Denmark, September
9-11, 2017 - System Demonstrations},
pages = {25--30},
publisher = {Association for Computational Linguistics},
year = {2017},
//url = {https://doi.org/10.18653/v1/d17-2005},
//doi = {10.18653/v1/d17-2005},
//timestamp = {Tue, 28 Jan 2020 10:28:17 +0100},
//biburl = {https://dblp.org/rec/conf/emnlp/StahlbergHSB17.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/wmt/SennrichHB16,
author = {Rico Sennrich and
Barry Haddow and
Alexandra Birch},
title = {Edinburgh Neural Machine Translation Systems for {WMT} 16},
booktitle = {Proceedings of the First Conference on Machine Translation, {WMT}
2016, colocated with {ACL} 2016, August 11-12, Berlin, Germany},
pages = {371--376},
publisher = {The Association for Computer Linguistics},
year = {2016},
//url = {https://doi.org/10.18653/v1/w16-2323},
//doi = {10.18653/v1/w16-2323},
//timestamp = {Tue, 28 Jan 2020 10:31:04 +0100},
//biburl = {https://dblp.org/rec/conf/wmt/SennrichHB16.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
......@@ -4322,23 +4329,6 @@ pages ={157-166},
Qiang Wang and
Tong Xiao and
Jingbo Zhu},
editor = {Ondrej Bojar and
Rajen Chatterjee and
Christian Federmann and
Mark Fishel and
Yvette Graham and
Barry Haddow and
Matthias Huck and
Antonio Jimeno{-}Yepes and
Philipp Koehn and
Andr{\'{e}} Martins and
Christof Monz and
Matteo Negri and
Aur{\'{e}}lie N{\'{e}}v{\'{e}}ol and
Mariana L. Neves and
Matt Post and
Marco Turchi and
Karin Verspoor},
title = {The NiuTrans Machine Translation Systems for {WMT19}},
booktitle = {Proceedings of the Fourth Conference on Machine Translation, {WMT}
2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers,
......@@ -4373,7 +4363,7 @@ pages ={157-166},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/cvpr/JacobKCZTHAK18,
author = {Benoit Jacob and
Skirmantas Kligys and
Bo Chen and
......@@ -4384,14 +4374,15 @@ pages ={157-166},
Dmitry Kalenichenko},
title = {Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only
Inference},
booktitle = {2018 {IEEE} Conference on Computer Vision and Pattern Recognition,
{CVPR} 2018, Salt Lake City, UT, USA, June 18-22, 2018},
pages = {2704--2713},
publisher = {{IEEE} Computer Society},
year = {2018},
//url = {http://openaccess.thecvf.com/content\_cvpr\_2018/html/Jacob\_Quantization\_and\_Training\_CVPR\_2018\_paper.html},
//doi = {10.1109/CVPR.2018.00286},
//timestamp = {Wed, 16 Oct 2019 14:14:50 +0200},
//biburl = {https://dblp.org/rec/conf/cvpr/JacobKCZTHAK18.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
......@@ -4411,7 +4402,7 @@ pages ={157-166},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/aaai/ZhangSQLJW18,
author = {Xiangwen Zhang and
Jinsong Su and
Yue Qin and
......@@ -4419,31 +4410,36 @@ pages ={157-166},
Rongrong Ji and
Hongji Wang},
title = {Asynchronous Bidirectional Decoding for Neural Machine Translation},
journal = {CoRR},
volume = {abs/1801.05122},
booktitle = {Proceedings of the Thirty-Second {AAAI} Conference on Artificial Intelligence,
(AAAI-18), the 30th innovative Applications of Artificial Intelligence
(IAAI-18), and the 8th {AAAI} Symposium on Educational Advances in
Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February
2-7, 2018},
pages = {5698--5705},
publisher = {{AAAI} Press},
year = {2018},
//url = {https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16784},
//timestamp = {Sun, 31 Mar 2019 12:09:17 +0200},
//biburl = {https://dblp.org/rec/conf/aaai/ZhangSQLJW18.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/emnlp/HuangZM17,
author = {Liang Huang and
Kai Zhao and
Mingbo Ma},
title = {When to Finish? Optimal Beam Search for Neural Text Generation (modulo
beam size)},
booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural
Language Processing, {EMNLP} 2017, Copenhagen, Denmark, September
9-11, 2017},
pages = {2134--2139},
publisher = {Association for Computational Linguistics},
year = {2017},
//url = {https://doi.org/10.18653/v1/d17-1227},
//doi = {10.18653/v1/d17-1227},
//timestamp = {Tue, 28 Jan 2020 10:28:22 +0100},
//biburl = {https://dblp.org/rec/conf/emnlp/HuangZM17.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
......@@ -4452,7 +4448,7 @@ pages ={157-166},
Robert E. Schapire},
title = {A Decision-Theoretic Generalization of On-Line Learning and an Application
to Boosting},
journal = {Journal of Computer and System Sciences},
volume = {55},
number = {1},
pages = {119--139},
......@@ -4469,9 +4465,6 @@ pages ={157-166},
Jingbo Zhu and
Muhua Zhu and
Huizhen Wang},
editor = {Jan Hajic and
Sandra Carberry and
Stephen Clark},
title = {Boosting-Based System Combination for Machine Translation},
booktitle = {{ACL} 2010, Proceedings of the 48th Annual Meeting of the Association
for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden},
......@@ -4509,7 +4502,7 @@ pages ={157-166},
author = {Antti{-}Veikko I. Rosti and
Spyridon Matsoukas and
Richard M. Schwartz},
//editor = {John A. Carroll and
Antal van den Bosch and
Annie Zaenen},
title = {Improved Word-Level System Combination for Machine Translation},
......@@ -4528,7 +4521,7 @@ pages ={157-166},
Bing Zhang and
Spyros Matsoukas and
Richard M. Schwartz},
//editor = {Chris Callison{-}Burch and
Philipp Koehn and
Christof Monz and
Josh Schroeder and
......@@ -4588,8 +4581,6 @@ pages ={157-166},
Rongrong Ji and
Xiaodong Shi and
Yang Liu},
editor = {Satinder P. Singh and
Shaul Markovitch},
title = {Lattice-Based Recurrent Neural Network Encoders for Neural Machine
Translation},
booktitle = {Proceedings of the Thirty-First {AAAI} Conference on Artificial Intelligence,
......@@ -4603,17 +4594,20 @@ pages ={157-166},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/acl/Bird06,
author = {Steven Bird},
//editor = {Nicoletta Calzolari and
Claire Cardie and
Pierre Isabelle},
title = {{NLTK:} The Natural Language Toolkit},
booktitle = {{ACL} 2006, 21st International Conference on Computational Linguistics
and 44th Annual Meeting of the Association for Computational Linguistics,
Proceedings of the Conference, Sydney, Australia, 17-21 July 2006},
publisher = {The Association for Computer Linguistics},
year = {2006},
//url = {https://www.aclweb.org/anthology/P06-4018/},
//timestamp = {Fri, 13 Sep 2019 13:00:43 +0200},
//biburl = {https://dblp.org/rec/conf/acl/Bird06.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
......@@ -4648,51 +4642,58 @@ pages ={157-166},
//HAL_VERSION = {v1},
}
@inproceedings{DBLP:conf/acl/JeanCMB15,
author = {S{\'{e}}bastien Jean and
KyungHyun Cho and
Roland Memisevic and
Yoshua Bengio},
title = {On Using Very Large Target Vocabulary for Neural Machine Translation},
booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational
Linguistics and the 7th International Joint Conference on Natural
Language Processing of the Asian Federation of Natural Language Processing,
{ACL} 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers},
pages = {1--10},
publisher = {The Association for Computer Linguistics},
year = {2015},
//url = {https://doi.org/10.3115/v1/p15-1001},
//doi = {10.3115/v1/p15-1001},
//timestamp = {Tue, 28 Jan 2020 10:27:50 +0100},
//biburl = {https://dblp.org/rec/conf/acl/JeanCMB15.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/acl/Kudo18,
author = {Taku Kudo},
title = {Subword Regularization: Improving Neural Network Translation Models
with Multiple Subword Candidates},
booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational
Linguistics, {ACL} 2018, Melbourne, Australia, July 15-20, 2018, Volume
1: Long Papers},
pages = {66--75},
publisher = {Association for Computational Linguistics},
year = {2018},
//url = {https://www.aclweb.org/anthology/P18-1007/},
//doi = {10.18653/v1/P18-1007},
//timestamp = {Mon, 16 Sep 2019 13:46:41 +0200},
//biburl = {https://dblp.org/rec/conf/acl/Kudo18.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/bmvc/ZagoruykoK16,
author = {Sergey Zagoruyko and
Nikos Komodakis},
//editor = {Richard C. Wilson and
Edwin R. Hancock and
William A. P. Smith},
title = {Wide Residual Networks},
booktitle = {Proceedings of the British Machine Vision Conference 2016, {BMVC}
2016, York, UK, September 19-22, 2016},
publisher = {{BMVA} Press},
year = {2016},
//url = {http://www.bmva.org/bmvc/2016/papers/paper087/index.html},
//timestamp = {Thu, 07 Jun 2018 10:06:28 +0200},
//biburl = {https://dblp.org/rec/conf/bmvc/ZagoruykoK16.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DBLP:journals/iet-bmt/Sepas-Moghaddam20,
......@@ -4700,7 +4701,7 @@ pages ={157-166},
Fernando Pereira and
Paulo Lobato Correia},
title = {Face recognition: a novel multi-level taxonomy based survey},
journal = {{IET} Biometrics},
volume = {9},
number = {2},
pages = {58--67},
......@@ -4731,7 +4732,7 @@ pages ={157-166},
author = {Ganesh Jawahar and
Beno{\^{\i}}t Sagot and
Djam{\'{e}} Seddah},
//editor = {Anna Korhonen and
David R. Traum and
Llu{\'{\i}}s M{\`{a}}rquez},
title = {What Does {BERT} Learn about the Structure of Language?},
......@@ -4748,20 +4749,21 @@ pages ={157-166},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/wmt/OttEGA18,
author = {Myle Ott and
Sergey Edunov and
David Grangier and
Michael Auli},
title = {Scaling Neural Machine Translation},
booktitle = {Proceedings of the Third Conference on Machine Translation: Research
Papers, {WMT} 2018, Belgium, Brussels, October 31 - November 1, 2018},
pages = {1--9},
publisher = {Association for Computational Linguistics},
year = {2018},
//url = {https://doi.org/10.18653/v1/w18-6301},
//doi = {10.18653/v1/w18-6301},
//timestamp = {Tue, 28 Jan 2020 10:31:02 +0100},
//biburl = {https://dblp.org/rec/conf/wmt/OttEGA18.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
......@@ -4831,10 +4833,6 @@ pages ={157-166},
Orhan Firat and
Yuan Cao and
Yonghui Wu},
editor = {Ellen Riloff and
David Chiang and
Julia Hockenmaier and
Jun'ichi Tsujii},
title = {Training Deeper Neural Machine Translation Models with Transparent
Attention},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural
......@@ -4853,10 +4851,6 @@ pages ={157-166},
author = {Biao Zhang and
Ivan Titov and
Rico Sennrich},
editor = {Kentaro Inui and
Jing Jiang and
Vincent Ng and
Xiaojun Wan},
title = {Improving Deep Transformer with Depth-Scaled Initialization and Merged
Attention},
booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural
......@@ -4878,10 +4872,6 @@ pages ={157-166},
Xiangyu Zhang and
Shaoqing Ren and
Jian Sun},
editor = {Bastian Leibe and
Jiri Matas and
Nicu Sebe and
Max Welling},
title = {Identity Mappings in Deep Residual Networks},
booktitle = {Computer Vision - {ECCV} 2016 - 14th European Conference, Amsterdam,
The Netherlands, October 11-14, 2016, Proceedings, Part {IV}},
......@@ -4906,9 +4896,6 @@ pages ={157-166},
Tao Qin and
Jianhuang Lai and
Tie{-}Yan Liu},
editor = {Anna Korhonen and
David R. Traum and
Llu{\'{\i}}s M{\`{a}}rquez},
title = {Depth Growing for Neural Machine Translation},
booktitle = {Proceedings of the 57th Conference of the Association for Computational
Linguistics, {ACL} 2019, Florence, Italy, July 28- August 2, 2019,
......@@ -4923,37 +4910,40 @@ pages ={157-166},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/cvpr/HuangLMW17,
author = {Gao Huang and
Zhuang Liu and
Laurens van der Maaten and
Kilian Q. Weinberger},
title = {Densely Connected Convolutional Networks},
booktitle = {2017 {IEEE} Conference on Computer Vision and Pattern Recognition,
{CVPR} 2017, Honolulu, HI, USA, July 21-26, 2017},
pages = {2261--2269},
publisher = {{IEEE} Computer Society},
year = {2017},
//url = {https://doi.org/10.1109/CVPR.2017.243},
//doi = {10.1109/CVPR.2017.243},
//timestamp = {Wed, 16 Oct 2019 14:14:50 +0200},
//biburl = {https://dblp.org/rec/conf/cvpr/HuangLMW17.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/emnlp/DouTWSZ18,
author = {Zi{-}Yi Dou and
Zhaopeng Tu and
Xing Wang and
Shuming Shi and
Tong Zhang},
title = {Exploiting Deep Representations for Neural Machine Translation},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing, Brussels, Belgium, October 31 - November 4, 2018},
pages = {4253--4262},
publisher = {Association for Computational Linguistics},
year = {2018},
//url = {https://doi.org/10.18653/v1/d18-1457},
//doi = {10.18653/v1/d18-1457},
//timestamp = {Tue, 28 Jan 2020 10:28:31 +0100},
//biburl = {https://dblp.org/rec/conf/emnlp/DouTWSZ18.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
......@@ -4973,22 +4963,26 @@ pages ={157-166},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/icml/XiaQCBYL17,
author = {Yingce Xia and
Tao Qin and
Wei Chen and
Jiang Bian and
Nenghai Yu and
Tie{-}Yan Liu},
//editor = {Doina Precup and
Yee Whye Teh},
title = {Dual Supervised Learning},
booktitle = {Proceedings of the 34th International Conference on Machine Learning,
{ICML} 2017, Sydney, NSW, Australia, 6-11 August 2017},
series = {Proceedings of Machine Learning Research},
volume = {70},
pages = {3789--3798},
publisher = {{PMLR}},
year = {2017},
//url = {http://proceedings.mlr.press/v70/xia17a.html},
//timestamp = {Tue, 03 Sep 2019 16:31:10 +0200},
//biburl = {https://dblp.org/rec/conf/icml/XiaQCBYL17.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
......@@ -5000,11 +4994,6 @@ pages ={157-166},
Nenghai Yu and
Tie{-}Yan Liu and
Wei{-}Ying Ma},
editor = {Daniel D. Lee and
Masashi Sugiyama and
Ulrike von Luxburg and
Isabelle Guyon and
Roman Garnett},
title = {Dual Learning for Machine Translation},
booktitle = {Advances in Neural Information Processing Systems 29: Annual Conference
on Neural Information Processing Systems 2016, December 5-10, 2016,
......@@ -5022,9 +5011,6 @@ pages ={157-166},
David A. McAllester and
Satinder P. Singh and
Yishay Mansour},
editor = {Sara A. Solla and
Todd K. Leen and
Klaus{-}Robert M{\"{u}}ller},
title = {Policy Gradient Methods for Reinforcement Learning with Function Approximation},
booktitle = {Advances in Neural Information Processing Systems 12, {[NIPS} Conference,
Denver, Colorado, USA, November 29 - December 4, 1999]},
......@@ -5063,16 +5049,6 @@ pages ={157-166},
author = {Anna Currey and
Antonio Valerio Miceli Barone and
Kenneth Heafield},
editor = {Ondrej Bojar and
Christian Buck and
Rajen Chatterjee and
Christian Federmann and
Yvette Graham and
Barry Haddow and
Matthias Huck and
Antonio Jimeno{-}Yepes and
Philipp Koehn and
Julia Kreutzer},
title = {Copied Monolingual Data Improves Low-Resource Neural Machine Translation},
booktitle = {Proceedings of the Second Conference on Machine Translation, {WMT}
2017, Copenhagen, Denmark, September 7-8, 2017},
......@@ -5108,10 +5084,6 @@ pages ={157-166},
Myle Ott and
Michael Auli and
David Grangier},
editor = {Ellen Riloff and
David Chiang and
Julia Hockenmaier and
Jun'ichi Tsujii},
title = {Understanding Back-Translation at Scale},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing, Brussels, Belgium, October 31 - November 4, 2018},
......@@ -5128,9 +5100,6 @@ pages ={157-166},
@inproceedings{DBLP:conf/emnlp/DomhanH17,
author = {Tobias Domhan and
Felix Hieber},
editor = {Martha Palmer and
Rebecca Hwa and
Sebastian Riedel},
title = {Using Target-side Monolingual Data for Neural Machine Translation
through Multi-task Learning},
booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural
......@@ -5184,7 +5153,7 @@ pages ={157-166},
@inproceedings{DBLP:conf/emnlp/KimR16,
author = {Yoon Kim and
Alexander M. Rush},
//editor = {Jian Su and
Xavier Carreras and
Kevin Duh},
title = {Sequence-Level Knowledge Distillation},
......@@ -5301,38 +5270,62 @@ pages ={157-166},
year= {1994}
}
@inproceedings{DBLP:conf/icml/BengioLCW09,
author = {Yoshua Bengio and
J{\'{e}}r{\^{o}}me Louradour and
Ronan Collobert and
Jason Weston},
//editor = {Andrea Pohoreckyj Danyluk and
L{\'{e}}on Bottou and
Michael L. Littman},
title = {Curriculum learning},
booktitle = {Proceedings of the 26th Annual International Conference on Machine
Learning, {ICML} 2009, Montreal, Quebec, Canada, June 14-18, 2009},
series = {{ACM} International Conference Proceeding Series},
volume = {382},
pages = {41--48},
publisher = {{ACM}},
year = {2009},
//url = {https://doi.org/10.1145/1553374.1553380},
//doi = {10.1145/1553374.1553380},
//timestamp = {Wed, 14 Nov 2018 10:58:56 +0100},
//biburl = {https://dblp.org/rec/conf/icml/BengioLCW09.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{deeplearning,
title={deep learning},
author={Yann LeCun and
Yoshua Bengio and
Geoffrey Hinton},
journal={nature},
volume={521},
number={7553},
pages={436--444},
year={2015},
publisher={Nature Publishing Group}
}
@inproceedings{DBLP:conf/nips/HubaraCSEB16,
author = {Itay Hubara and
Matthieu Courbariaux and
Daniel Soudry and
Ran El{-}Yaniv and
Yoshua Bengio},
title = {Binarized Neural Networks},
booktitle = {Advances in Neural Information Processing Systems 29: Annual Conference
on Neural Information Processing Systems 2016, December 5-10, 2016,
Barcelona, Spain},
pages = {4107--4115},
year = {2016},
//url = {http://papers.nips.cc/paper/6573-binarized-neural-networks},
//timestamp = {Fri, 06 Mar 2020 17:00:15 +0100},
//biburl = {https://dblp.org/rec/conf/nips/HubaraCSEB16.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DBLP:journals/nature/LeCunBH15,
author = {Yann LeCun and
Yoshua Bengio and
Geoffrey E. Hinton},
title = {Deep learning},
journal = {Nature},
volume = {521},
number = {7553},
pages = {436--444},
year = {2015},
//url = {https://doi.org/10.1038/nature14539},
//doi = {10.1038/nature14539},
//timestamp = {Wed, 14 Nov 2018 10:30:42 +0100},
//biburl = {https://dblp.org/rec/journals/nature/LeCunBH15.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
%%%%%%%%%%%%%%%%%%7.6%%%%%%%%%%%%%%%%
@inproceedings{DBLP:conf/acl/ArtetxeLA19,
......@@ -5386,11 +5379,23 @@ pages ={157-166},
//biburl = {https://dblp.org/rec/conf/acl/HitschlerSR16.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{elliott2015multilingual,
title={Multilingual Image Description with Neural Sequence Models},
author={Elliott, Desmond and Frank, Stella and Hasler, Eva},
journal={arXiv: Computation and Language},
year={2015}}
@article{DBLP:journals/corr/ElliottFH15,
author = {Desmond Elliott and
Stella Frank and
Eva Hasler},
title = {Multi-Language Image Description with Neural Sequence Models},
journal = {CoRR},
volume = {abs/1510.04709},
year = {2015},
//url = {http://arxiv.org/abs/1510.04709},
archivePrefix = {arXiv},
eprint = {1510.04709},
//timestamp = {Mon, 13 Aug 2018 16:46:09 +0200},
//biburl = {https://dblp.org/rec/journals/corr/ElliottFH15.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/wmt/HuangLSOD16,
author = {Po{-}Yao Huang and
Frederick Liu and
......@@ -5713,7 +5718,7 @@ year={2015}}
Shuming Shi and
Tong Zhang},
title = {Learning to Remember Translation History with a Continuous Cache},
journal = {Trans. Assoc. Comput. Linguistics},
journal = {Transactions of the Association for Computational Linguistics},
volume = {6},
pages = {407--420},
year = {2018},
......@@ -5930,22 +5935,24 @@ year={2015}}
//biburl = {https://dblp.org/rec/conf/icassp/BerardBKP18.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DBLP:journals/corr/abs-1802-06003,
@inproceedings{DBLP:conf/interspeech/KanoS017,
author = {Takatomo Kano and
Sakriani Sakti and
Satoshi Nakamura},
title = {Structured-based Curriculum Learning for End-to-end English-Japanese
title = {Structured-Based Curriculum Learning for End-to-End English-Japanese
Speech Translation},
journal = {CoRR},
volume = {abs/1802.06003},
year = {2018},
//url = {http://arxiv.org/abs/1802.06003},
//archivePrefix = {arXiv},
//eprint = {1802.06003},
//timestamp = {Mon, 13 Aug 2018 16:47:19 +0200},
//biburl = {https://dblp.org/rec/journals/corr/abs-1802-06003.bib},
booktitle = {Interspeech 2017, 18th Annual Conference of the International Speech
Communication Association, Stockholm, Sweden, August 20-24, 2017},
pages = {2630--2634},
publisher = {{ISCA}},
year = {2017},
//url = {http://www.isca-speech.org/archive/Interspeech\_2017/abstracts/0944.html},
//timestamp = {Mon, 15 Jul 2019 08:29:02 +0200},
//biburl = {https://dblp.org/rec/conf/interspeech/KanoS017.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DBLP:journals/tacl/SperberNNW19,
author = {Matthias Sperber and
Graham Neubig and
......@@ -5953,7 +5960,7 @@ year={2015}}
Alex Waibel},
title = {Attention-Passing Models for Robust and Data-Efficient End-to-End
Speech Translation},
journal = {Trans. Assoc. Comput. Linguistics},
journal = {Transactions of the Association for Computational Linguistics},
volume = {7},
pages = {313--325},
year = {2019},
......@@ -6022,7 +6029,7 @@ year={2015}}
Jeffrey Dean},
title = {Google's Multilingual Neural Machine Translation System: Enabling
Zero-Shot Translation},
journal = {Trans. Assoc. Comput. Linguistics},
journal = {Transactions of the Association for Computational Linguistics},
volume = {5},
pages = {339--351},
year = {2017},
......@@ -6083,31 +6090,32 @@ year={2015}}
//biburl = {https://dblp.org/rec/conf/naacl/FiratCB16.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{luong2015multi-task,
title={Multi-task Sequence to Sequence Learning},
author={Luong, Minh-Thang and Le, Quoc V. and Sutskever, Ilya and Vinyals, Oriol and Kaiser, Lukasz},
journal={arXiv: Learning},
year={2015}}
@article{Och2001Statistical,
title={Statistical multi-source translation},
author={Och, Franz Josef and Ney, Hermann},
journal={Mt Summit},
year={2001},
}
@article{DBLP:journals/jmlr/ElskenMH19,
author = {Thomas Elsken and
Jan Hendrik Metzen and
Frank Hutter},
title = {Neural Architecture Search: {A} Survey},
journal = {J. Mach. Learn. Res.},
volume = {20},
pages = {55:1--55:21},
year = {2019},
//url = {http://jmlr.org/papers/v20/18-598.html},
//timestamp = {Wed, 10 Jul 2019 15:28:24 +0200},
//biburl = {https://dblp.org/rec/journals/jmlr/ElskenMH19.bib},
}
@inproceedings{DBLP:journals/corr/LuongLSVK15,
author = {Minh{-}Thang Luong and
Quoc V. Le and
Ilya Sutskever and
Oriol Vinyals and
Lukasz Kaiser},
title = {Multi-task Sequence to Sequence Learning},
booktitle = {4th International Conference on Learning Representations, {ICLR} 2016,
San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings},
year = {2016},
//url = {http://arxiv.org/abs/1511.06114},
//timestamp = {Thu, 25 Jul 2019 14:25:37 +0200},
//biburl = {https://dblp.org/rec/journals/corr/LuongLSVK15.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{elsken2019neural,
title={Neural Architecture Search: A Survey},
author={Elsken, Thomas and Metzen, Jan Hendrik and Hutter, Frank},
journal={Journal of Machine Learning Research},
volume={20},
number={55},
pages={1--21},
year={2019}}
@inproceedings{DBLP:conf/iclr/ZophL17,
author = {Barret Zoph and
Quoc V. Le},
......@@ -6161,11 +6169,25 @@ for Language Modeling},
publisher = {Association for Computational Linguistics},
year = {2020},
}
@article{Luo2018Neural,
title={Neural Architecture Optimization},
author={Luo, Renqian and Tian, Fei and Qin, Tao and Liu, Tie-Yan},
year={2018},
}
@inproceedings{DBLP:conf/nips/LuoTQCL18,
author = {Renqian Luo and
Fei Tian and
Tao Qin and
Enhong Chen and
Tie{-}Yan Liu},
title = {Neural Architecture Optimization},
booktitle = {Advances in Neural Information Processing Systems 31: Annual Conference
on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December
2018, Montr{\'{e}}al, Canada},
pages = {7827--7838},
year = {2018},
//url = {http://papers.nips.cc/paper/8007-neural-architecture-optimization},
//timestamp = {Fri, 06 Mar 2020 17:00:31 +0100},
//biburl = {https://dblp.org/rec/conf/nips/LuoTQCL18.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/aclnmt/KoehnK17,
author = {Philipp Koehn and
Rebecca Knowles},
......@@ -6340,11 +6362,24 @@ year = {2020},
pages={201-220},
year={2017},
}
@inproceedings{Nepveu2004Adaptive,
title={Adaptive Language and Translation Models for Interactive Machine Translation},
author={Nepveu, Laurent and Lapalme, Guy and Langlais, Philippe and Foster, George F.},
booktitle={Conference on Empirical Methods in Natural Language Processing},
year={2004},
}
@inproceedings{DBLP:conf/emnlp/NepveuLLF04,
author = {Laurent Nepveu and
Guy Lapalme and
Philippe Langlais and
George F. Foster},
title = {Adaptive Language and Translation Models for Interactive Machine Translation},
booktitle = {Proceedings of the 2004 Conference on Empirical Methods in Natural
Language Processing , {EMNLP} 2004, {A} meeting of SIGDAT, a Special
Interest Group of the ACL, held in conjunction with {ACL} 2004, 25-26
July 2004, Barcelona, Spain},
pages = {190--197},
publisher = {{ACL}},
year = {2004},
//url = {https://www.aclweb.org/anthology/W04-3225/},
//timestamp = {Fri, 13 Sep 2019 13:08:45 +0200},
//biburl = {https://dblp.org/rec/conf/emnlp/NepveuLLF04.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{wang-etal-2018-tencent,
......@@ -6359,27 +6394,26 @@ year = {2020},
year = "2018",
address = "Belgium, Brussels",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W18-6429",
doi = "10.18653/v1/W18-6429",
//url = "https://www.aclweb.org/anthology/W18-6429",
//doi = "10.18653/v1/W18-6429",
pages = "522--527",
abstract = "We participated in the WMT 2018 shared news translation task on English→Chinese language pair. Our systems are based on attentional sequence-to-sequence models with some form of recursion and self-attention. Some data augmentation methods are also introduced to improve the translation performance. The best translation result is obtained with ensemble and reranking techniques. Our Chinese→English system achieved the highest cased BLEU score among all 16 submitted systems, and our English→Chinese system ranked the third out of 18 submitted systems.",
//abstract = "We participated in the WMT 2018 shared news translation task on English→Chinese language pair. Our systems are based on attentional sequence-to-sequence models with some form of recursion and self-attention. Some data augmentation methods are also introduced to improve the translation performance. The best translation result is obtained with ensemble and reranking techniques. Our Chinese→English system achieved the highest cased BLEU score among all 16 submitted systems, and our English→Chinese system ranked the third out of 18 submitted systems.",
}
@article{DBLP:journals/corr/LeeCH16,
@article{DBLP:journals/tacl/LeeCH17,
author = {Jason Lee and
Kyunghyun Cho and
Thomas Hofmann},
title = {Fully Character-Level Neural Machine Translation without Explicit
Segmentation},
journal = {CoRR},
volume = {abs/1610.03017},
year = {2016},
url = {http://arxiv.org/abs/1610.03017},
archivePrefix = {arXiv},
eprint = {1610.03017},
timestamp = {Mon, 13 Aug 2018 16:47:21 +0200},
biburl = {https://dblp.org/rec/journals/corr/LeeCH16.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
journal = {Transactions of the Association for Computational Linguistics},
volume = {5},
pages = {365--378},
year = {2017},
//url = {https://transacl.org/ojs/index.php/tacl/article/view/1051},
//timestamp = {Thu, 02 Apr 2020 08:34:57 +0200},
//biburl = {https://dblp.org/rec/journals/tacl/LeeCH17.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@INPROCEEDINGS{6289079,
......@@ -6400,14 +6434,52 @@ year = {2020},
number={1},
pages={145-151},}
@article{Mengzhou2019Graph,
title={Graph Based Translation Memory for Neural Machine Translation},
author={Mengzhou Xia and Guoping Huang and Lemao Liu and Shuming Shi},
year={2019},
}
@inproceedings{DBLP:conf/aaai/XiaHLS19,
author = {Mengzhou Xia and
Guoping Huang and
Lemao Liu and
Shuming Shi},
title = {Graph Based Translation Memory for Neural Machine Translation},
booktitle = {The Thirty-Third {AAAI} Conference on Artificial Intelligence, {AAAI}
2019, The Thirty-First Innovative Applications of Artificial Intelligence
Conference, {IAAI} 2019, The Ninth {AAAI} Symposium on Educational
Advances in Artificial Intelligence, {EAAI} 2019, Honolulu, Hawaii,
USA, January 27 - February 1, 2019},
pages = {7297--7304},
publisher = {{AAAI} Press},
year = {2019},
//url = {https://doi.org/10.1609/aaai.v33i01.33017297},
//doi = {10.1609/aaai.v33i01.33017297},
//timestamp = {Wed, 25 Sep 2019 11:05:09 +0200},
//biburl = {https://dblp.org/rec/conf/aaai/XiaHLS19.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/nlpcc/HeHLL19,
author = {Qiuxiang He and
Guoping Huang and
Lemao Liu and
Li Li},
title = {Word Position Aware Translation Memory for Neural Machine Translation},
booktitle = {Natural Language Processing and Chinese Computing - 8th {CCF} International
Conference, {NLPCC} 2019, Dunhuang, China, October 9-14, 2019, Proceedings,
Part {I}},
series = {Lecture Notes in Computer Science},
volume = {11838},
pages = {367--379},
publisher = {Springer},
year = {2019},
//url = {https://doi.org/10.1007/978-3-030-32233-5\_29},
//doi = {10.1007/978-3-030-32233-5\_29},
//timestamp = {Fri, 04 Oct 2019 08:40:42 +0200},
//biburl = {https://dblp.org/rec/conf/nlpcc/HeHLL19.bib},
//bibsource = {dblp computer science bibliography, https://dblp.org}
}
@book{Qiuxiang2019Word,
title={Word Position Aware Translation Memory for Neural Machine Translation},
author={Qiuxiang He and Guoping Huang and Lemao Liu and Li Li},
year={2019},
}
@INPROCEEDINGS{Och01statisticalmulti-source,
author = {Franz Josef Och and Hermann Ney},
title = {Statistical multi-source translation},
booktitle = {MT Summit 2001},
year = {2001},
pages = {253--258}
}
\ No newline at end of file