Commit b8d2ec4b by 曹润柘

合并分支 'caorunzhe' 到 'master'

Caorunzhe

查看合并请求 !429
parents 1db380f9 ba941cd7
......@@ -567,7 +567,7 @@
\parinterval 卷积是一种高效处理网格数据的计算方式,在图像、语音等领域取得了令人瞩目的成绩。本章介绍了卷积的概念及其特性,并对池化、填充等操作进行了详细的讨论。前面介绍的基于循环神经网络的翻译模型在引入注意力机制后已经大幅度超越了基于统计的机器翻译模型,但循环神经网络的计算方式导致网络整体的并行能力差,训练耗时。本章介绍了具有高并行计算能力的模型范式,即基于卷积神经网络的编码器-解码器框架。其在机器翻译任务上取得了与基于循环神经网络的GNMT模型相当的性能,并大幅度缩短了模型的训练周期。除了基础部分,本章还针对卷积计算进行了延伸,包括逐通道卷积、逐点卷积、轻量卷积和动态卷积等。除了上述提及的内容,卷积神经网络及其变种在文本分类、命名实体识别等其他自然语言处理任务上也有许多应用。
\parinterval 和机器翻译任务不同的是,文本分类任务侧重于对序列特征的提取,然后通过压缩后的特征表示做出类别预测。卷积神经网络可以对序列中一些$n$-gram特征进行提取,也可以用在文本分类任务中,其基本结构包括输入层、卷积层、池化层和全连接层。除了在本章介绍过的TextCNN模型\upcite{Kim2014ConvolutionalNN},不少研究工作在此基础上对其进行改进。比如,通过改变输入层来引入更多特征\upcite{DBLP:conf/acl/NguyenG15,DBLP:conf/aaai/LaiXLZ15},对卷积层的改进\upcite{DBLP:conf/acl/ChenXLZ015,DBLP:conf/emnlp/LeiBJ15}以及对池化层的改进\upcite{Kalchbrenner2014ACN,DBLP:conf/acl/ChenXLZ015}。在命名实体识别任务中,同样可以使用卷积神经网络来进行特征提取\upcite{DBLP:journals/jmlr/CollobertWBKKK11,DBLP:conf/cncl/ZhouZXQBX17},或者使用更高效的空洞卷积对更长的上下文进行建模\upcite{DBLP:conf/emnlp/StrubellVBM17}。此外,也有一些研究工作尝试使用卷积神经网络来提取字符级特征\upcite{DBLP:conf/acl/MaH16,DBLP:conf/emnlp/LiDWCM17,DBLP:conf/acl-codeswitch/WangCK18}
\parinterval 和机器翻译任务不同的是,文本分类任务侧重于对序列特征的提取,然后通过压缩后的特征表示做出类别预测。卷积神经网络可以对序列中一些$n$-gram特征进行提取,也可以用在文本分类任务中,其基本结构包括输入层、卷积层、池化层和全连接层。除了在本章介绍过的TextCNN模型\upcite{Kim2014ConvolutionalNN},不少研究工作在此基础上对其进行改进。比如,通过改变输入层来引入更多特征\upcite{DBLP:conf/acl/NguyenG15,DBLP:conf/aaai/LaiXLZ15},对卷积层的改进\upcite{DBLP:conf/acl/ChenXLZ015,DBLP:conf/emnlp/LeiBJ15}以及对池化层的改进\upcite{Kalchbrenner2014ACN,DBLP:conf/acl/ChenXLZ015}。在命名实体识别任务中,同样可以使用卷积神经网络来进行特征提取\upcite{2011Natural,DBLP:conf/cncl/ZhouZXQBX17},或者使用更高效的空洞卷积对更长的上下文进行建模\upcite{DBLP:conf/emnlp/StrubellVBM17}。此外,也有一些研究工作尝试使用卷积神经网络来提取字符级特征\upcite{DBLP:conf/acl/MaH16,DBLP:conf/emnlp/LiDWCM17,DBLP:conf/acl-codeswitch/WangCK18}
......
......@@ -138,7 +138,7 @@
\node [anchor=south,inner sep=2pt,minimum height=1.5em,minimum width=3.0em] (c10) at (c11.north) {\scriptsize{源语言}};
\node [anchor=south,inner sep=2pt,minimum height=1.5em,minimum width=3.0em] (c30) at (c31.north) {\small{$n$=3}};
\node [anchor=south,inner sep=2pt,minimum height=1.5em,minimum width=3.0em] (c50) at (c51.north) {\small{$S$}};
\node [anchor=south,inner sep=2pt,minimum height=1.5em,minimum width=3.0em] (c50) at (c51.north) {\small{$\mathbi{S}$}};
\node [anchor=south,inner sep=2pt] (c60) at (c61.north) {\scriptsize{进行排序}};
\node [anchor=south,inner sep=2pt] (c60-2) at (c60.north) {\scriptsize{由小到大}};
......
......@@ -46,14 +46,15 @@
\subsubsection{1. 回译}
\parinterval {\small\bfnew{回译}}(Back Translation, BT)是目前机器翻译任务上最常用的一种数据增强方法({\color{red} 参考文献!有很多})。回译的主要思想是:利用目标语言-源语言模型(反向翻译模型)来生成伪双语句对,用于训练源语言-目标语言翻译模型(前向翻译模型)。假设我们的目标是训练一个英汉翻译模型。首先,使用双语数据训练汉英翻译模型,即反向翻译模型。然后通过该模型将额外的汉语单语句子翻译为英语句子,从而得到大量的生成英语- 真实汉语伪双语句对。然后,将回译得到的伪双语句对和真实双语句对混合,训练得到最终的英汉神经机器翻译模型。
回译方法是模型无关的,只需要训练一个反向翻译模型,就可以简单有效地利用单语数据来增加训练数据的数量,因此在工业界也得到了广泛采用({\color{red} 参考文献!可以引用google和fb的论文,是不是多语言或者无监督的方法里有})。图\ref{fig:16-1-xc} 给出了回译方法的一个简要流程。
\parinterval {\small\bfnew{回译}}(Back Translation, BT)是目前机器翻译任务上最常用的一种数据增强方法({\color{red} 参考文献!有很多})。回译的主要思想是:利用目标语言$-$源语言模型(反向翻译模型)来生成伪双语句对,用于训练源语言$-$目标语言翻译模型(前向翻译模型)。假设我们的目标是训练一个英汉翻译模型。首先,使用双语数据训练汉英翻译模型,即反向翻译模型;然后,通过该模型将额外的汉语单语句子翻译为英语句子,从而得到大量的由生成英语和真实汉语组成的伪双语句对;最后,将回译得到的伪双语句对和真实双语句对混合,训练得到最终的英汉神经机器翻译模型。
\parinterval 回译方法是模型无关的,只需要训练一个反向翻译模型,就可以简单有效地利用单语数据来增加训练数据的数量,因此在工业界也得到了广泛采用({\color{red} 参考文献!可以引用google和fb的论文,是不是多语言或者无监督的方法里有})。图\ref{fig:16-1-xc} 给出了回译方法的一个简要流程。
%----------------------------------------------
\begin{figure}[htp]
\centering
\input{./Chapter16/Figures/figure-application-process-of-back-translation}
\caption{回译方法的流程}
\caption{\red{回译方法的流程}}
\label{fig:16-1-xc}
\end{figure}
%-------------------------------------------
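\parinterval 为了更直观地理解上述流程,下面给出回译数据增强的一个示意片段(Python)。其中reverse\_model、translate等接口均为示例中的假设,并非某个具体工具包的真实接口,实际实现取决于所使用的翻译系统。

\begin{verbatim}
# 示意:利用反向翻译模型(目标语言->源语言)为目标语言单语数据构造伪双语句对
# reverse_model.translate 为假设的翻译接口
def back_translate(reverse_model, tgt_mono_sents):
    pseudo_pairs = []
    for tgt_sent in tgt_mono_sents:
        src_hyp = reverse_model.translate(tgt_sent)  # 机器生成的源语言句子
        pseudo_pairs.append((src_hyp, tgt_sent))     # (生成源语言, 真实目标语言)
    return pseudo_pairs

# 将伪双语句对与真实双语数据混合后,再训练前向(源语言->目标语言)翻译模型:
# mixed_data = real_parallel_data + back_translate(reverse_model, tgt_mono)
\end{verbatim}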
......@@ -66,7 +67,7 @@
\begin{figure}[htp]
\centering
\input{./Chapter16/Figures/figure-example-of-iterative-back-translation}
\caption{迭代式回译方法的流程}
\caption{\red{迭代式回译方法的流程}}
\label{fig:16-2-xc}
\end{figure}
%-------------------------------------------
......@@ -97,7 +98,7 @@
\vspace{0.5em}
\end{itemize}
\ref{fig:16-4-xc}展示了三种加噪方法的示例。这里,$\funp{P}_{\rm{drop}}$$\funp{P}_{\rm{mask}}$均设置为0.1,表示每个词有$10\%$的概率被丢弃或屏蔽。打乱顺序的操作略微复杂,一种实现方法是,通过一个数字来表示每个词在句子中的位置,如“我”是第一个词,“你”是第三个词,然后,在每个位置生成一个$1$$n$的随机数,$n$一般设置为3,然后将每个词的位置数和对应的随机数相加,即图中的$S${\color{red} 在图中把数重新算一下,前面我改了})。 对$S$ 按照从小到大排序,根据排序后每个位置的索引从原始句子中选择对应的词,从而得到最终打乱顺序后的结果。比如,在排序后,$S_1$的值小于$S_0$,其余词则保持递增顺序,则将原始句子中的第零个词和第一个词的顺序进行交换,其他词保持不变。
图\ref{fig:16-4-xc}展示了三种加噪方法的示例。这里,$\funp{P}_{\rm{drop}}$和$\funp{P}_{\rm{mask}}$均设置为0.1,表示每个词有$10\%$的概率被丢弃或屏蔽。打乱顺序的操作略微复杂,一种实现方法是:通过一个数字来表示每个词在句子中的位置,如“我”是第一个词,“你”是第三个词;然后,在每个位置生成一个$1$到$n$的随机数,$n$一般设置为3,并将每个词的位置数和对应的随机数相加,即图中的$\mathbi{S}$;对$\mathbi{S}$按照从小到大排序,根据排序后每个位置的索引从原始句子中选择对应的词,从而得到最终打乱顺序后的结果。比如,在排序后,$S_2$的值小于$S_1$,其余位置则保持递增顺序,则将原始句子中的第一个词和第二个词交换,其他词保持不变。
%----------------------------------------------
\begin{figure}[htp]
......@@ -108,7 +109,7 @@
\end{figure}
%-------------------------------------------
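\parinterval 下面给出上文描述的打乱顺序操作的一个示意实现(Python)。其中随机数范围和$n$的取值与正文描述保持一致,仅用于说明思路。

\begin{verbatim}
import random

def shuffle_with_offset(words, n=3):
    # 每个词的位置数(从1开始)加上一个1到n的随机数,得到打分S
    s = [(i + 1) + random.randint(1, n) for i in range(len(words))]
    # 对S按从小到大排序,按排序后的下标从原句中取词,得到打乱后的结果
    order = sorted(range(len(words)), key=lambda i: s[i])
    return [words[i] for i in order]

# 例如:shuffle_with_offset(["我", "喜欢", "你"]) 可能得到 ["喜欢", "我", "你"]
\end{verbatim}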
\parinterval 和回译方法相似,加噪方法一般仅在源语言句子上进行操作,既保证了目标语言句子的流畅度,又可以提高训练数据量,增加数据的多样性({\color{red} 参考文献!})。加噪方法也被用于训练降噪自编码器,在无监督机器翻译中也得到了广泛应用,详细方法可以参考xxx节。
\parinterval 和回译方法相似,加噪方法一般仅在源语言句子上进行操作,既保证了目标语言句子的流畅度,又可以提高训练数据量,增加数据的多样性({\color{red} 参考文献!})。加噪方法也被用于训练降噪自编码器,在无监督机器翻译中也得到了广泛应用,详细方法可以参考\ref{unsupervised-NMT}节。
\vspace{0.5em}
\item {\small\sffamily\bfnew{单词替换}}
......@@ -155,10 +156,7 @@
\parinterval 可比语料大多存在于网页中,内容较为复杂,可能会存在较大比例的噪声,如HTML字符、乱码等。首先需要进行充分的数据清洗操作,得到干净的可比语料,然后从中抽取出可用的双语句对。传统的抽取方法一般基于统计模型或双语词典来实现,比如,通过计算两个不同语言句子之间的单词重叠数或BLEU值\upcite{finding2006adafre,method2008keiji},或通过排序模型或二分类器判断一个目标语言句子和一个源语言句子互译的可能性\upcite{DBLP:journals/coling/MunteanuM05,DBLP:conf/naacl/SmithQT10}。
\parinterval 另外一种比较有效的方法是根据两种语言中每个句子的表示向量来抽取。首先,对于两种语言的每个句子,分别使用词嵌入加权平均等方法计算得到句子的表示向量,然后计算每个源语言句子和目标语言句子之间的余弦相似度,相似度大于一定阈值的句对则认为是可用的双语句对\upcite{DBLP:conf/emnlp/WuZHGQLL19}。然而,不同语言单独训练得到的词嵌入可能多对应不同的表示空间,因此得到的句向量无法用于衡量两个句子的相似度。为了解决这个问题,一般使用在同一表示空间的跨语言词嵌入来表示两种语言的单词。在跨语言词嵌入中,不同语言相同意思的词对应的词嵌入具有较高的相似性,因此得到的句向量也就可以用于衡量两个句子是否表示相似的语义({\color{red} 参考文献!})。关于跨语言词嵌入的具体内容,可以参考xxx节({\color{red} 双语词典归纳一节!})。
(扩展阅读)
\parinterval 除此之外,还有很多工作对数据增强方法进行了深入的研究与探讨。探索源语言单语数据在神经机器翻译中的使用方法\upcite{DBLP:conf/emnlp/ZhangZ16};选择何种单语数据来生成伪数据带来的收益更大\upcite{DBLP:conf/emnlp/FadaeeM18,DBLP:conf/nlpcc/XuLXLLXZ19};通过特别标识对真实双语和回译生成的伪双语数据进行区分\upcite{DBLP:conf/wmt/CaswellCG19};在回译过程中对训练数据进行动态选择与加权\upcite{DBLP:journals/corr/abs200403672};利用目标端单语数据和相关的富资源语言进行数据增强\upcite{DBLP:conf/acl/XiaKAN19};通过在源语言或目标语言中随机选择某些词,将这些词替换为词表中随机的一个词,可以得到伪双语数据\upcite{DBLP:conf/emnlp/WangPDN18};随机选择句子中的某个词,将这个词的词嵌入替换为多个语义相似词的加权表示融合\upcite{DBLP:conf/acl/GaoZWXQCZL19};基于模型的不确定性来量化预测结果的置信度,从而提升回译方法的性能\upcite{DBLP:conf/emnlp/WangLWLS19};探索如何利用大规模单语数据\upcite{DBLP:conf/emnlp/WuWXQLL19};还有一些工作对数据增强进行了理论分析\upcite{DBLP:conf/emnlp/LiLHZZ19}{\color{red},发现XXXX?}。({\color{red} 这部分写得不错}
\parinterval 另外一种比较有效的方法是根据两种语言中每个句子的表示向量来抽取。首先,对于两种语言的每个句子,分别使用词嵌入加权平均等方法计算得到句子的表示向量,然后计算每个源语言句子和目标语言句子之间的余弦相似度,相似度大于一定阈值的句对则认为是可用的双语句对\upcite{DBLP:conf/emnlp/WuZHGQLL19}。然而,不同语言单独训练得到的词嵌入大多对应不同的表示空间,因此得到的句向量无法用于衡量两个句子的相似度。为了解决这个问题,一般使用在同一表示空间的跨语言词嵌入来表示两种语言的单词。在跨语言词嵌入中,不同语言里相同意思的词对应的词嵌入具有较高的相似性,因此得到的句向量也就可以用于衡量两个句子是否表示相似的语义({\color{red} 参考文献!})。关于跨语言词嵌入的具体内容,可以参考\ref{unsupervised-dictionary-induction}节。
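\parinterval 下面给出根据跨语言词嵌入抽取双语句对的一个简化示意(Python)。其中句向量使用简单平均而非加权平均,相似度阈值0.8也仅是示例取值,实际系统需要根据数据情况调整。

\begin{verbatim}
import numpy as np

def sent_vec(words, embed):
    # 句向量:对出现在(跨语言)词嵌入表中的词做平均
    vecs = [embed[w] for w in words if w in embed]
    return np.mean(vecs, axis=0)

def mine_pairs(src_sents, tgt_sents, src_embed, tgt_embed, threshold=0.8):
    # src_embed 和 tgt_embed 假设已映射到同一个跨语言表示空间
    pairs = []
    for s in src_sents:
        sv = sent_vec(s, src_embed)
        for t in tgt_sents:
            tv = sent_vec(t, tgt_embed)
            cos = float(np.dot(sv, tv) /
                        (np.linalg.norm(sv) * np.linalg.norm(tv) + 1e-8))
            if cos > threshold:   # 相似度超过阈值的句对视为可用双语句对
                pairs.append((s, t, cos))
    return pairs
\end{verbatim}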
%----------------------------------------------------------------------------------------
% NEW SUB-SECTION
......@@ -173,14 +171,13 @@
\subsubsection{1. 语言模型融合({\color{red} 参考文献较少}}
\parinterval 融合目标语言端的语言模型是一种最直接的使用单语数据的方法。实际上,神经机器翻译模型本身也具备了语言模型的作用,因为在解码器本质上也是一个语言模型,用于描述生成译文词串的规律。类似于语言模型,神经机器翻译模型可以自回归地生成翻译结果。对于一个双语句对$(x, y)$,神经机器翻译模型根据源语言句子$x$和前面生成的词来预测当前位置词的概率分布:
\parinterval 融合目标语言端的语言模型是一种最直接的使用单语数据的方法。实际上,神经机器翻译模型本身也具备了语言模型的作用,因为其解码器本质上就是一个语言模型,用于描述生成译文词串的规律。类似于语言模型,神经机器翻译模型可以自回归地生成翻译结果。对于一个双语句对$(\mathbi{x}, \mathbi{y})$,神经机器翻译模型根据源语言句子$\mathbi{x}$和前面生成的词来预测当前位置词的概率分布:
\begin{eqnarray}
\log{P(y | x; \theta)} = \sum_{t}{\log{P(y_t | x, y_{<t}; \theta)}}
\log{P(\mathbi{y} | \mathbi{x}; \theta)} = \sum_{t}{\log{P(y_t | \mathbi{x}, {\mathbi{y}}_{<t}; \theta)}}
\label{eq:16-1-xc}
\end{eqnarray}
{\color{red} 这个公式和第九章的公式最好一致!!!}
\noindent 这里,$\theta$是神经机器翻译模型的参数,$\mathbi{y}_{<t}$表示第$t$个词之前已经生成的词序列。语言模型可以与上述过程融合,具体分为浅融合和深融合两种方法\upcite{2015OnGulcehre},如图\ref{fig:16-6-xc}所示。
......@@ -188,14 +185,14 @@
\begin{figure}[htp]
\centering
\input{./Chapter16/Figures/lm-fusion}
\caption{语言模型的浅融合与深融合}
\caption{\red{语言模型的浅融合与深融合}}
\label{fig:16-6-xc}
\end{figure}
%-------------------------------------------
\parinterval 浅融合通过对神经机器翻译模型和语言模型的预测概率进行插值来得到最终的预测概率:
\begin{eqnarray}
\log{\funp{P}(y_t | x, y_{<t})} = \log{\funp{P}(y_t | x, y_{<t}; \theta_{TM})} + \beta \log{\funp{P}(y_t | y_{<t}; \theta_{LM})}
\log{\funp{P}(y_t | \mathbi{x}, \mathbi{y}_{<t})} = \log{\funp{P}(y_t | \mathbi{x}, \mathbi{y}_{<t}; \theta_{TM})} + \beta \log{\funp{P}(y_t | \mathbi{y}_{<t}; \theta_{LM})}
\label{eq:16-2-xc}
\end{eqnarray}
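\parinterval 浅融合在实现上只需对两个模型的对数概率做线性插值。下面给出一个示意片段(Python),其中$\beta$的取值仅为示例:

\begin{verbatim}
import numpy as np

def shallow_fusion(log_p_tm, log_p_lm, beta=0.3):
    # log_p_tm: 翻译模型在当前位置的对数概率分布,形状为(|V|,)
    # log_p_lm: 语言模型在当前位置的对数概率分布,形状为(|V|,)
    return log_p_tm + beta * log_p_lm

# 解码时用融合后的打分选择词,例如贪婪地取最高分:
# y_t = int(np.argmax(shallow_fusion(log_p_tm, log_p_lm)))
\end{verbatim}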
......@@ -207,7 +204,7 @@
\parinterval 深融合的预测方式为:
\begin{eqnarray}
\log{\funp{P}(y_t | x, y_{<t})}= \log{\funp{P}(y_t | x, y_{<t}; s_{t})}
\log{\funp{P}(y_t | \mathbi{x}, \mathbi{y}_{<t})}= \log{\funp{P}(y_t | \mathbi{x}, \mathbi{y}_{<t}; s_{t})}
\label{eq:16-3-xc}
\end{eqnarray}
......@@ -305,7 +302,7 @@ g_{t} = \sigma (w^{T}s_{t}^{TM} + b)
\begin{figure}[htp]
\centering
\input{./Chapter16/Figures/figure-target-side-multi-task-learning}
\caption{机器翻译中的单任务学习和多任务学习}
\caption{\red{机器翻译中的单任务学习和多任务学习}}
\label{fig:16-9-xc}
\end{figure}
%-------------------------------------------
......@@ -458,11 +455,11 @@ Joint training for neural machine translation models with monolingual data
\subsection{基于枢轴语的方法}
\parinterval 传统的多语言翻译中,广泛使用的是{\small\bfnew{基于枢轴语言的翻译}}(Pivot-based Translation)\upcite{DBLP:conf/emnlp/KimPPKN19}。在这种方法中,会使用一种数据丰富语言作为{\small\bfnew{中介语言}}或者{\small\bfnew{枢轴语言}}(Pivot Language),之后让源语言和目标语言向枢轴语言进行翻译。这样,通过资源丰富的中介语言将源语言和目标语言桥接在一起,达到解决源语言-目标语言双语数据缺乏的问题。比如,想要得到泰语到波兰语的翻译,可以通过英语做枢轴语言。通过“泰语$\rightarrow$英语$\rightarrow$波兰语”的翻译过程完成泰语到波兰语的转换。
\parinterval 传统的多语言翻译中,广泛使用的是{\small\bfnew{基于枢轴语言的翻译}}(Pivot-based Translation)\upcite{DBLP:conf/emnlp/KimPPKN19}。在这种方法中,会使用一种数据丰富的语言作为{\small\bfnew{中介语言}}或者{\small\bfnew{枢轴语言}}(Pivot Language),之后分别建立源语言到枢轴语言和枢轴语言到目标语言的翻译。这样,通过资源丰富的中介语言将源语言和目标语言桥接在一起,从而缓解源语言-目标语言双语数据缺乏的问题。比如,想要得到泰语到波兰语的翻译,可以把英语作为枢轴语言,通过“泰语$\to$英语$\to$波兰语”的翻译过程完成泰语到波兰语的转换。
\parinterval 基于枢轴语的方法很早就出现在基于统计机器翻译中。在基于短语的机器翻译中,已经有很多方法建立了源到枢轴和枢轴到目标的短语/单词级别特征,并基于这些特征开发了源语言到目标语言的系统\upcite{DBLP:conf/naacl/UtiyamaI07,DBLP:journals/mt/WuW07,Farsi2010somayeh,DBLP:conf/acl/ZahabiBK13,DBLP:conf/emnlp/ZhuHWZWZ14,DBLP:conf/acl/MiuraNSTN15},这些系统也已经广泛用于翻译稀缺资源语言对\upcite{DBLP:conf/acl/CohnL07,DBLP:journals/mt/WuW07,DBLP:conf/acl/WuW09}。由于基于枢轴语的方法与模型结构无关,因此该方法也快速适用于神经机器翻译,并且取得了不错的效果\upcite{DBLP:conf/emnlp/KimPPKN19,DBLP:journals/corr/ChengLYSX16}。比如,可以直接使用源到枢轴和枢轴到目标的两个神经机器翻译模型,之后分别用两个模型进行翻译,得到最终的结果\upcite{DBLP:conf/interspeech/KauersVFW02,de2006catalan}。在实现过程中,可以在枢轴语言中保留多个最佳翻译假设,以减少预测偏差\upcite{DBLP:conf/naacl/UtiyamaI07},并通过多系统融合改进最终翻译\upcite{DBLP:conf/ijcnlp/Costa-JussaHB11}
\parinterval 基于枢轴的方法可以被描述为如图\ref{fig:16-1-ll}所示的过程。这里,使用虚线表示具有双语平行语料库的语言对,并使用带有箭头的实线表示翻译方向,令$x$$y$$p$分别表示源语言,目标语言和枢轴语言,对于输入源语言句子$x$和目标语言句子$y$,其翻译过程可以被建模为如下公式:
\parinterval 基于枢轴的方法可以被描述为如图\ref{fig:16-1-ll}所示的过程。这里,使用虚线表示具有双语平行语料库的语言对,并使用带有箭头的实线表示翻译方向,令$\mathbi{x}$、$\mathbi{y}$和$\mathbi{p}$分别表示源语言、目标语言和枢轴语言,对于输入的源语言句子$\mathbi{x}$和目标语言句子$\mathbi{y}$,其翻译过程可以被建模为如下公式:
\begin{figure}[h]
\centering
......@@ -472,13 +469,13 @@ Joint training for neural machine translation models with monolingual data
\end{figure}
\begin{equation}
\funp{P}(y|x) =\sum_{p}{\funp{P}(p|x)\funp{P}(y|p)}
\funp{P}(\mathbi{y}|\mathbi{x}) =\sum_{\mathbi{p}}{\funp{P}(\mathbi{p}|\mathbi{x})\funp{P}(\mathbi{y}|\mathbi{p})}
\label{eq:ll-1}
\end{equation}
\noindent 其中,$p$表示一个枢轴语言句子, $\funp{P(y|x)}$为从源语句子$x$翻译到目标语句子$y$的概率,$\funp{P}(p|x)$为从源语言句子$x$翻译到枢轴语言语句子$p$的概率,$\funp{P}(y|p)$为从枢轴语言句子$p$到目标语言句子$y$的概率。
\noindent 其中,$\mathbi{p}$表示一个枢轴语言句子,$\funp{P}(\mathbi{y}|\mathbi{x})$为从源语言句子$\mathbi{x}$翻译到目标语言句子$\mathbi{y}$的概率,$\funp{P}(\mathbi{p}|\mathbi{x})$为从源语言句子$\mathbi{x}$翻译到枢轴语言句子$\mathbi{p}$的概率,$\funp{P}(\mathbi{y}|\mathbi{p})$为从枢轴语言句子$\mathbi{p}$翻译到目标语言句子$\mathbi{y}$的概率。
\parinterval $\funp{P}(p|x)$$\funp{P}(y|p)$可以直接复用既有的模型和方法。不过,枚举所有的枢轴语言语句子$p$是不可行的。因此一部分研究工作也探讨了如何选择有效的路径,从$x$经过少量$p$到达$y$\upcite{DBLP:conf/naacl/PaulYSN09}
\parinterval $\funp{P}(\mathbi{p}|\mathbi{x})$和$\funp{P}(\mathbi{y}|\mathbi{p})$可以直接复用既有的模型和方法。不过,枚举所有的枢轴语言句子$\mathbi{p}$是不可行的。因此,一部分研究工作也探讨了如何选择有效的路径,从$\mathbi{x}$经过少量的$\mathbi{p}$到达$\mathbi{y}$\upcite{DBLP:conf/naacl/PaulYSN09}。
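\parinterval 下面给出基于枢轴语言进行两次翻译的一个示意流程(Python)。其中translate和translate\_nbest均为假设的接口,保留$k$个枢轴语言候选对应于上文提到的减少预测偏差的做法:

\begin{verbatim}
def pivot_translate(src2piv_model, piv2tgt_model, src_sent, k=5):
    # 第一步:源语言 -> 枢轴语言,保留k个候选以减少预测偏差
    piv_hyps = src2piv_model.translate_nbest(src_sent, k)
    # 第二步:枢轴语言 -> 目标语言
    tgt_hyps = [piv2tgt_model.translate(p) for p in piv_hyps]
    # 实际系统中可在此基础上对多个候选进行重排序或多系统融合
    return tgt_hyps
\end{verbatim}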
\parinterval 虽然基于枢轴语的方法简单且易于实现,但该方法仍有一些不足。例如,它需要两次翻译过程,因此增加了翻译时间。而且在两次翻译中,翻译错误会不断累积,从而产生错误传播问题,导致模型翻译准确性降低。此外,基于枢轴语言的方法仍然假设源语言和枢轴语言(或者目标语言和枢轴语言)之间存在一定规模的双语平行数据,但是这个假设在很多情况下并不成立。比如,对于一些资源极度稀缺的语言,其到英语或者汉语的双语数据仍然十分缺乏,这时使用基于枢轴的方法的效果往往也并不理想。虽然存在以上问题,但是基于枢轴的方法仍然受到工业界的青睐,很多在线翻译引擎也在大量使用这种方法进行多语言的翻译。
......@@ -488,7 +485,7 @@ Joint training for neural machine translation models with monolingual data
\subsection{基于知识蒸馏的方法}
\parinterval 为了解决基于使用枢轴语言的问题,研究人员提出基于知识蒸馏的方法\upcite{DBLP:conf/acl/ChenLCL17,DBLP:conf/iclr/TanRHQZL19}。知识蒸馏是一种常用的模型压缩方法\upcite{DBLP:journals/corr/HintonVD15},基于教师-学生框架,在第十三章已经进行了详细介绍。针对稀缺资源任务,基于教师-学生框架的方法基本思想如图\ref{fig:16-2-ll}所示。其中,虚线表示具有平行语料库的语言对,带有箭头的实线表示翻译方向。这里,将枢轴语言($p$)到目标语言($y$)的翻译模型$\funp{P}(y|p)$当作教师模型,源语言($x$)到目标语言($y$)的翻译模型$\funp{P}(y|x)$当作学生模型。然后,用教师模型来指导学生模型的训练,这个过程中学习的目标就是让$\funp{P}(y|x)$尽可能地接近$\funp{P}(y|p)$,这样学生模型就可以学习到源语言到目标语言的翻译知识。
\parinterval 为了解决基于枢轴语言的方法存在的问题,研究人员提出基于知识蒸馏的方法\upcite{DBLP:conf/acl/ChenLCL17,DBLP:conf/iclr/TanRHQZL19}。知识蒸馏是一种常用的模型压缩方法\upcite{DBLP:journals/corr/HintonVD15},它基于教师-学生框架,在第十三章已经进行了详细介绍。针对稀缺资源任务,基于教师-学生框架的方法的基本思想如图\ref{fig:16-2-ll}所示。其中,虚线表示具有平行语料库的语言对,带有箭头的实线表示翻译方向。这里,将枢轴语言($\mathbi{p}$)到目标语言($\mathbi{y}$)的翻译模型$\funp{P}(\mathbi{y}|\mathbi{p})$当作教师模型,源语言($\mathbi{x}$)到目标语言($\mathbi{y}$)的翻译模型$\funp{P}(\mathbi{y}|\mathbi{x})$当作学生模型。然后,用教师模型来指导学生模型的训练,这个过程中学习的目标就是让$\funp{P}(\mathbi{y}|\mathbi{x})$尽可能地接近$\funp{P}(\mathbi{y}|\mathbi{p})$,这样学生模型就可以学习到源语言到目标语言的翻译知识。
\begin{figure}[h]
\centering
......@@ -497,16 +494,16 @@ Joint training for neural machine translation models with monolingual data
\label{fig:16-2-ll}
\end{figure}
\parinterval 需要注意的是,基于知识蒸馏的方法需要基于翻译对等假设,该假设为:如果源语言句子$x$、枢轴语言句子$p$和目标语言句子$y$这三个句子互译,则从源语言句子$x$生成目标语言句子$y$的概率$\funp{P}(y|x)$应接近与源语言句子$x$对应的$p$的概率$\funp{P}(y|p)$,即:
\parinterval 需要注意的是,基于知识蒸馏的方法需要基于翻译对等假设,该假设为:如果源语言句子$\mathbi{x}$、枢轴语言句子$\mathbi{p}$和目标语言句子$\mathbi{y}$这三个句子互译,则从源语言句子$\mathbi{x}$生成目标语言句子$\mathbi{y}$的概率$\funp{P}(\mathbi{y}|\mathbi{x})$,应接近于从与源语言句子$\mathbi{x}$对应的枢轴语言句子$\mathbi{p}$生成目标语言句子$\mathbi{y}$的概率$\funp{P}(\mathbi{y}|\mathbi{p})$,即:
\begin{equation}
\funp{P}(y|x) \approx \funp{P}(y|p)
\funp{P}(\mathbi{y}|\mathbi{x}) \approx \funp{P}(\mathbi{y}|\mathbi{p})
\label{eq:ll-2}
\end{equation}
\parinterval 和基于枢轴语言的方法相比,基于教师-学生框架的方法无需训练源语言到枢轴语言的翻译模型,也就无需经历两次翻译过程,翻译效率有所提升,又避免了两次翻译所面临的错误传播问题。举个例子,假如图\ref{fig:16-2-ll}$x$为源语言德语 “hallo”,$p$为中间语言英语 “hello”,$y$为目标语言法语“bonjour”,则德语“hallo”翻译为法语“bonjour”的概率应该与英语“hello”翻译为法语“bonjour”的概率相近。
\parinterval 和基于枢轴语言的方法相比,基于教师-学生框架的方法无需训练源语言到枢轴语言的翻译模型,也就无需经历两次翻译过程,翻译效率有所提升,又避免了两次翻译所面临的错误传播问题。举个例子,假如图\ref{fig:16-2-ll}中$\mathbi{x}$为源语言德语 “hallo”,$\mathbi{p}$为枢轴语言英语 “hello”,$\mathbi{y}$为目标语言法语“bonjour”,则德语“hallo”翻译为法语“bonjour”的概率应该与英语“hello”翻译为法语“bonjour”的概率相近。
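\parinterval 在实现时,可以让学生模型$\funp{P}(\mathbi{y}|\mathbi{x})$在每个位置上拟合教师模型$\funp{P}(\mathbi{y}|\mathbi{p})$给出的词分布,即最小化两者的交叉熵。下面是一个只依赖标准库的示意片段(Python),其中的输入形式仅为假设:

\begin{verbatim}
def distillation_loss(student_logprobs, teacher_probs):
    # student_logprobs: 学生模型在每个位置的对数概率分布(长度T,每项大小|V|)
    # teacher_probs:    教师模型在相同位置给出的概率分布
    loss = 0.0
    for lp_s, p_t in zip(student_logprobs, teacher_probs):
        # 交叉熵:- sum_y P_teacher(y) * log P_student(y)
        loss += -sum(pt * ls for pt, ls in zip(p_t, lp_s))
    return loss / max(len(student_logprobs), 1)
\end{verbatim}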
\parinterval 相较于基于枢轴语言的方法,基于知识蒸馏的方法无论在性能还是效率上都具有一定优势。但是,它仍然需要显性的使用枢轴语言进行桥接,因此仍然面临着“源语言$\rightarrow$枢轴语言$\rightarrow$目标语言”转换中信息丢失的问题。比如,当枢轴语言到目标语言翻译效果较差时,由于教师模型无法提供准确的指导,学生模型也无法取得很好的学习效果。
\parinterval 相较于基于枢轴语言的方法,基于知识蒸馏的方法无论在性能还是效率上都具有一定优势。但是,它仍然需要显性地使用枢轴语言进行桥接,因此仍然面临着“源语言$\to$枢轴语言$\to$目标语言”转换中信息丢失的问题。比如,当枢轴语言到目标语言翻译效果较差时,由于教师模型无法提供准确的指导,学生模型也无法取得很好的学习效果。
%----------------------------------------------------------------------------------------
% NEW SUB-SECTION
......@@ -516,7 +513,7 @@ Joint training for neural machine translation models with monolingual data
\parinterval {\small\bfnew{迁移学习}}(Transfer Learning)是一种机器学习的方法,指的是一个预训练的模型被重新用在另一个任务中,而并不是从头训练一个新的模型\upcite{DBLP:conf/ijcnlp/Costa-JussaHB11,DBLP:journals/corr/HintonVD15}。迁移学习的目标是将某个领域或任务上学习到的知识应用到不同但相关的领域或问题中。在机器翻译中,可以利用资源丰富的语言对中的知识来改进稀缺资源语言对上的神经机器翻译性能,即将富资源语言对中的知识迁移到稀缺资源语言对中。
\parinterval 基于枢轴语言的方法需要显性的建立“源语言$\rightarrow$枢轴语言$\rightarrow$目标语言”的路径。这时,如果路径中某处出现了问题,就会成为整个路径的瓶颈。如果使用多个枢轴语言,这个问题会更加严重。不同于基于枢轴语言的方法,迁移学习无需进行两步解码,也就避免了翻译路径中累积错误的问题。
\parinterval 基于枢轴语言的方法需要显性地建立“源语言$\to$枢轴语言$\to$目标语言”的路径。这时,如果路径中某处出现了问题,就会成为整个路径的瓶颈。如果使用多个枢轴语言,这个问题会更加严重。不同于基于枢轴语言的方法,迁移学习无需进行两步解码,也就避免了翻译路径中累积错误的问题。
\parinterval 基于迁移学习的方法的思想非常简单,如图\ref{fig:16-3-ll}所示。这种方法无需像传统的机器学习一样为每个任务单独训练一个模型,它将所有任务分为源任务和目标任务,目标就是将源任务中的知识迁移到目标任务当中。
......@@ -554,9 +551,9 @@ Joint training for neural machine translation models with monolingual data
\parinterval 多语言单模型方法也可以被看做是一种迁移学习。多语言单模型方法能够有效地改善低资源神经机器翻译性能\upcite{DBLP:journals/tacl/JohnsonSLKWCTVW17,DBLP:conf/lrec/RiktersPK18,dabre2019brief},尤其适用于翻译方向较多的情况,因为为每一个翻译方向单独训练一个模型是不实际的,不仅由于设备资源和时间上的限制,还由于很多翻译方向都没有双语平行数据。比如,要构建一个支持100个语言之间互译的系统,理论上就需要训练$100 \times 99$个翻译模型,代价十分巨大。这时就需要用到{\small\bfnew{多语言单模型方法}}\index{多语言单模型方法}(Multi-lingual Single Model-based Method\index{Multi-lingual Single Model-based Method})。
\parinterval 多语言单模型系统即用单个模型训练具有多个语言翻译方向的系统。对于源语言集合$G_x$和目标语言集合$G_y$,多语言单模型的学习目标是学习一个单一的模型,这个模型可以进行任意源语言到任意目标语言的翻译,即同时支持所有$(x,y) \in (G_x,G_y)$的翻译。多语言单模型方法又可以进一步分为一对多\upcite{DBLP:conf/acl/DongWHYW15}、多对一\upcite{DBLP:journals/tacl/LeeCH17}和多对多\upcite{DBLP:conf/naacl/FiratCB16}的方法。不过这些方法本质上是相同的,因此这里以多对多翻译为例进行介绍。
\parinterval 多语言单模型系统即用单个模型训练具有多个语言翻译方向的系统。对于源语言集合$\seq{G}_x$和目标语言集合$\seq{G}_y$,多语言单模型的学习目标是学习一个单一的模型,这个模型可以进行任意源语言到任意目标语言的翻译,即同时支持所有$(x,y) \in (\seq{G}_x,\seq{G}_y)$的翻译。多语言单模型方法又可以进一步分为一对多\upcite{DBLP:conf/acl/DongWHYW15}、多对一\upcite{DBLP:journals/tacl/LeeCH17}和多对多\upcite{DBLP:conf/naacl/FiratCB16}的方法。不过这些方法本质上是相同的,因此这里以多对多翻译为例进行介绍。
\parinterval 在模型结构方面,多语言模型与普通的神经机器翻译模型相同,都是标准的编码-解码结构。多语言单模型方法的一个假设是:不同语言可以共享同一个表示空间。因此,该方法使用同一个编码器处理所有的源语言句子,使用同一个解码器处理所有的目标语言句子。为了使多个语言共享同一个解码器(或编码器),一种简单的方法是直接在输入句子上加入语言标记,让模型显性地知道当前句子属于哪个语言。如图\ref{fig:16-5-ll}所示,在此示例中,标记“ <spanish>”指示目标句子为西班牙语,标记“ <german>”指示目标句子为德语,则模型在进行翻译时便会将句子开头加<spanish>标签的句子翻译为西班牙语\upcite{DBLP:journals/tacl/JohnsonSLKWCTVW17}。假设训练时有英语到西班牙语 “<spanish> Hello”$\rightarrow$“Hola”和法语到德语“<german> Bonjour”$\rightarrow$“Hallo” 的双语句对,则在解码时候输入英语“<german> Hello”时就会得到解码结果“Hallo”。
\parinterval 在模型结构方面,多语言模型与普通的神经机器翻译模型相同,都是标准的编码-解码结构。多语言单模型方法的一个假设是:不同语言可以共享同一个表示空间。因此,该方法使用同一个编码器处理所有的源语言句子,使用同一个解码器处理所有的目标语言句子。为了使多个语言共享同一个解码器(或编码器),一种简单的方法是直接在输入句子上加入语言标记,让模型显性地知道当前句子属于哪个语言。如图\ref{fig:16-5-ll}所示,在此示例中,标记“ <spanish>”指示目标句子为西班牙语,标记“ <german>”指示目标句子为德语,则模型在进行翻译时便会将开头加 <spanish> 标签的句子翻译为西班牙语\upcite{DBLP:journals/tacl/JohnsonSLKWCTVW17}。假设训练时有英语到西班牙语 “<spanish> Hello”$\to$“Hola”和法语到德语“<german> Bonjour”$\to$“Hallo” 的双语句对,则在解码的时候输入英语“<german> Hello”时就会得到解码结果“Hallo”。
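\parinterval 语言标记在数据层面的处理非常简单,只需在源语言句子前拼接一个表示目标语言的符号即可。下面是一个示意片段(Python),标记的具体形式与正文示例保持一致:

\begin{verbatim}
def add_lang_tag(src_sent, tgt_lang):
    # 在源语言句子开头加入目标语言标记,显性地告诉模型要翻译到哪个语言
    return "<{}> {}".format(tgt_lang, src_sent)

# add_lang_tag("Hello", "german") 得到 "<german> Hello",
# 多语言单模型在解码时会将其翻译为德语(如 "Hallo")。
\end{verbatim}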
\begin{figure}[h]
\centering
......@@ -601,7 +598,7 @@ Joint training for neural machine translation models with monolingual data
% NEW SUB-SECTION
%----------------------------------------------------------------------------------------
\subsection{无监督词典归纳}
\subsection{无监督词典归纳}\label{unsupervised-dictionary-induction}
\parinterval {\small\bfnew{词典归纳}}\index{词典归纳}(Bilingual Dictionary Induction,BDI\index{Bilingual Dictionary Induction}),也叫{\small\bfnew{词典推断}},是实现语种间单词级别翻译的任务。在统计机器翻译中,词典归纳是一项核心的任务,它从双语平行语料中发掘互为翻译的单词,是翻译知识的主要来源\upcite{黄书剑0统计机器翻译中的词对齐研究}。在端到端神经机器翻译中,词典归纳通常作为一个下游任务被用到无监督机器翻译、多语言机器翻译等任务中。在神经机器翻译中,单词通过连续化的向量来表示,即词嵌入。所有单词分布在一个高维的空间中,人们对词嵌入空间的观察发现:连续的单词嵌入空间在各种语言中显示出类似的结构,这使得直接利用词嵌入来构建双语词典成为可能({\color{red} 参考文献!})。其基本想法是先将来自不同语言的词嵌入投影到共享嵌入空间中,然后在此共享空间中归纳出双语词典({\color{red} 最好有一个图!})。研究人员们进行了众多的尝试,较早的尝试是使用一个包含数千词对的种子词典作为锚点,来学习从源语言到目标语言词嵌入空间的线性映射,将两个语言的词汇投影到共享的嵌入空间之后,执行一些对齐算法即可得到双语词典\upcite{DBLP:journals/corr/MikolovLS13}。最近的研究表明,词典归纳可以在更弱的监督信号下完成,这些监督信号可以来自仅包含数百词对的小词典\upcite{DBLP:conf/acl/VulicK16}、相同的字符串\upcite{DBLP:conf/iclr/SmithTHH17},甚至仅仅是共享的数字\upcite{DBLP:conf/acl/ArtetxeLA17}。
......@@ -627,9 +624,9 @@ Joint training for neural machine translation models with monolingual data
\begin{itemize}
\vspace{0.5em}
\item 对于图XX(a)中的分布在不同空间中的两个单语词嵌入X和Y,基于两者近似同构的假设,利用无监督匹配的方法来得到一个粗糙的线性映射W,结果如图XX(b)所示。
\item 对于图XX(a)中的分布在不同空间中的两个单语词嵌入$\mathbi{X}$$\mathbi{Y}$,基于两者近似同构的假设,利用无监督匹配的方法来得到一个粗糙的线性映射$\mathbi{W}$,结果如图XX(b)所示。
\vspace{0.5em}
\item 利用映射W可以执行对齐算法从而归纳出一个种子词典,如图XX(c)所示。
\item 利用映射$\mathbi{W}$可以执行对齐算法从而归纳出一个种子词典,如图XX(c)所示。
\vspace{0.5em}
\item 利用种子词典不断迭代微调进一步提高映射性能,最终映射的效果如图XX(d)所示,之后即可从中推断出词典作为最后的结果。
\vspace{0.5em}
......@@ -646,23 +643,23 @@ Joint training for neural machine translation models with monolingual data
\begin{itemize}
\vspace{0.5em}
\item 基于GAN的方法\upcite{DBLP:conf/iclr/LampleCRDJ18,DBLP:conf/acl/ZhangLLS17,DBLP:conf/emnlp/XuYOW18,DBLP:conf/naacl/MohiuddinJ19}。在这个任务中,通过生成器来产生映射W,鉴别器负责区分随机抽样的元素WX 和Y,两者共同优化收敛后即可得到映射W
\item 基于GAN的方法\upcite{DBLP:conf/iclr/LampleCRDJ18,DBLP:conf/acl/ZhangLLS17,DBLP:conf/emnlp/XuYOW18,DBLP:conf/naacl/MohiuddinJ19}。在这个任务中,通过生成器来产生映射$\mathbi{W}$,鉴别器负责区分随机抽样的元素$\mathbi{W}\cdot \mathbi{X}$和$\mathbi{Y}$,两者共同优化收敛后即可得到映射$\mathbi{W}$。
\vspace{0.5em}
\item 基于Gromov-Wasserstein 的方法\upcite{DBLP:conf/emnlp/Alvarez-MelisJ18,DBLP:conf/lrec/GarneauGBDL20,DBLP:journals/corr/abs-1811-01124,DBLP:conf/emnlp/XuYOW18}。Wasserstein距离是度量空间中定义两个概率分布之间距离的函数。在这个任务中,它用来衡量不同语言中单词对之间的相似性,利用空间近似同构的信息可以定义出一些目标函数,之后通过优化该目标函数也可以得到映射W
\item 基于Gromov-Wasserstein 的方法\upcite{DBLP:conf/emnlp/Alvarez-MelisJ18,DBLP:conf/lrec/GarneauGBDL20,DBLP:journals/corr/abs-1811-01124,DBLP:conf/emnlp/XuYOW18}。Wasserstein距离是度量空间中定义两个概率分布之间距离的函数。在这个任务中,它用来衡量不同语言中单词对之间的相似性,利用空间近似同构的信息可以定义出一些目标函数,之后通过优化该目标函数也可以得到映射$\mathbi{W}$
\vspace{0.5em}
\end{itemize}
\parinterval 在得到映射W之后,对于X中的任意一个单词x,通过 Wx将其映射到空间Y中,然后在Y中找到该点的最近邻点y,于是y就是x的翻译词,重复该过程即可归纳出种子词典D,第一阶段结束。事实上,由于第一阶段缺乏监督信号,得到的种子词典D会包含大量的噪音,性能并不高,因此需要进行进一步的微调。
\parinterval 在得到映射$\mathbi{W}$之后,对于$\mathbi{X}$中的任意一个单词$x$(其词嵌入为$\mathbi{x}$),通过$\mathbi{W}\cdot \mathbi{x}$将其映射到空间$\mathbi{Y}$中,然后在$\mathbi{Y}$中找到该点的最近邻点$y$,于是$y$就是$x$的翻译词,重复该过程即可归纳出种子词典$D$,第一阶段结束。事实上,由于第一阶段缺乏监督信号,得到的种子词典$D$会包含大量的噪声,性能并不高,因此需要进行进一步的微调。
\parinterval 微调的原理普遍基于普氏分析\upcite{DBLP:journals/corr/MikolovLS13}。假设现在有一个种子词典$D=\left\{x_{i}, y_{i}\right\}$其中${i \in\{1, n\}}$,和两个单语词嵌入X和Y,那么就可以将D 作为{\small\bfnew{映射锚点}}\index{映射锚点}(Anchor\index{Anchor})学习一个转移矩阵 W,使得 WX与 Y这两个空间尽可能相近,此外通过对W施加正交约束可以显著提高能\upcite{DBLP:conf/naacl/XingWLL15},于是这个优化问题就转变成了{\small\bfnew{普鲁克问题}}\index{普鲁克问题}(Procrustes Problem\index{Procrustes Problem}\upcite{DBLP:conf/iclr/SmithTHH17},可以通过{\small\bfnew{奇异值分解}}\index{奇异值分解}(Singular Value Decomposition,SVD\index{Singular Value Decomposition,SVD})来获得近似解:
\parinterval 微调的原理普遍基于普氏分析\upcite{DBLP:journals/corr/MikolovLS13}。假设现在有一个种子词典$D=\left\{x_{i}, y_{i}\right\}$,其中$i \in\{1, \ldots, n\}$,以及两个单语词嵌入$\mathbi{X}$和$\mathbi{Y}$,那么就可以将$D$作为{\small\bfnew{映射锚点}}\index{映射锚点}(Anchor\index{Anchor})来学习一个转移矩阵$\mathbi{W}$,使得$\mathbi{W}\cdot \mathbi{X}$与$\mathbi{Y}$这两个空间尽可能相近。此外,通过对$\mathbi{W}$施加正交约束可以显著提高性能\upcite{DBLP:conf/naacl/XingWLL15},于是这个优化问题就转变成了{\small\bfnew{普鲁克问题}}\index{普鲁克问题}(Procrustes Problem\index{Procrustes Problem})\upcite{DBLP:conf/iclr/SmithTHH17},可以通过{\small\bfnew{奇异值分解}}\index{奇异值分解}(Singular Value Decomposition,SVD\index{Singular Value Decomposition,SVD})来获得近似解:
\begin{eqnarray}
W^{\star} & = &\underset{W \in O_{d}(\mathbb{R})}{\operatorname{argmin}}\|W X-Y\|_{\mathrm{F}}=U V^{T} \\
\textrm{s.t.\ \ \ \ } U \Sigma V^{T} &= &\operatorname{SVD}\left(Y X^{T}\right)
\mathbi{W}^{\star} & = &\underset{\mathbi{W} \in O_{d}(\mathbb{R})}{\operatorname{argmin}}\|\mathbi{W}\cdot \mathbi{X}- \mathbi{Y} \|_{\mathrm{F}}=\mathbi{U}\cdot \mathbi{V}^{T} \\
\textrm{s.t.\ \ \ \ } \mathbi{U} \Sigma \mathbi{V}^{T} &= &\operatorname{SVD}\left(\mathbi{Y}\cdot \mathbi{X}^{T}\right)
\label{eq:16-1}
\end{eqnarray}
\noindent 其中,{\color{red} $\operatorname{SVD}(\cdot)$表示XXX}Y和X行对齐。利用上式可以获得新的W,通过W可以归纳出新的D,如此迭代进行微调最后即可以得到收敛的D。
\noindent 其中,{\color{red} $\operatorname{SVD}(\cdot)$表示XXX},$\mathbi{Y}$和$\mathbi{X}$的行按照种子词典互相对齐。利用上式可以获得新的$\mathbi{W}$,通过$\mathbi{W}$可以归纳出新的$D$,如此迭代进行微调,最后即可得到收敛的$D$。
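\parinterval 上述基于普鲁克问题的微调过程可以用几行代码实现。下面给出一个基于NumPy的示意片段(Python),其中假设$\mathbi{X}$和$\mathbi{Y}$中每一列是一个词的词嵌入、且按种子词典一一对应(这一排列方式与公式\ref{eq:16-1}中$\mathbi{W}\cdot \mathbi{X}$的形式保持一致,仅是实现上的一种约定),最近邻检索使用余弦相似度:

\begin{verbatim}
import numpy as np

def procrustes(X, Y):
    # X, Y: 形状为(d, n)的词嵌入矩阵,列按种子词典一一对应
    # 正交约束下 argmin ||W X - Y||_F 的闭式解:W = U V^T,
    # 其中 U S V^T = SVD(Y X^T)
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

def induce_dictionary(W, X, Y, src_words, tgt_words):
    # 将源语言词嵌入映射到目标语言空间,再用最近邻归纳新词典
    WX = W @ X
    WX = WX / (np.linalg.norm(WX, axis=0, keepdims=True) + 1e-8)
    Yn = Y / (np.linalg.norm(Y, axis=0, keepdims=True) + 1e-8)
    sim = WX.T @ Yn              # (n_src, n_tgt)的余弦相似度矩阵
    nn = sim.argmax(axis=1)      # 每个源语言词的最近邻目标语言词
    return {src_words[i]: tgt_words[int(j)] for i, j in enumerate(nn)}
\end{verbatim}

\parinterval 在迭代微调中,可以交替调用上述两个函数,直到归纳出的词典收敛。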
\parinterval 目前,无监督词典归纳工作主要集中在两个方向,一个方向是通过新的建模方法或对上述两阶段方法进行改进来提升无监督词典归纳的性能。{\color{red} 稍微扩展一下说,把下面的参考文献使用上可以}
......@@ -715,18 +712,18 @@ W^{\star} & = &\underset{W \in O_{d}(\mathbb{R})}{\operatorname{argmin}}\|W X-Y\
\parinterval 回顾统计机器翻译中的短语表,其实它类似于一个词典,对一个源语言短语给出相应的短语翻译({\color{red} 引用PBMT的论文,NAACL2003!})。只不过词典的基本单元是词,而短语表的基本单元是短语(或$n$-gram)。此外短语表还提供短语翻译的得分。既然短语表跟词典如此相似,那么很容易就可以把无监督词典归纳的方法移植到处理短语上,也就是把里面的词替换成短语,就可以无监督地得到短语表。
\parintervalXXX节所示,无监督词典归纳的方法依赖于词的分布式表达,也就是词嵌入。因此当把无监督词典归纳拓展到短语上时,首先需要获得短语的分布式表达。比较简单的方法是把词换成短语,然后借助无监督词典归纳相同的算法得到短语的分布式表达。最后直接应用无监督词典归纳方法,得到源语言短语与目标语言短语之间的对应。
\parinterval\ref{unsupervised-dictionary-induction}节所示,无监督词典归纳的方法依赖于词的分布式表达,也就是词嵌入。因此当把无监督词典归纳拓展到短语上时,首先需要获得短语的分布式表达。比较简单的方法是把词换成短语,然后借助无监督词典归纳相同的算法得到短语的分布式表达。最后直接应用无监督词典归纳方法,得到源语言短语与目标语言短语之间的对应。
\parinterval 尽管已经得到了短语的翻译,短语表的另外一个重要的组成部分,也就是短语对的得分(概率),无法直接由词典归纳方法给出,而这些得分在统计机器翻译模型中非常重要。在无监督词典归纳中,在推断词典的时候会为一对源语言单词和目标语言单词打分(词嵌入之间的相似度),然后根据打分来决定哪一个目标语言单词更有可能是当前源语言单词的翻译。在无监督短语归纳中,这样一个打分已经提供了对短语对质量的度量,因此经过适当的归一化处理后就可以得到短语对的得分:
\begin{eqnarray}
P(t|s)=\frac{\mathrm{cos}(s,t)/\tau}{\sum_{t'}\mathrm{cos}(s,t')\tau}
P(\mathbi{y}|\mathbi{x})=\frac{\mathrm{exp}\left(\mathrm{cos}(\mathbi{x},\mathbi{y})/\tau\right)}{\sum_{\mathbi{y}^{'}}\mathrm{exp}\left(\mathrm{cos}(\mathbi{x},\mathbi{y}^{'})/\tau\right)}
\label{eq:16-2}
\end{eqnarray}
\noindent 其中,$\mathrm{cos}$是余弦相似度,$s$是经过无监督词典归纳里$W$转换的源语言短语嵌入,$t$是目标语言短语嵌入,$t'$是所有可能的目标语短语嵌入,$\tau$控制产生的分布$P$的尖锐程度的一个超参数。
\noindent 其中,$\mathrm{cos}$是余弦相似度,$\mathbi{x}$是经过无监督词典归纳里$\mathbi{W}$转换的源语言短语嵌入,$\mathbi{y}$是目标语言短语嵌入,$\mathbi{y}^{'}$表示所有可能的目标语言短语嵌入,$\tau$是控制分布$P(\mathbi{y}|\mathbi{x})$尖锐程度的一个超参数。
\parinterval 一个问题是在无监督的情景下我们没有任何双语数据,那么如何得到最优的$\tau$?这里,可以寻找一个$\tau$使得所有$P(t|s)$ 最大({\color{red} 参考文献!})。通常,取离一个给定的$t$最接近的$s$ 而不是给定$s$ 选取最近的$t$来计算$P(t|s)$,因为给定$s$得到的最近$t$总是$P(t|s)$里概率最大的元素,这时候$\tau$总是可以通过逼近0来使得所有$P$的取值都接近1。实际中为了选取最优$\tau$我们会为$P(t|s)$$P(s|t)$ 同时优化$\tau$
\parinterval 一个问题是,在无监督的情景下我们没有任何双语数据,那么如何得到最优的$\tau$?这里,可以寻找一个$\tau$使得所有$P(\mathbi{y}|\mathbi{x})$最大({\color{red} 参考文献!})。通常,取离一个给定的$\mathbi{y}$最接近的$\mathbi{x}$,而不是给定$\mathbi{x}$选取最近的$\mathbi{y}$来计算$P(\mathbi{y}|\mathbi{x})$,因为给定$\mathbi{x}$得到的最近$\mathbi{y}$总是$P(\mathbi{y}|\mathbi{x})$里概率最大的元素,这时候$\tau$总是可以通过逼近0来使得所有$P$的取值都接近1。实际中,为了选取最优的$\tau$,我们会为$P(\mathbi{y}|\mathbi{x})$和$P(\mathbi{x}|\mathbi{y})$同时优化$\tau$。
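\parinterval 短语对得分的计算可以直接按公式\ref{eq:16-2}的定义实现。下面给出一个只依赖标准库的示意片段(Python),其中$\tau$的取值仅为示例:

\begin{verbatim}
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def phrase_prob(src_vec, tgt_vec, all_tgt_vecs, tau=0.1):
    # 按余弦相似度的softmax计算P(y|x),tau越小分布越尖锐
    num = math.exp(cos(src_vec, tgt_vec) / tau)
    den = sum(math.exp(cos(src_vec, t) / tau) for t in all_tgt_vecs)
    return num / den
\end{verbatim}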
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
......@@ -738,21 +735,20 @@ P(t|s)=\frac{\mathrm{cos}(s,t)/\tau}{\sum_{t'}\mathrm{cos}(s,t')\tau}
\parinterval 经过上述的无监督模型调优后,就获得了一个比未经调优效果更好的翻译模型。这时候一个自然的想法就是可以使用这个更好更强的翻译模型去产生质量更高的数据,然后用这些数据来继续对翻译模型进行调优,如此反复迭代一定次数后停止。这个方法也被称为{\small\bfnew{迭代优化}}\index{迭代优化}(Iterative Refinement\index{Iterative Refinement})({\color{red} 参考文献!})。
\parinterval 迭代优化也会带来另外一个问题:在每一次迭代中都会产生新的模型,应该什么时候停止生成新模型,挑选哪一个模型?因为在无监督的场景当中,没有任何真实的双语数据可以使用,所以无法使用监督学习里的校验集来对每个模型进行检验并筛选。另外,即使有很少量的双语数据(比如数百条双语句对),直接在上面挑选模型和调整超参数会导致过拟合问题,使得最后结果越来越差。一个经验上非常高效的模型选择方法是:可以({\color{red} 从???里})挑选一些句子,然后使用当前的模型把这些句子翻译过去之后再翻译回来(源语言$\to$目标语言$\to$源语言,或者目标语言$\to$源语言$\to$目标语言),得到的结果跟原始的结果计算BLEU,得分越高则效果越好。这样一种无监督模型挑选标准经验上被证明是跟使用大的双语校验集的结果高度相关的\upcite{DBLP:conf/emnlp/LampleOCDR18}
\parinterval 迭代优化也会带来另外一个问题:在每一次迭代中都会产生新的模型,应该什么时候停止生成新模型,挑选哪一个模型?因为在无监督的场景当中,没有任何真实的双语数据可以使用,所以无法使用监督学习里的校验集来对每个模型进行检验并筛选。另外,即使有很少量的双语数据(比如数百条双语句对),直接在上面挑选模型和调整超参数会导致过拟合问题,使得最后结果越来越差。一个经验上非常高效的模型选择方法是:可以({\color{red} 从???里})挑选一些句子,然后使用当前的模型把这些句子翻译过去之后再翻译回来(源语言$\to$目标语言$\to$源语言,或者目标语言$\to$源语言$\to$目标语言),将得到的结果与原始句子计算BLEU值,得分越高则效果越好。这样一种无监督模型挑选标准在经验上被证明与使用大规模双语校验集的结果高度相关\upcite{DBLP:conf/emnlp/LampleOCDR18}。
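\parinterval 这种基于“往返翻译”的无监督模型选择可以概括为下面的示意片段(Python)。其中translate接口和bleu函数均为假设,仅用于说明思路:

\begin{verbatim}
def round_trip_score(fwd_model, bwd_model, sents, bleu):
    # 把句子翻译过去再翻译回来,与原句计算BLEU,得分越高说明模型越好
    back = [bwd_model.translate(fwd_model.translate(s)) for s in sents]
    return bleu(back, sents)

# 每轮迭代优化后,对所有候选模型计算round_trip_score,保留得分最高者
\end{verbatim}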
%----------------------------------------------------------------------------------------
% NEW SUB-SECTION
%----------------------------------------------------------------------------------------
\subsection{无监督神经机器翻译({\color{red} 参考文献较少!或者加一个拓展阅读!}}
\subsection{无监督神经机器翻译({\color{red} 参考文献较少!或者加一个拓展阅读!}}\label{unsupervised-NMT}
\parinterval 既然神经机器翻译已经在很多任务上优于统计机器翻译,为什么不直接做无监督神经机器翻译呢?实际上,神经网络的黑盒特性使得我们无法像统计机器翻译那样对其进行拆解并定位问题,因此需要借用其它无监督翻译系统来训练神经机器翻译模型。
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
%----------------------------------------------------------------------------------------
\subsubsection{基于无监督统计机器翻译的方法}
\subsubsection{1. 基于无监督统计机器翻译的方法}
\parinterval 一个简单的方法是,借助已经成功的无监督方法来为神经机器翻译模型提供少量双语监督信号,然后在这个基础上训练模型。由于初始的监督信号可能很少或者包含大量噪声,因此需要逐步优化数据来重新训练出更好的模型。这个方案最简单直接的实现就是借助已经成功的无监督统计机器翻译模型产生伪双语数据来训练神经机器翻译模型,然后让模型通过迭代回译来对数据进行优化,如图\ref{fig:16-1}所示\upcite{DBLP:conf/acl/ArtetxeLA19}。这个方法的优点是直观,并且性能稳定,容易调试(所有模块都互相独立);缺点是复杂繁琐,涉及许多超参数调整工作,而且训练代价较大({\color{red} 再来一些参考文献?})。
......@@ -767,7 +763,7 @@ P(t|s)=\frac{\mathrm{cos}(s,t)/\tau}{\sum_{t'}\mathrm{cos}(s,t')\tau}
% NEW SUB-SUB-SECTION
%----------------------------------------------------------------------------------------
\subsubsection{基于无监督词典归纳的方法}
\subsubsection{2. 基于无监督词典归纳的方法}
\parinterval 既然无监督神经机器翻译问题的核心在于通过无监督方法提供初始的监督信号,另一个思路就是直接从无监督词典归纳中得到神经机器翻译模型,从而避免繁琐的无监督统计机器翻译模型训练过程,同时也避免神经机器翻译模型继承统计机器翻译模型的错误,如图\ref{fig:16-2}所示。这种方法的核心就是把翻译看成一个两阶段的过程:
......@@ -792,7 +788,7 @@ P(t|s)=\frac{\mathrm{cos}(s,t)/\tau}{\sum_{t'}\mathrm{cos}(s,t')\tau}
% NEW SUB-SUB-SECTION
%----------------------------------------------------------------------------------------
\subsubsection{更深层的融合}
\subsubsection{3. 更深层的融合}
\parinterval 为了获得更好的神经机器翻译模型,可以对训练流程和模型做更深度的整合。{\chapternine}已经介绍,神经机器翻译模型的训练包含两个阶段:初始化和优化,而无监督神经机器翻译的核心思路也是对应的两个阶段:无监督方法提供初始的监督信号和数据优化。因此,可以考虑在模型的初始化阶段使用无监督方法提供初始的监督信号,然后在优化过程中不但优化模型的参数,还优化训练使用的数据,从而避免流水线带来的错误传播,如图\ref{fig:16-3}所示\upcite{DBLP:conf/nips/ConneauL19}。
......@@ -832,21 +828,16 @@ P(t|s)=\frac{\mathrm{cos}(s,t)/\tau}{\sum_{t'}\mathrm{cos}(s,t')\tau}
\label{fig:16-4}
\end{figure}
\begin{itemize}
\vspace{0.5em}
\noindent {\small\bfnew{(1) 模型参数初始化}}
\vspace{0.5em}
\parinterval 无监督神经机器翻译的关键在于如何提供最开始的监督信号,从而启动后续的迭代流程。无监督词典归纳已经可以提供一些可靠的监督信号,那么如何在模型初始化中融入这些信息?既然神经机器翻译模型都使用词嵌入层作为输入,而无监督词典归纳总是首先把两个语言各自的单语词嵌入映射到一个空间后才归纳双语词典,那么可以使用这些映射后的词嵌入来初始化模型的词嵌入层,然后在这个基础上训练模型,因为这些映射后的词嵌入天然就包含了大量的监督信号,比如,两个语言里意思相近的词对应的词嵌入会比其他词更靠近对方\upcite{DBLP:journals/ipm/FarhanTAJATT20}。 为了防止训练过程中模型参数的更新会破坏词嵌入当中的词对齐信息,通常初始化后会固定模型的词嵌入层不让其更新。
\item {\small\bfnew{模型参数初始化}}。无监督神经机器翻译的关键在于如何提供最开始的监督信号,从而启动后续的迭代流程。无监督词典归纳已经可以提供一些可靠的监督信号,那么如何在模型初始化中融入这些信息?既然神经机器翻译模型都使用词嵌入层作为输入,而无监督词典归纳总是首先把两个语言各自的单语词嵌入映射到一个空间后才归纳双语词典,那么可以使用这些映射后的词嵌入来初始化模型的词嵌入层,然后在这个基础上训练模型,因为这些映射后的词嵌入天然就包含了大量的监督信号,比如,两个语言里意思相近的词对应的词嵌入会比其他词更靠近对方\upcite{DBLP:journals/ipm/FarhanTAJATT20}。 为了防止训练过程中模型参数的更新会破坏词嵌入当中的词对齐信息,通常初始化后会固定模型的词嵌入层不让其更新。
\parinterval 进一步的研究表明,无监督神经机器翻译能在提供更少监督信号的情况下启动,也就是可以去除无监督词典归纳这一步骤({\color{red} 参考文献!})。这时候模型的初始化直接使用共享词表的预训练模型的参数作为起始点。这个预训练模型直接使用前面提到的预训练方法(如MASS)进行训练,区别在于模型的大小如宽度和深度需要严格匹配翻译模型。此外,这个模型不仅仅只在一个语言的单语数据上进行训练,而是同时在两个语言的单语数据上进行训练,并且两个语言的词表进行共享。前面提到,在共享词表特别是共享子词词表的情况下,已经隐式地告诉模型源语言和目标语言里一样的(子)词互为翻译,相当于模型使用了少量的监督信号。在此基础上使用两个语言的单语数据进行预训练,则通过模型共享进一步挖掘了语言之间共通的部分。因此,使用预训练模型进行初始化后,无监督神经机器翻译模型已经得到大量的监督信号,从而得以不断通过优化来提升模型性能。
\vspace{0.5em}
\noindent {\small\bfnew{(2) 语言模型的使用}}
\vspace{0.5em}
\item {\small\bfnew{语言模型的使用}}。无监督神经机器翻译的一个重要部分就是来自语言模型的目标函数。因为翻译模型本质上是在完成文本生成任务,所以只有文本生成类型的语言模型建模方法才可以应用到无监督神经机器翻译里。比如,经典的给定前文预测下一词就是一个典型的自回归生成任务(见{\chaptertwo}),因此可以运用到无监督神经机器翻译里。但是,目前在预训练里流行的BERT等模型是自编码模型({\color{red} 参考文献!}),不能直接在无监督神经机器翻译里使用。
\parinterval 无监督神经机器翻译的一个重要部分就是来自语言模型的目标函数。因为翻译模型本质上是在完成文本生成任务,所以只有文本生成类型的语言模型建模方法才可以应用到无监督神经机器翻译里。比如,经典的给定前文预测下一词就是一个典型的自回归生成任务(见{\chaptertwo}),因此可以运用到无监督神经机器翻译里。但是,目前在预训练里流行的BERT等模型是自编码模型({\color{red} 参考文献!}),就不能直接在无监督神经翻译里使用。
\parinterval 另外一个在无监督神经机器翻译中比较常见的语言模型目标函数则是{\small\bfnew{降噪自编码器}}\index{降噪自编码器}(Denoising Autoencoder\index{降噪自编码器})。它也是文本生成类型的语言模型建模方法。对于一个句子$x$,首先使用一个噪声函数$x'=\mathrm{noise}(x)$ 来对$x$注入噪声,产生一个质量较差的句子$x'$。然后,让模型学习如何从$x'$还原出$x$。这样一个目标函数比预测下一词更贴近翻译任务的本质,因为它是一个序列到序列的映射,并且输入输出两个序列在语义上是等价的。通常来说,噪声函数$\mathrm{noise}$有三种形式,如表\ref{tab:16-1}所示。
\parinterval 另外一个在无监督神经机器翻译中比较常见的语言模型目标函数则是{\small\bfnew{降噪自编码器}}\index{降噪自编码器}(Denoising Autoencoder\index{降噪自编码器})。它也是文本生成类型的语言模型建模方法。对于一个句子$\mathbi{x}$,首先使用一个噪声函数$\mathbi{x}^{'}=\mathrm{noise}(\mathbi{x})$ 来对$\mathbi{x}$注入噪声,产生一个质量较差的句子$\mathbi{x}^{'}$。然后,让模型学习如何从$\mathbi{x}^{'}$还原出$\mathbi{x}$。这样一个目标函数比预测下一词更贴近翻译任务的本质,因为它是一个序列到序列的映射,并且输入输出两个序列在语义上是等价的。通常来说,噪声函数$\mathrm{noise}$有三种形式,如表\ref{tab:16-1}所示。
\begin{table}[h]
\centering
......@@ -864,6 +855,19 @@ P(t|s)=\frac{\mathrm{cos}(s,t)/\tau}{\sum_{t'}\mathrm{cos}(s,t')\tau}
\end{table}
\parinterval 实际当中三种形式的噪声函数都会被使用到。其中,在交换方法中越相近的词越容易被交换,并且会保证被交换的词对的数量有限;而在删除和空白方法里,词的删除和替换概率通常都非常低,如$0.1$等。本列表之后给出了加噪操作的一个示意实现。
\vspace{0.5em}
\end{itemize}
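\parinterval 下面给出表\ref{tab:16-1}中三种加噪方式的一个示意实现(Python)。其中的概率和交换次数上限均为示例取值,占位符<blank>也仅是一种可能的记号:

\begin{verbatim}
import random

def noise(words, p_drop=0.1, p_blank=0.1, num_swaps=2):
    out = []
    for w in words:
        r = random.random()
        if r < p_drop:
            continue                      # 删除:以较低概率丢弃该词
        elif r < p_drop + p_blank:
            out.append("<blank>")         # 空白:用占位符替换该词
        else:
            out.append(w)
    for _ in range(num_swaps):            # 交换:在相邻位置间做有限次交换
        if len(out) > 1:
            i = random.randrange(len(out) - 1)
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

# 降噪自编码器的训练样本即 (noise(x), x):模型学习从加噪句子还原原句
\end{verbatim}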
{\color{red} 降噪自编码器需要再多说一下,因为这部分还是挺新颖的。比如,它解决了什么问题?为什么要降噪?数学本质是什么?常用的结构?等等}
%----------------------------------------------------------------------------------------
% NEW SECTION
%----------------------------------------------------------------------------------------
\section{小结及深入阅读}
(扩展阅读)
\parinterval 除此之外,还有很多工作对数据增强方法进行了深入的研究与探讨。探索源语言单语数据在神经机器翻译中的使用方法\upcite{DBLP:conf/emnlp/ZhangZ16};选择何种单语数据来生成伪数据带来的收益更大\upcite{DBLP:conf/emnlp/FadaeeM18,DBLP:conf/nlpcc/XuLXLLXZ19};通过特别标识对真实双语和回译生成的伪双语数据进行区分\upcite{DBLP:conf/wmt/CaswellCG19};在回译过程中对训练数据进行动态选择与加权\upcite{DBLP:journals/corr/abs200403672};利用目标端单语数据和相关的富资源语言进行数据增强\upcite{DBLP:conf/acl/XiaKAN19};通过在源语言或目标语言中随机选择某些词,将这些词替换为词表中随机的一个词,可以得到伪双语数据\upcite{DBLP:conf/emnlp/WangPDN18};随机选择句子中的某个词,将这个词的词嵌入替换为多个语义相似词的加权表示融合\upcite{DBLP:conf/acl/GaoZWXQCZL19};基于模型的不确定性来量化预测结果的置信度,从而提升回译方法的性能\upcite{DBLP:conf/emnlp/WangLWLS19};探索如何利用大规模单语数据\upcite{DBLP:conf/emnlp/WuWXQLL19};还有一些工作对数据增强进行了理论分析\upcite{DBLP:conf/emnlp/LiLHZZ19}{\color{red},发现XXXX?}。({\color{red} 这部分写得不错}
......@@ -2166,6 +2166,6 @@ Jobs was the CEO of {\red{\underline{apple}}}.
\vspace{0.5em}
\item 为了进一步提高神经语言模型性能,除了改进模型,还可以在模型中引入新的结构或是其他有效信息,该领域也有很多典型工作值得关注。例如在神经语言模型中引入除了词嵌入以外的单词特征,如语言特征(形态、语法、语义特征等)\upcite{Wu2012FactoredLM,Adel2015SyntacticAS}、上下文信息\upcite{mikolov2012context,Wang2015LargerContextLM}、知识图谱等外部知识\upcite{Ahn2016ANK};或是在神经语言模型中引入字符级信息,将其作为字符特征单独\upcite{Kim2016CharacterAwareNL,Hwang2017CharacterlevelLM}或与单词特征一起\upcite{Onoe2016GatedWR,Verwimp2017CharacterWordLL}送入模型中;在神经语言模型中引入双向模型也是一种十分有效的尝试,在单词预测时可以同时利用来自过去和未来的文本信息\upcite{Graves2013HybridSR,bahdanau2014neural,Peters2018DeepCW}
\vspace{0.5em}
\item 词嵌入是自然语言处理近些年的重要进展。所谓“嵌入”是一类方法,理论上,把一个事物进行分布式表示的过程都可以被看作是广义上的“嵌入”。基于这种思想的表示学习也成为了自然语言处理中的前沿方法。比如,如何对树结构,甚至图结构进行分布式表示成为了分析自然语言的重要方法\upcite{DBLP:journals/corr/abs-1809-01854,Yin2018StructVAETL,Aharoni2017TowardsSN,Bastings2017GraphCE,KoncelKedziorski2019TextGF}。此外,除了语言建模,还有很多方式可以进行词嵌入的学习,比如,SENNA\upcite{collobert2011natural}、word2vec\upcite{DBLP:journals/corr/abs-1301-3781,mikolov2013distributed}、Glove\upcite{DBLP:conf/emnlp/PenningtonSM14}、CoVe\upcite{mccann2017learned} 等。
\item 词嵌入是自然语言处理近些年的重要进展。所谓“嵌入”是一类方法,理论上,把一个事物进行分布式表示的过程都可以被看作是广义上的“嵌入”。基于这种思想的表示学习也成为了自然语言处理中的前沿方法。比如,如何对树结构,甚至图结构进行分布式表示成为了分析自然语言的重要方法\upcite{DBLP:journals/corr/abs-1809-01854,Yin2018StructVAETL,Aharoni2017TowardsSN,Bastings2017GraphCE,KoncelKedziorski2019TextGF}。此外,除了语言建模,还有很多方式可以进行词嵌入的学习,比如,SENNA\upcite{2011Natural}、word2vec\upcite{DBLP:journals/corr/abs-1301-3781,mikolov2013distributed}、Glove\upcite{DBLP:conf/emnlp/PenningtonSM14}、CoVe\upcite{mccann2017learned} 等。
\vspace{0.5em}
\end{itemize}
......@@ -3867,8 +3867,7 @@ year = {2012}
volume={18},
number={4},
pages={467--479},
year={1992},
publisher={MIT Press}
year={1992}
}
@inproceedings{mikolov2012context,
......@@ -3877,10 +3876,9 @@ year = {2012}
Tomas and
Zweig and
Geoffrey},
booktitle={2012 IEEE Spoken Language Technology Workshop (SLT)},
publisher={IEEE Spoken Language Technology Workshop},
pages={234--239},
year={2012},
organization={IEEE}
year={2012}
}
@article{zaremba2014recurrent,
......@@ -3905,7 +3903,7 @@ year = {2012}
Jan and
Schmidhuber and
Jurgen},
journal={arXiv: Learning},
journal={International Conference on Machine Learning},
year={2016}
}
......@@ -3917,7 +3915,7 @@ year = {2012}
Nitish Shirish and
Socher and
Richard},
journal={arXiv: Computation and Language},
journal={International Conference on Learning Representations},
year={2017}
}
......@@ -3934,12 +3932,11 @@ year = {2012}
@article{baydin2017automatic,
title ={Automatic differentiation in machine learning: a survey},
author ={Baydin, At{\i}l{\i}m G{\"u}nes and Pearlmutter, Barak A and Radul, Alexey Andreyevich and Siskind, Jeffrey Mark},
journal ={The Journal of Machine Learning Research},
journal ={Journal of Machine Learning Research},
volume ={18},
number ={1},
pages ={5595--5637},
year ={2017},
publisher ={JMLR. org}
year ={2017}
}
@article{qian1999momentum,
......@@ -3977,9 +3974,8 @@ year = {2012}
author = {Diederik P. Kingma and
Jimmy Ba},
title = {Adam: {A} Method for Stochastic Optimization},
booktitle = {3rd International Conference on Learning Representations, {ICLR} 2015,
San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings},
year = {2015},
publisher = {International Conference on Learning Representations},
year = {2015}
}
@inproceedings{ioffe2015batch,
......@@ -3987,13 +3983,10 @@ year = {2012}
Christian Szegedy},
title = {Batch Normalization: Accelerating Deep Network Training by Reducing
Internal Covariate Shift},
booktitle = {Proceedings of the 32nd International Conference on Machine Learning,
{ICML} 2015, Lille, France, 6-11 July 2015},
series = {{JMLR} Workshop and Conference Proceedings},
publisher = {International Conference on Machine Learning},
volume = {37},
pages = {448--456},
publisher = {JMLR.org},
year = {2015},
year = {2015}
}
@article{Ba2016LayerN,
......@@ -4003,7 +3996,7 @@ year = {2012}
title = {Layer Normalization},
journal = {CoRR},
volume = {abs/1607.06450},
year = {2016},
year = {2016}
}
@inproceedings{mikolov2013distributed,
......@@ -4013,11 +4006,9 @@ year = {2012}
Gregory S. Corrado and
Jeffrey Dean},
title = {Distributed Representations of Words and Phrases and their Compositionality},
booktitle = {Advances in Neural Information Processing Systems 26: 27th Annual
Conference on Neural Information Processing Systems 2013. Proceedings
of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States},
publisher = {Conference on Neural Information Processing Systems},
pages = {3111--3119},
year = {2013},
year = {2013}
}
@inproceedings{arthur2016incorporating,
......@@ -4025,12 +4016,9 @@ year = {2012}
Graham Neubig and
Satoshi Nakamura},
title = {Incorporating Discrete Translation Lexicons into Neural Machine Translation},
booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural
Language Processing, {EMNLP} 2016, Austin, Texas, USA, November 1-4,
2016},
pages = {1557--1567},
publisher = {The Association for Computational Linguistics},
year = {2016},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{stahlberg2016syntactically,
......@@ -4039,10 +4027,7 @@ year = {2012}
Aurelien Waite and
Bill Byrne},
title = {Syntactically Guided Neural Machine Translation},
booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics, {ACL} 2016, August 7-12, 2016, Berlin, Germany, Volume
2: Short Papers},
publisher = {The Association for Computer Linguistics},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016},
}
......@@ -4051,12 +4036,9 @@ year = {2012}
Alessandro Moschitti},
title = {Embedding Semantic Similarity in Tree Kernels for Domain Adaptation
of Relation Extraction},
booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational
Linguistics, {ACL} 2013, 4-9 August 2013, Sofia, Bulgaria, Volume
1: Long Papers},
pages = {1498--1507},
publisher = {The Association for Computer Linguistics},
year = {2013},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2013}
}
@inproceedings{perozzi2014deepwalk,
......@@ -4064,42 +4046,32 @@ year = {2012}
Rami Al-Rfou and
Steven Skiena},
title = {DeepWalk: online learning of social representations},
booktitle = {The 20th {ACM} {SIGKDD} International Conference on Knowledge Discovery
and Data Mining, {KDD} '14, New York, NY, {USA} - August 24 - 27,
2014},
publisher = {ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages = {701--710},
publisher = {{ACM}},
year = {2014},
year = {2014}
}
@article{collobert2011natural,
author = {Ronan Collobert and
Jason Weston and
L{\'{e}}on Bottou and
Michael Karlen and
Koray Kavukcuoglu and
Pavel P. Kuksa},
title = {Natural Language Processing (Almost) from Scratch},
journal = {Journal of Machine Learning Research},
volume = {12},
pages = {2493--2537},
year = {2011},
@article{2011Natural,
title={Natural Language Processing (almost) from Scratch},
  author={Collobert, Ronan and Weston, Jason and Bottou, Léon and Karlen, Michael and Kavukcuoglu, Koray and Kuksa, Pavel},
journal={Journal of Machine Learning Research},
volume={12},
number={1},
pages={2493-2537},
year={2011}
}
@inproceedings{mccann2017learned,
author = {Bryan McCann and
James Bradbury and
Caiming Xiong and
Richard Socher},
title = {Learned in Translation: Contextualized Word Vectors},
booktitle = {Advances in Neural Information Processing Systems 30: Annual Conference
on Neural Information Processing Systems 2017, 4-9 December 2017,
Long Beach, CA, {USA}},
booktitle = {Conference on Neural Information Processing Systems},
pages = {6294--6305},
year = {2017},
year = {2017}
}
%%%%%%%%%%%%%%%%%%%%%%%神经语言模型,检查修改%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%神经语言模型,检查修改%%%%%%%%%%%%%%%%%%%%%%%%%
@inproceedings{Peters2018DeepCW,
title={Deep contextualized word representations},
author={Matthew E. Peters and
......@@ -4135,13 +4107,13 @@ year = {2012}
}
@inproceedings{Onoe2016GatedWR,
title={Gated Word-Character Recurrent Language Model},
author={Yasumasa Miyamoto and
Kyunghyun Cho},
publisher={arXiv preprint arXiv:1606.01700},
year={2016}
author = {Yasumasa Miyamoto and
Kyunghyun Cho},
title = {Gated Word-Character Recurrent Language Model},
pages = {1992--1997},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{Hwang2017CharacterlevelLM,
title={Character-level language modeling with hierarchical recurrent neural networks},
author={Kyuyeon Hwang and
......@@ -4216,12 +4188,11 @@ year = {2012}
Ruocheng Guo and
Adrienne Raglin and
Huan Liu},
journal={ACM SIGKDD Explorations Newsletter},
journal={ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
volume={22},
number={1},
pages={18--33},
year={2020},
publisher={ACM New York, NY, USA}
year={2020}
}
@incollection{nguyen2019understanding,
......@@ -4231,7 +4202,7 @@ year = {2012}
Jeff Clune},
pages={55--76},
year={2019},
publisher={Explainable AI}
publisher={Springer}
}
@inproceedings{yang2017improving,
title={Improving adversarial neural machine translation with prior knowledge},
......@@ -4250,15 +4221,16 @@ year = {2012}
title={Incorporating source syntax into transformer-based neural machine translation},
author={Anna Currey and
Kenneth Heafield},
publisher={Proceedings of the Fourth Conference on Machine Translation},
publisher={Annual Meeting of the Association for Computational Linguistics},
pages={24--33},
year={2019}
}
@article{currey2018multi,
title={Multi-source syntactic neural machine translation},
author={Anna Currey and
Kenneth Heafield},
journal={arXiv preprint arXiv:1808.10267},
journal={Conference on Empirical Methods in Natural Language Processing},
year={2018}
}
@inproceedings{marevcek2018extracting,
......@@ -4272,7 +4244,7 @@ year = {2012}
@article{blevins2018deep,
title={Deep rnns encode soft hierarchical syntax},
author={Blevins, Terra and Levy, Omer and Zettlemoyer, Luke},
journal={arXiv preprint arXiv:1805.04218},
journal={Annual Meeting of the Association for Computational Linguistics},
year={2018}
}
@inproceedings{Yin2018StructVAETL,
......@@ -4288,7 +4260,7 @@ year = {2012}
title={Towards String-To-Tree Neural Machine Translation},
author={Roee Aharoni and
Yoav Goldberg},
journal={arXiv preprint arXiv:1704.04743},
journal={Annual Meeting of the Association for Computational Linguistics},
year={2017}
}
......@@ -4308,9 +4280,8 @@ year = {2012}
Dhanush Bekal and Yi Luan and
Mirella Lapata and
Hannaneh Hajishirzi},
journal={ArXiv},
year={2019},
volume={abs/1904.02342}
journal={Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year={2019}
}
@article{Kovalerchuk2020SurveyOE,
......@@ -4327,7 +4298,7 @@ year = {2012}
title={Towards A Rigorous Science of Interpretable Machine Learning},
author={Finale Doshi-Velez and
Been Kim},
journal={arXiv: Machine Learning},
journal={arXiv preprint arXiv:1702.08608},
year={2017}
}
......@@ -4349,7 +4320,7 @@ year = {2012}
title = {Does Multi-Encoder Help? {A} Case Study on Context-Aware Neural Machine
Translation},
pages = {3512--3518},
publisher = {Association for Computational Linguistics},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
......@@ -4359,7 +4330,7 @@ year = {2012}
Abe Ittycheriah},
title = {Supervised Attentions for Neural Machine Translation},
pages = {2283--2288},
publisher = {The Association for Computational Linguistics},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
......@@ -4370,7 +4341,7 @@ year = {2012}
Eiichiro Sumita},
title = {Neural Machine Translation with Supervised Attention},
pages = {3093--3102},
publisher = {The Association for Computational Linguistics},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
......@@ -4384,16 +4355,16 @@ year = {2012}
title = {Fast and Robust Neural Network Joint Models for Statistical Machine
Translation},
pages = {1370--1380},
publisher = {The Association for Computer Linguistics},
year = {2014},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2014}
}
@inproceedings{Schwenk_continuousspace,
author = {Holger Schwenk},
title = {Continuous Space Translation Models for Phrase-Based Statistical Machine
Translation},
pages = {1071--1080},
publisher = {Indian Institute of Technology Bombay},
year = {2012},
publisher = {International Conference on Computational Linguistics},
year = {2012}
}
@inproceedings{kalchbrenner-blunsom-2013-recurrent,
author = {Nal Kalchbrenner and
......@@ -4401,25 +4372,24 @@ year = {2012}
title = {Recurrent Continuous Translation Models},
pages = {1700--1709},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2013},
year = {2013}
}
@article{HochreiterThe,
author = {Sepp Hochreiter},
title = {The Vanishing Gradient Problem During Learning Recurrent Neural Nets
and Problem Solutions},
journal = {International Journal of Uncertainty, Fuzziness and Knowledge-Based
Systems},
journal = {International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems},
volume = {6},
number = {2},
pages = {107--116},
year = {1998},
year = {1998}
}
@article{BENGIO1994Learning,
author = {Yoshua Bengio and
Patrice Y. Simard and
Paolo Frasconi},
title = {Learning long-term dependencies with gradient descent is difficult},
journal = {Institute of Electrical and Electronics Engineers},
journal = {IEEE Transactions on Neural Networks},
volume = {5},
number = {2},
pages = {157--166},
......@@ -4435,15 +4405,14 @@ author = {Yoshua Bengio and
Lukasz Kaiser and
Illia Polosukhin},
title = {Attention is All you Need},
publisher = {Advances in Neural Information Processing Systems 30: Annual Conference
on Neural Information Processing Systems},
publisher = {Conference on Neural Information Processing Systems},
pages = {5998--6008},
year = {2017},
year = {2017}
}
@article{StahlbergNeural,
title={Neural Machine Translation: A Review},
author={Felix Stahlberg},
journal={journal of artificial intelligence research},
journal={Journal of Artificial Intelligence Research},
year={2020},
volume={69},
pages={343-418}
......@@ -4455,8 +4424,8 @@ author = {Yoshua Bengio and
Marcello Federico},
title = {Neural versus Phrase-Based Machine Translation Quality: a Case Study},
pages = {257--267},
publisher = {The Association for Computational Linguistics},
year = {2016},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@article{Hassan2018AchievingHP,
author = {Hany Hassan and
......@@ -4498,19 +4467,19 @@ author = {Yoshua Bengio and
Lidia S. Chao},
title = {Learning Deep Transformer Models for Machine Translation},
pages = {1810--1822},
publisher = {Association for Computational Linguistics},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@article{Li2020NeuralMT,
@inproceedings{Li2020NeuralMT,
author = {Yanyang Li and
Qiang Wang and
Tong Xiao and
Tongran Liu and
Jingbo Zhu},
title = {Neural Machine Translation with Joint Representation},
journal = {CoRR},
volume = {abs/2002.06546},
year = {2020},
pages = {8285--8292},
publisher = {AAAI Conference on Artificial Intelligence},
year = {2020}
}
@article{HochreiterLong,
author = {Hochreiter, Sepp and Schmidhuber, Jürgen},
......@@ -4519,7 +4488,7 @@ author = {Yoshua Bengio and
pages = {1735-80},
title = {Long Short-term Memory},
volume = {9},
journal = {Neural computation},
journal = {Neural Computation}
}
@inproceedings{Cho2014Learning,
author = {Kyunghyun Cho and
......@@ -4531,24 +4500,18 @@ author = {Yoshua Bengio and
Yoshua Bengio},
title = {Learning Phrase Representations using {RNN} Encoder-Decoder for Statistical
Machine Translation},
publisher = {Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing, {EMNLP} 2014, October 25-29, 2014, Doha, Qatar,
{A} meeting of SIGDAT, a Special Interest Group of the {ACL}},
publisher = {Annual Meeting of the Association for Computational Linguistics},
pages = {1724--1734},
//publisher = {{ACL}},
year = {2014},
year = {2014}
}
@inproceedings{pmlr-v9-glorot10a,
author = {Xavier Glorot and
Yoshua Bengio},
title = {Understanding the difficulty of training deep feedforward neural networks},
publisher = {Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics, {AISTATS} 2010, Chia Laguna Resort, Sardinia,
Italy, May 13-15, 2010},
publisher = {International Conference on Artificial Intelligence and Statistics},
volume = {9},
pages = {249--256},
//publisher = {JMLR.org},
year = {2010},
year = {2010}
}
@inproceedings{xiao2017fast,
author = {Tong Xiao and
......@@ -4556,12 +4519,9 @@ author = {Yoshua Bengio and
Tongran Liu and
Chunliang Zhang},
title = {Fast Parallel Training of Neural Language Models},
publisher = {Proceedings of the Twenty-Sixth International Joint Conference on
Artificial Intelligence, {IJCAI} 2017, Melbourne, Australia, August
19-25, 2017},
publisher = {International Joint Conference on Artificial Intelligence},
pages = {4193--4199},
//publisher = {ijcai.org},
year = {2017},
year = {2017}
}
@inproceedings{Gu2017NonAutoregressiveNM,
author = {Jiatao Gu and
......@@ -4571,7 +4531,7 @@ author = {Yoshua Bengio and
Richard Socher},
title = {Non-Autoregressive Neural Machine Translation},
publisher = {International Conference on Learning Representations},
year = {2018},
year = {2018}
}
@inproceedings{li-etal-2018-simple,
author = {Yanyang Li and
......@@ -4581,12 +4541,9 @@ author = {Yoshua Bengio and
Changming Xu and
Jingbo Zhu},
title = {A Simple and Effective Approach to Coverage-Aware Neural Machine Translation},
publisher = {Proceedings of the 56th Annual Meeting of the Association for Computational
Linguistics, {ACL} 2018, Melbourne, Australia, July 15-20, 2018, Volume
2: Short Papers},
publisher = {Annual Meeting of the Association for Computational Linguistics},
pages = {292--297},
//publisher = {Association for Computational Linguistics},
year = {2018},
year = {2018}
}
@inproceedings{TuModeling,
author = {Zhaopeng Tu and
......@@ -4595,11 +4552,8 @@ author = {Yoshua Bengio and
Xiaohua Liu and
Hang Li},
title = {Modeling Coverage for Neural Machine Translation},
publisher = {Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics, {ACL} 2016, August 7-12, 2016, Berlin, Germany, Volume
1: Long Papers},
//publisher = {The Association for Computer Linguistics},
year = {2016},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{DBLP:journals/corr/SennrichFCBHHJL17,
author = {Rico Sennrich and
......@@ -4614,23 +4568,17 @@ author = {Yoshua Bengio and
Jozef Mokry and
Maria Nadejde},
title = {Nematus: a Toolkit for Neural Machine Translation},
publisher = {Proceedings of the 15th Conference of the European Chapter of the
Association for Computational Linguistics, {EACL} 2017, Valencia,
Spain, April 3-7, 2017, Software Demonstrations},
publisher = {European Association of Computational Linguistics},
pages = {65--68},
//publisher = {Association for Computational Linguistics},
year = {2017},
year = {2017}
}
@inproceedings{DBLP:journals/corr/abs-1905-13324,
author = {Biao Zhang and
Rico Sennrich},
title = {A Lightweight Recurrent Network for Sequence Modeling},
publisher = {Proceedings of the 57th Conference of the Association for Computational
Linguistics, {ACL} 2019, Florence, Italy, July 28- August 2, 2019,
Volume 1: Long Papers},
publisher = {Annual Meeting of the Association for Computational Linguistics},
pages = {1538--1548},
//publisher = {Association for Computational Linguistics},
year = {2019},
year = {2019}
}
@article{Lei2017TrainingRA,
author = {Tao Lei and
......@@ -4639,7 +4587,7 @@ author = {Yoshua Bengio and
title = {Training RNNs as Fast as CNNs},
journal = {CoRR},
volume = {abs/1709.02755},
year = {2017},
year = {2017}
}
@inproceedings{Zhang2018SimplifyingNM,
author = {Biao Zhang and
......@@ -4649,22 +4597,18 @@ author = {Yoshua Bengio and
Huiji Zhang},
title = {Simplifying Neural Machine Translation with Addition-Subtraction Twin-Gated
Recurrent Networks},
publisher = {Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing, Brussels, Belgium, October 31 - November 4, 2018},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {4273--4283},
//publisher = {Association for Computational Linguistics},
year = {2018},
year = {2018}
}
@inproceedings{Liu_2019_CVPR,
author = {Shikun Liu and
Edward Johns and
Andrew J. Davison},
title = {End-To-End Multi-Task Learning With Attention},
publisher = {{IEEE} Conference on Computer Vision and Pattern Recognition, {CVPR}
2019, Long Beach, CA, USA, June 16-20, 2019},
publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {1871--1880},
//publisher = {Computer Vision Foundation / {IEEE}},
year = {2019},
year = {2019}
}
@inproceedings{DBLP:journals/corr/abs-1811-00498,
author = {Ra{\'{u}}l V{\'{a}}zquez and
......@@ -4672,11 +4616,9 @@ author = {Yoshua Bengio and
J{\"{o}}rg Tiedemann and
Mathias Creutz},
title = {Multilingual {NMT} with a Language-Independent Attention Bridge},
publisher = {Proceedings of the 4th Workshop on Representation Learning for NLP,
RepL4NLP@ACL 2019, Florence, Italy, August 2, 2019},
publisher = {Annual Meeting of the Association for Computational Linguistics},
pages = {33--39},
//publisher = {Association for Computational Linguistics},
year = {2019},
year = {2019}
}
@inproceedings{MoradiInterrogating,
author = {Pooya Moradi and
......@@ -4684,11 +4626,9 @@ author = {Yoshua Bengio and
Anoop Sarkar},
title = {Interrogating the Explanatory Power of Attention in Neural Machine
Translation},
publisher = {Proceedings of the 3rd Workshop on Neural Generation and Translation@EMNLP-IJCNLP
2019, Hong Kong, November 4, 2019},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {221--230},
//publisher = {Association for Computational Linguistics},
year = {2019},
year = {2019}
}
@inproceedings{WangNeural,
author = {Xing Wang and
......@@ -4698,11 +4638,9 @@ author = {Yoshua Bengio and
Deyi Xiong and
Min Zhang},
title = {Neural Machine Translation Advised by Statistical Machine Translation},
publisher = {Proceedings of the Thirty-First {AAAI} Conference on Artificial Intelligence,
February 4-9, 2017, San Francisco, California, {USA}},
publisher = {AAAI Conference on Artificial Intelligence},
pages = {3330--3336},
//publisher = {{AAAI} Press},
year = {2017},
year = {2017}
}
@inproceedings{Xiao2019SharingAW,
author = {Tong Xiao and
......@@ -4711,12 +4649,9 @@ author = {Yoshua Bengio and
Zhengtao Yu and
Tongran Liu},
title = {Sharing Attention Weights for Fast Transformer},
publisher = {Proceedings of the Twenty-Eighth International Joint Conference on
Artificial Intelligence, {IJCAI} 2019, Macao, China, August 10-16,
2019},
publisher = {International Joint Conference on Artificial Intelligence},
pages = {5292--5298},
//publisher = {ijcai.org},
year = {2019},
year = {2019}
}
@inproceedings{Yang2017TowardsBH,
author = {Baosong Yang and
......@@ -4726,36 +4661,27 @@ author = {Yoshua Bengio and
Jingbo Zhu},
title = {Towards Bidirectional Hierarchical Representations for Attention-based
Neural Machine Translation},
publisher = {Proceedings of the 2017 Conference on Empirical Methods in Natural
Language Processing, {EMNLP} 2017, Copenhagen, Denmark, September
9-11, 2017},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {1432--1441},
//publisher = {Association for Computational Linguistics},
year = {2017},
year = {2017}
}
@inproceedings{Wang2019TreeTI,
author = {Yau-Shian Wang and
Hung-yi Lee and
Yun-Nung Chen},
title = {Tree Transformer: Integrating Tree Structures into Self-Attention},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {1061--1070},
year = {2019}
}
@inproceedings{DBLP:journals/corr/abs-1809-01854,
author = {Jetic Gu and
Hassan S. Shavarani and
Anoop Sarkar},
title = {Top-down Tree Structured Decoding with Syntactic Connections for Neural Machine Translation and Parsing},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {401--413},
year = {2018}
}
@inproceedings{DBLP:journals/corr/abs-1808-09374,
author = {Xinyi Wang and
......@@ -4763,11 +4689,9 @@ author = {Yoshua Bengio and
Pengcheng Yin and
Graham Neubig},
title = {A Tree-based Decoder for Neural Machine Translation},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {4772--4777},
year = {2018}
}
@article{DBLP:journals/corr/ZhangZ16c,
author = {Jiajun Zhang and
......@@ -4775,7 +4699,7 @@ author = {Yoshua Bengio and
title = {Bridging Neural Machine Translation and Bilingual Dictionaries},
journal = {CoRR},
volume = {abs/1610.07272},
year = {2016}
}
@article{Dai2019TransformerXLAL,
author = {Zihang Dai and
......@@ -4787,7 +4711,7 @@ author = {Yoshua Bengio and
title = {Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context},
journal = {CoRR},
volume = {abs/1901.02860},
year = {2019}
}
@inproceedings{li-etal-2019-word,
author = {Xintong Li and
......@@ -4796,12 +4720,9 @@ author = {Yoshua Bengio and
Max Meng and
Shuming Shi},
title = {On the Word Alignment from Neural Machine Translation},
publisher = {Annual Meeting of the Association for Computational Linguistics},
pages = {1293--1303},
year = {2019}
}
@inproceedings{Werlen2018DocumentLevelNM,
......@@ -4811,11 +4732,9 @@ author = {Yoshua Bengio and
James Henderson},
title = {Document-Level Neural Machine Translation with Hierarchical Attention
Networks},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {2947--2954},
year = {2018}
}
@inproceedings{DBLP:journals/corr/abs-1805-10163,
author = {Elena Voita and
......@@ -4823,12 +4742,9 @@ author = {Yoshua Bengio and
Rico Sennrich and
Ivan Titov},
title = {Context-Aware Neural Machine Translation Learns Anaphora Resolution},
publisher = {Annual Meeting of the Association for Computational Linguistics},
pages = {1264--1274},
year = {2018}
}
@article{DBLP:journals/corr/abs-1906-00532,
author = {Aishwarya Bhandare and
......@@ -4842,7 +4758,7 @@ author = {Yoshua Bengio and
Translation Model},
journal = {CoRR},
volume = {abs/1906.00532},
year = {2019}
}
@inproceedings{Zhang2018SpeedingUN,
......@@ -4852,22 +4768,18 @@ author = {Yoshua Bengio and
Lei Shen and
Qun Liu},
title = {Speeding Up Neural Machine Translation Decoding by Cube Pruning},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {4284--4294},
year = {2018}
}
@inproceedings{DBLP:journals/corr/SeeLM16,
author = {Abigail See and
Minh-Thang Luong and
Christopher D. Manning},
title = {Compression of Neural Machine Translation Models via Pruning},
publisher = {Conference on Computational Natural Language Learning},
pages = {291--301},
year = {2016}
}
@inproceedings{DBLP:journals/corr/ChenLCL17,
author = {Yun Chen and
......@@ -4875,12 +4787,9 @@ author = {Yoshua Bengio and
Yong Cheng and
Victor O. K. Li},
title = {A Teacher-Student Framework for Zero-Resource Neural Machine Translation},
publisher = {Annual Meeting of the Association for Computational Linguistics},
pages = {1925--1935},
year = {2017}
}
@article{Hinton2015Distilling,
author = {Geoffrey E. Hinton and
......@@ -4889,13 +4798,13 @@ author = {Yoshua Bengio and
title = {Distilling the Knowledge in a Neural Network},
journal = {CoRR},
volume = {abs/1503.02531},
year = {2015}
}
@inproceedings{Ott2018ScalingNM,
title={Scaling Neural Machine Translation},
author={Myle Ott and Sergey Edunov and David Grangier and M. Auli},
publisher={Annual Meeting of the Association for Computational Linguistics},
year={2018}
}
@inproceedings{Lin2020TowardsF8,
......@@ -4915,7 +4824,7 @@ author = {Yoshua Bengio and
Alexander M. Rush},
title = {Sequence-Level Knowledge Distillation},
pages = {1317--1327},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@article{Akaike1969autoregressive,
......@@ -4946,13 +4855,13 @@ author = {Yoshua Bengio and
title = {The Best of Both Worlds: Combining Recent Advances in Neural Machine
Translation},
pages = {76--86},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{He2018LayerWiseCB,
title={Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation},
author={Tianyu He and X. Tan and Yingce Xia and D. He and T. Qin and Zhibo Chen and T. Liu},
publisher={Conference on Neural Information Processing Systems},
year={2018}
}
@inproceedings{cho-etal-2014-properties,
......@@ -4962,7 +4871,7 @@ author = {Yoshua Bengio and
Yoshua Bengio},
title = {On the Properties of Neural Machine Translation: Encoder-Decoder Approaches},
pages = {103--111},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2014}
}
......@@ -4973,7 +4882,7 @@ author = {Yoshua Bengio and
Yoshua Bengio},
title = {On Using Very Large Target Vocabulary for Neural Machine Translation},
pages = {1--10},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2015}
}
......@@ -4982,8 +4891,7 @@ author = {Yoshua Bengio and
Hieu Pham and
Christopher D. Manning},
title = {Effective Approaches to Attention-based Neural Machine Translation},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {1412--1421},
year = {2015}
}
......@@ -4994,7 +4902,7 @@ author = {Yoshua Bengio and
Haifeng Wang},
title = {Improved Neural Machine Translation with {SMT} Features},
pages = {151--157},
publisher = {AAAI Conference on Artificial Intelligence},
year = {2016}
}
@inproceedings{zhang-etal-2017-prior,
......@@ -5005,7 +4913,7 @@ author = {Yoshua Bengio and
Xu, Jingfang and
Sun, Maosong},
year = {2017},
publisher = {Annual Meeting of the Association for Computational Linguistics},
pages = {1514--1523},
}
......@@ -5021,7 +4929,7 @@ author = {Yoshua Bengio and
title = {Bilingual Dictionary Based Neural Machine Translation without Using
Parallel Sentences},
pages = {1570--1579},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
......@@ -5030,7 +4938,7 @@ author = {Yoshua Bengio and
Deyi Xiong},
title = {Encoding Gated Translation Memory into Neural Machine Translation},
pages = {3042--3047},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{yang-etal-2016-hierarchical,
......@@ -5042,7 +4950,7 @@ author = {Yoshua Bengio and
Eduard H. Hovy},
title = {Hierarchical Attention Networks for Document Classification},
pages = {1480--1489},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
%%%%% chapter 10------------------------------------------------------
......@@ -5056,7 +4964,7 @@ author = {Yoshua Bengio and
Douwe Kiela},
title = {Code-Switched Named Entity Recognition with Embedding Attention},
pages = {154--158},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
......@@ -5069,7 +4977,7 @@ author = {Yoshua Bengio and
title = {Leveraging Linguistic Structures for Named Entity Recognition with
Bidirectional Recursive Neural Networks},
pages = {2664--2669},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
......@@ -5077,7 +4985,7 @@ author = {Yoshua Bengio and
author = {Xuezhe Ma and
Eduard H. Hovy},
title = {End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
......@@ -5088,7 +4996,7 @@ author = {Yoshua Bengio and
Andrew McCallum},
title = {Fast and Accurate Entity Recognition with Iterated Dilated Convolutions},
pages = {2670--2680},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
......@@ -5107,26 +5015,21 @@ author = {Yoshua Bengio and
year = {2017}
}
@article{2011Natural,
title={Natural Language Processing (almost) from Scratch},
author={Collobert, Ronan and Weston, Jason and Bottou, L{\'{e}}on and Karlen, Michael and Kavukcuoglu, Koray and Kuksa, Pavel},
journal={Journal of Machine Learning Research},
volume={12},
number={1},
pages={2493--2537},
year={2011}
}
@inproceedings{DBLP:conf/acl/NguyenG15,
author = {Thien Huu Nguyen and
Ralph Grishman},
title = {Event Detection and Domain Adaptation with Convolutional Neural Networks},
pages = {365--371},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2015}
}
......@@ -5137,7 +5040,7 @@ author = {Yoshua Bengio and
Jun Zhao},
title = {Recurrent Convolutional Neural Networks for Text Classification},
pages = {2267--2273},
publisher = {AAAI Conference on Artificial Intelligence},
year = {2015}
}
......@@ -5149,7 +5052,7 @@ author = {Yoshua Bengio and
Jun Zhao},
title = {Event Extraction via Dynamic Multi-Pooling Convolutional Neural Networks},
pages = {167--176},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2015}
}
......@@ -5159,7 +5062,7 @@ author = {Yoshua Bengio and
Tommi S. Jaakkola},
title = {Molding CNNs for text: non-linear, non-consecutive convolutions},
pages = {1565--1575},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2015}
}
......@@ -5169,7 +5072,7 @@ author = {Yoshua Bengio and
title = {Effective Use of Word Order for Text Categorization with Convolutional
Neural Networks},
pages = {103--112},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2015}
}
......@@ -5178,14 +5081,14 @@ author = {Yoshua Bengio and
Ralph Grishman},
title = {Relation Extraction: Perspective from Convolutional Neural Networks},
pages = {39--48},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2015}
}
@article{StahlbergNeural,
title={Neural Machine Translation: A Review},
author={Felix Stahlberg},
journal={Journal of Artificial Intelligence Research},
year={2020},
volume={69},
pages={343-418}
......@@ -5211,7 +5114,7 @@ author = {Yoshua Bengio and
@article{Waibel1989PhonemeRU,
title={Phoneme recognition using time-delay neural networks},
author={Alexander H. Waibel and Toshiyuki Hanazawa and Geoffrey E. Hinton and K. Shikano and K. Lang},
journal={IEEE Transactions on Acoustics, Speech, and Signal Processing},
year={1989},
volume={37},
pages={328-339}
......@@ -5226,7 +5129,7 @@ author = {Yoshua Bengio and
pages={541-551}
}
@article{726791,
author={Y. {Lecun} and L. {Bottou} and Y. {Bengio} and P. {Haffner}},
journal={Proceedings of the IEEE},
title={Gradient-based learning applied to document recognition},
......@@ -5234,7 +5137,6 @@ author = {Yoshua Bengio and
volume={86},
number={11},
pages={2278-2324},
}
@inproceedings{DBLP:journals/corr/HeZRS15,
......@@ -5262,7 +5164,7 @@ author = {Yoshua Bengio and
@article{Girshick2015FastR,
title={Fast R-CNN},
author={Ross B. Girshick},
journal={International Conference on Computer Vision},
year={2015},
pages={1440-1448}
}
......@@ -5279,7 +5181,7 @@ author = {Yoshua Bengio and
@inproceedings{Kalchbrenner2014ACN,
title={A Convolutional Neural Network for Modelling Sentences},
author={Nal Kalchbrenner and Edward Grefenstette and P. Blunsom},
publisher={Annual Meeting of the Association for Computational Linguistics},
pages={655--665},
year={2014}
}
......@@ -5287,7 +5189,7 @@ author = {Yoshua Bengio and
@inproceedings{Kim2014ConvolutionalNN,
title={Convolutional Neural Networks for Sentence Classification},
author={Yoon Kim},
publisher={Conference on Empirical Methods in Natural Language Processing},
pages = {1746--1751},
year={2014}
}
......@@ -5299,7 +5201,7 @@ author = {Yoshua Bengio and
Bowen Zhou and
Bing Xiang},
pages = {174--179},
publisher={Annual Meeting of the Association for Computational Linguistics},
year={2015}
}
......@@ -5308,7 +5210,7 @@ author = {Yoshua Bengio and
author = {C{\'{\i}}cero Nogueira dos Santos and
Maira Gatti},
pages = {69--78},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year={2014}
}
......@@ -5318,7 +5220,7 @@ author = {Yoshua Bengio and
Angela Fan and
Michael Auli and
David Grangier},
publisher={International Conference on Machine Learning},
volume = {70},
pages = {933--941},
year={2017}
......@@ -5330,7 +5232,7 @@ author = {Yoshua Bengio and
Michael Auli and
David Grangier and
Yann N. Dauphin},
publisher={Annual Meeting of the Association for Computational Linguistics},
pages = {123--135},
year={2017}
}
......@@ -5353,7 +5255,7 @@ author = {Yoshua Bengio and
author = {Lukasz Kaiser and
Aidan N. Gomez and
Fran{\c{c}}ois Chollet},
journal = {International Conference on Learning Representations},
year={2018},
}
......@@ -5364,7 +5266,7 @@ author = {Yoshua Bengio and
Yann N. Dauphin and
Michael Auli},
title = {Pay Less Attention with Lightweight and Dynamic Convolutions},
publisher = {International Conference on Learning Representations},
year = {2019},
}
......@@ -5421,7 +5323,7 @@ author = {Yoshua Bengio and
Shaoqing Ren and
Jian Sun},
title = {Deep Residual Learning for Image Recognition},
publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {770--778},
year = {2016},
}
......@@ -5432,26 +5334,26 @@ author = {Yoshua Bengio and
Arthur Szlam and
Jason Weston and
Rob Fergus},
publisher={Conference on Neural Information Processing Systems},
pages = {2440--2448},
year={2015}
}
@inproceedings{Islam2020HowMP,
author = {Md. Amirul Islam and
Sen Jia and
Neil D. B. Bruce},
title = {How much Position Information Do Convolutional Neural Networks Encode?},
publisher = {International Conference on Learning Representations},
year = {2020}
}
@inproceedings{Sutskever2013OnTI,
title={On the importance of initialization and momentum in deep learning},
author = {Ilya Sutskever and
James Martens and
George E. Dahl and
Geoffrey E. Hinton},
publisher = {International Conference on Machine Learning},
pages = {1139--1147},
year={2013}
}
......@@ -5459,7 +5361,7 @@ author = {Yoshua Bengio and
@article{Bengio2013AdvancesIO,
title={Advances in optimizing recurrent networks},
author={Yoshua Bengio and Nicolas Boulanger-Lewandowski and Razvan Pascanu},
journal={IEEE International Conference on Acoustics, Speech and Signal Processing},
year={2013},
pages={8624-8628}
}
......@@ -5476,7 +5378,7 @@ author = {Yoshua Bengio and
@article{Chollet2017XceptionDL,
title={Xception: Deep Learning with Depthwise Separable Convolutions},
author = {Fran{\c{c}}ois Chollet},
journal={IEEE Conference on Computer Vision and Pattern Recognition},
year={2017},
pages={1800-1807}
}
......@@ -5512,7 +5414,7 @@ author = {Yoshua Bengio and
title={Rotation, Scaling and Deformation Invariant Scattering for Texture Discrimination},
author = {Laurent Sifre and
St{\'{e}}phane Mallat},
journal={IEEE Conference on Computer Vision and Pattern Recognition},
year={2013},
pages={1233-1240}
}
......@@ -5520,7 +5422,7 @@ author = {Yoshua Bengio and
@article{Taigman2014DeepFaceCT,
title={DeepFace: Closing the Gap to Human-Level Performance in Face Verification},
author={Yaniv Taigman and Ming Yang and Marc'Aurelio Ranzato and Lior Wolf},
journal={IEEE Conference on Computer Vision and Pattern Recognition},
year={2014},
pages={1701-1708}
}
......@@ -5533,7 +5435,7 @@ author = {Yoshua Bengio and
Mirk{\'{o}} Visontai and
Raziel Alvarez and
Carolina Parada},
publisher={Conference of the International Speech Communication Association},
pages = {1136--1140},
year={2015}
}
......@@ -5546,7 +5448,7 @@ author = {Yoshua Bengio and
Dongdong Chen and
Lu Yuan and
Zicheng Liu},
journal = {IEEE Conference on Computer Vision and Pattern Recognition},
year={2020},
pages={11027-11036}
}
......@@ -5563,7 +5465,7 @@ author = {Yoshua Bengio and
Chloe Hillier and
Timothy P. Lillicrap},
title = {Compressive Transformers for Long-Range Sequence Modelling},
publisher = {International Conference on Learning Representations},
year = {2020}
}
......@@ -5597,7 +5499,7 @@ author = {Yoshua Bengio and
Yujun Lin and
Song Han},
title = {Lite Transformer with Long-Short Range Attention},
publisher = {International Conference on Learning Representations},
year = {2020}
}
......@@ -5610,7 +5512,7 @@ author = {Yoshua Bengio and
title = {Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy
Lifting, the Rest Can Be Pruned},
pages = {5797--5808},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019},
}
......@@ -5623,7 +5525,7 @@ author = {Yoshua Bengio and
Bowen Zhou and
Yoshua Bengio},
title = {A Structured Self-Attentive Sentence Embedding},
publisher = {International Conference on Learning Representations},
year = {2017},
}
@inproceedings{Shaw2018SelfAttentionWR,
......@@ -5631,8 +5533,8 @@ author = {Yoshua Bengio and
Jakob Uszkoreit and
Ashish Vaswani},
title = {Self-Attention with Relative Position Representations},
publisher = {Proceedings of the Human Language Technology Conference of
the North American Chapter of the Association for Computational Linguistics},
pages = {464--468},
year = {2018},
}
......@@ -5642,7 +5544,7 @@ author = {Yoshua Bengio and
Shaoqing Ren and
Jian Sun},
title = {Deep Residual Learning for Image Recognition},
publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {770--778},
year = {2016},
}
......@@ -5661,7 +5563,7 @@ author = {Yoshua Bengio and
Jonathon Shlens and
Zbigniew Wojna},
title = {Rethinking the Inception Architecture for Computer Vision},
publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {2818--2826},
year = {2016},
}
......@@ -5670,8 +5572,7 @@ author = {Yoshua Bengio and
Deyi Xiong and
Jinsong Su},
title = {Accelerating Neural Transformer via an Average Attention Network},
publisher = {Annual Meeting of the Association for Computational Linguistics},
pages = {1789--1798},
year = {2018},
}
......@@ -5691,7 +5592,7 @@ author = {Yoshua Bengio and
Yann N. Dauphin and
Michael Auli},
title = {Pay Less Attention with Lightweight and Dynamic Convolutions},
publisher = {International Conference on Learning Representations},
year = {2019},
}
......@@ -5704,7 +5605,7 @@ author = {Yoshua Bengio and
Ruslan Salakhutdinov},
title = {Transformer-XL: Attentive Language Models beyond a Fixed-Length Context},
pages = {2978--2988},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@article{Liu2020LearningTE,
......@@ -5729,7 +5630,7 @@ author = {Yoshua Bengio and
Tong Zhang},
title = {Modeling Localness for Self-Attention Networks},
pages = {4449--4458},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:journals/corr/abs-1904-03107,
......@@ -5740,7 +5641,7 @@ author = {Yoshua Bengio and
Zhaopeng Tu},
title = {Convolutional Self-Attention Networks},
pages = {4040--4045},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019},
}
@article{Wang2018MultilayerRF,
......@@ -5759,7 +5660,7 @@ author = {Yoshua Bengio and
title = {Training Deeper Neural Machine Translation Models with Transparent
Attention},
pages = {3028--3033},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{Dou2018ExploitingDR,
......@@ -5770,7 +5671,7 @@ author = {Yoshua Bengio and
Tong Zhang},
title = {Exploiting Deep Representations for Neural Machine Translation},
pages = {4253--4262},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{Wang2019ExploitingSC,
......@@ -5789,13 +5690,13 @@ author = {Yoshua Bengio and
Tong Zhang},
title = {Dynamic Layer Aggregation for Neural Machine Translation with Routing-by-Agreement},
pages = {86--93},
publisher = {AAAI Conference on Artificial Intelligence},
year = {2019}
}
@inproceedings{Wei2020MultiscaleCD,
title={Multiscale Collaborative Deep Models for Neural Machine Translation},
author={Xiangpeng Wei and Heng Yu and Yue Hu and Yue Zhang and Rongxiang Weng and Weihua Luo},
publisher={Annual Meeting of the Association for Computational Linguistics},
year={2020}
}
......@@ -5824,7 +5725,7 @@ author = {Yoshua Bengio and
Lukasz Kaiser and
Anselm Levskaya},
title = {Reformer: The Efficient Transformer},
journal = {International Conference on Learning Representations},
year = {2020}
}
......@@ -5839,7 +5740,7 @@ author = {Yoshua Bengio and
@article{li2020shallow,
title={Shallow-to-Deep Training for Neural Machine Translation},
author={Li, Bei and Wang, Ziyang and Liu, Hui and Jiang, Yufan and Du, Quan and Xiao, Tong and Wang, Huizhen and Zhu, Jingbo},
journal={Conference on Empirical Methods in Natural Language Processing},
year={2020}
}
%%%%% chapter 12------------------------------------------------------
......@@ -6673,15 +6574,7 @@ author = {Yoshua Bengio and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/naacl/MohiuddinJ19,
author = {Tasnim Mohiuddin and
Shafiq R. Joty},
title = {Revisiting Adversarial Autoencoder for Unsupervised Word Translation
with Cycle Consistency and Improved Training},
pages = {3857--3867},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@article{DBLP:journals/corr/abs-1811-01124,
author = {Jean Alaux and
Edouard Grave and
......@@ -6896,394 +6789,6 @@ author = {Yoshua Bengio and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@article{2019ADabre,
title={A Survey of Multilingual Neural Machine Translation},
author={Dabre, Raj and Chu, Chenhui and Kunchukuttan, Anoop },
year={2019},
}
@inproceedings{DBLP:conf/naacl/ZophK16,
author = {Barret Zoph and
Kevin Knight},
title = {Multi-Source Neural Translation},
pages = {30--34},
publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{DBLP:conf/naacl/FiratCB16,
author = {Orhan Firat and
Kyunghyun Cho and
Yoshua Bengio},
title = {Multi-Way, Multilingual Neural Machine Translation with a Shared Attention
Mechanism},
pages = {866--875},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@article{DBLP:journals/tacl/JohnsonSLKWCTVW17,
author = {Melvin Johnson and
Mike Schuster and
Quoc V. Le and
Maxim Krikun and
Yonghui Wu and
Zhifeng Chen and
Nikhil Thorat and
Fernanda B. Vi{\'{e}}gas and
Martin Wattenberg and
Greg Corrado and
Macduff Hughes and
Jeffrey Dean},
title = {Google's Multilingual Neural Machine Translation System: Enabling
Zero-Shot Translation},
journal = {Transactions of the Association for Computational Linguistics},
volume = {5},
pages = {339--351},
year = {2017}
}
@inproceedings{DBLP:conf/emnlp/KimPPKN19,
author = {Yunsu Kim and
Petre Petrov and
Pavel Petrushkov and
Shahram Khadivi and
Hermann Ney},
title = {Pivot-based Transfer Learning for Neural Machine Translation between
Non-English Languages},
pages = {866--876},
publisher = {Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/acl/ChenLCL17,
author = {Yun Chen and
Yang Liu and
Yong Cheng and
Victor O. K. Li},
title = {A Teacher-Student Framework for Zero-Resource Neural Machine Translation},
pages = {1925--1935},
publisher = {Association for Computational Linguistics},
year = {2017}
}
@article{DBLP:journals/mt/WuW07,
author = {Hua Wu and
Haifeng Wang},
title = {Pivot language approach for phrase-based statistical machine translation},
journal = {Machine Translation},
volume = {21},
number = {3},
pages = {165--181},
year = {2007}
}
@article{Farsi2010somayeh,
author = {Somayeh Bakhshaei and Shahram Khadivi and Noushin Riahi},
title = {Farsi-German statistical machine translation through bridge language},
publisher = {International Telecommunications Symposium},
pages = {165--181},
year = {2010}
}
@inproceedings{DBLP:conf/acl/ZahabiBK13,
author = {Samira Tofighi Zahabi and
Somayeh Bakhshaei and
Shahram Khadivi},
title = {Using Context Vectors in Improving a Machine Translation System with
Bridge Language},
pages = {318--322},
publisher = {The Association for Computer Linguistics},
year = {2013}
}
@inproceedings{DBLP:conf/emnlp/ZhuHWZWZ14,
author = {Xiaoning Zhu and
Zhongjun He and
Hua Wu and
Conghui Zhu and
Haifeng Wang and
Tiejun Zhao},
title = {Improving Pivot-Based Statistical Machine Translation by Pivoting
the Co-occurrence Count of Phrase Pairs},
pages = {1665--1675},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2014}
}
@inproceedings{DBLP:conf/acl/MiuraNSTN15,
author = {Akiva Miura and
Graham Neubig and
Sakriani Sakti and
Tomoki Toda and
Satoshi Nakamura},
title = {Improving Pivot Translation by Remembering the Pivot},
pages = {573--577},
publisher = {The Association for Computer Linguistics},
year = {2015}
}
@inproceedings{DBLP:conf/acl/CohnL07,
author = {Trevor Cohn and
Mirella Lapata},
title = {Machine Translation by Triangulation: Making Effective Use of Multi-Parallel
Corpora},
publisher = {The Association for Computational Linguistics},
year = {2007}
}
@inproceedings{DBLP:conf/acl/WuW09,
author = {Hua Wu and
Haifeng Wang},
title = {Revisiting Pivot Language Approach for Machine Translation},
pages = {154--162},
publisher = {The Association for Computer Linguistics},
year = {2009}
}
@article{DBLP:journals/corr/ChengLYSX16,
author = {Yong Cheng and
Yang Liu and
Qian Yang and
Maosong Sun and
Wei Xu},
title = {Neural Machine Translation with Pivot Languages},
journal = {CoRR},
volume = {abs/1611.04928},
year = {2016}
}
@inproceedings{DBLP:conf/interspeech/KauersVFW02,
author = {Manuel Kauers and
Stephan Vogel and
Christian F{\"{u}}gen and
Alex Waibel},
title = {Interlingua based statistical machine translation},
publisher = {International Symposium on Computer Architecture},
year = {2002}
}
@inproceedings{de2006catalan,
title={Catalan-English statistical machine translation without parallel corpus: bridging through Spanish},
author={De Gispert, Adri{\`a} and Marino, Jose B},
booktitle={Proc. of 5th International Conference on Language Resources and Evaluation (LREC)},
pages={65--68},
year={2006}
}
@inproceedings{DBLP:conf/naacl/UtiyamaI07,
author = {Masao Utiyama and
Hitoshi Isahara},
title = {A Comparison of Pivot Methods for Phrase-Based Statistical Machine
Translation},
pages = {484--491},
publisher = {The Association for Computational Linguistics},
year = {2007}
}
@inproceedings{DBLP:conf/ijcnlp/Costa-JussaHB11,
author = {Marta R. Costa-juss{\`{a}} and
Carlos A. Henr{\'{\i}}quez Q. and
Rafael E. Banchs},
title = {Enhancing scarce-resource language translation through pivot combinations},
pages = {1361--1365},
publisher = {The Association for Computer Linguistics},
year = {2011}
}
@article{DBLP:journals/corr/HintonVD15,
author = {Geoffrey E. Hinton and
Oriol Vinyals and
Jeffrey Dean},
title = {Distilling the Knowledge in a Neural Network},
journal = {CoRR},
volume = {abs/1503.02531},
year = {2015}
}
@article{gu2018meta,
title={Meta-learning for low-resource neural machine translation},
author={Gu, Jiatao and Wang, Yong and Chen, Yun and Cho, Kyunghyun and Li, Victor OK},
journal={CoRR},
volume={abs/1808.08437},
year={2018}
}
@inproceedings{DBLP:conf/naacl/GuHDL18,
author = {Jiatao Gu and
Hany Hassan and
Jacob Devlin and
Victor O. K. Li},
title = {Universal Neural Machine Translation for Extremely Low Resource Languages},
pages = {344--354},
publisher = {Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:conf/icml/FinnAL17,
author = {Chelsea Finn and
Pieter Abbeel and
Sergey Levine},
title = {Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks},
series = {Proceedings of Machine Learning Research},
volume = {70},
pages = {1126--1135},
publisher = {International Conference on Machine Learning},
year = {2017}
}
@inproceedings{DBLP:conf/acl/DongWHYW15,
author = {Daxiang Dong and
Hua Wu and
Wei He and
Dianhai Yu and
Haifeng Wang},
title = {Multi-Task Learning for Multiple Language Translation},
pages = {1723--1732},
publisher = {The Association for Computer Linguistics},
year = {2015}
}
@article{DBLP:journals/tacl/LeeCH17,
author = {Jason Lee and
Kyunghyun Cho and
Thomas Hofmann},
title = {Fully Character-Level Neural Machine Translation without Explicit
Segmentation},
journal = {Transactions of the Association for Computational Linguistics},
volume = {5},
pages = {365--378},
year = {2017}
}
@inproceedings{DBLP:conf/lrec/RiktersPK18,
author = {Matiss Rikters and
Marcis Pinnis and
Rihards Krislauks},
title = {Training and Adapting Multilingual {NMT} for Less-resourced and Morphologically
Rich Languages},
publisher = {European Language Resources Association},
year = {2018}
}
@article{DBLP:journals/tkde/PanY10,
author = {Sinno Jialin Pan and
Qiang Yang},
title = {A Survey on Transfer Learning},
journal = {IEEE Transactions on Knowledge and Data Engineering},
volume = {22},
number = {10},
pages = {1345--1359},
year = {2010}
}
@book{2009Handbook,
title={Handbook Of Research On Machine Learning Applications and Trends: Algorithms, Methods and Techniques - 2 Volumes},
author={ Olivas, Emilio Soria and Guerrero, Jose David Martin and Sober, Marcelino Martinez and Benedito, Jose Rafael Magdalena and Lopez, Antonio Jose Serrano },
publisher={Information Science Reference - Imprint of: IGI Publishing},
year={2009},
}
@incollection{DBLP:books/crc/aggarwal14/Pan14,
author = {Sinno Jialin Pan},
title = {Transfer Learning},
booktitle = {Data Classification: Algorithms and Applications},
pages = {537--570},
publisher = {{CRC} Press},
year = {2014}
}
@inproceedings{DBLP:conf/iclr/TanRHQZL19,
author = {Xu Tan and
Yi Ren and
Di He and
Tao Qin and
Zhou Zhao and
Tie-Yan Liu},
title = {Multilingual Neural Machine Translation with Knowledge Distillation},
publisher = {International Conference on Learning Representations},
year = {2019}
}
@article{platanios2018contextual,
title={Contextual parameter generation for universal neural machine translation},
author={Platanios, Emmanouil Antonios and Sachan, Mrinmaya and Neubig, Graham and Mitchell, Tom},
journal={CoRR},
volume={abs/1808.08493},
year={2018}
}
@inproceedings{ji2020cross,
title={Cross-Lingual Pre-Training Based Transfer for Zero-Shot Neural Machine Translation},
author={Ji, Baijun and Zhang, Zhirui and Duan, Xiangyu and Zhang, Min and Chen, Boxing and Luo, Weihua},
publisher={AAAI Conference on Artificial Intelligence},
volume={34},
number={01},
pages={115--122},
year={2020}
}
@inproceedings{DBLP:conf/wmt/KocmiB18,
author = {Tom Kocmi and
Ondrej Bojar},
title = {Trivial Transfer Learning for Low-Resource Neural Machine Translation},
pages = {244--252},
publisher = {Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:conf/acl/ZhangWTS20,
author = {Biao Zhang and
Philip Williams and
Ivan Titov and
Rico Sennrich},
title = {Improving Massively Multilingual Neural Machine Translation and Zero-Shot
Translation},
pages = {1628--1639},
publisher = {Association for Computational Linguistics},
year = {2020}
}
@inproceedings{DBLP:conf/naacl/PaulYSN09,
author = {Michael Paul and
Hirofumi Yamamoto and
Eiichiro Sumita and
Satoshi Nakamura},
title = {On the Importance of Pivot Language Selection for Statistical Machine
Translation},
pages = {221--224},
publisher = {The Association for Computational Linguistics},
year = {2009}
}
@article{dabre2019brief,
title={A Brief Survey of Multilingual Neural Machine Translation},
author={Dabre, Raj and Chu, Chenhui and Kunchukuttan, Anoop},
journal={CoRR},
volume={abs/1905.05395},
year={2019}
}
@article{dabre2020survey,
title={A survey of multilingual neural machine translation},
author={Dabre, Raj and Chu, Chenhui and Kunchukuttan, Anoop},
journal={ACM Computing Surveys (CSUR)},
volume={53},
number={5},
pages={1--38},
year={2020}
}
@inproceedings{DBLP:conf/emnlp/VulicGRK19,
author = {Ivan Vulic and
Goran Glavas and
Roi Reichart and
Anna Korhonen},
title = {Do We Really Need Fully Unsupervised Cross-Lingual Embeddings?},
pages = {4406--4417},
publisher = {Association for Computational Linguistics},
year = {2019}
}
@article{DBLP:journals/corr/MikolovLS13,
author = {Tomas Mikolov and
Quoc V. Le and
Ilya Sutskever},
title = {Exploiting Similarities among Languages for Machine Translation},
journal = {CoRR},
volume = {abs/1309.4168},
year = {2013}
}
%%%%% chapter 16------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
......