Commit dcdd721a by 曹润柘

Merge branch 'caorunzhe' into 'master'

Caorunzhe

View merge request !222
parents 1488ff18 4950b5b5
......@@ -191,23 +191,25 @@ NMT & 21.7 & 18.7 & -13.7 \\
\end{table}
%----------------------------------------------
\parinterval 在最近两年,神经机器翻译的发展更加迅速,新的模型、方法层出不穷。表\ref{tab:10-3}给出了到2019年为止一些主流的神经机器翻译模型的对比\upcite{WangLearning}{\color{red} 是否可以把2020年的工作加上,因为书是明年出版})。可以看到,相比2017 年,2018-2019年中机器翻译仍然有明显的进步({\color{red} 到2020年???}
\parinterval 在最近两年,神经机器翻译的发展更加迅速,新的模型、方法层出不穷。表\ref{tab:10-3}给出了到2020年为止一些主流的神经机器翻译模型的对比。可以看到,相比2017年,2018-2020年中机器翻译仍然有明显的进步
\vspace{0.5em}%全局布局使用
%----------------------------------------------
\begin{table}[htp]
\centering
\caption{WMT14英德数据集上不同神经机器翻译系统的表现\upcite{WangLearning}}
\caption{WMT14英德数据集上不同神经机器翻译系统的表现}
\label{tab:10-3}
\begin{tabular}{ l | l l l}
模型 &作者 & 年份 & BLEU[\%] \\ \hline
ConvS2S &Gehring等 &2017 &25.2 \\
Transformer-Base &Vaswani等 &2017 &27.3 \\
Transformer-Big &Vaswani等 &2017 &28.4 \\
RNMT+ &Chen等 &2018 &28.5 \\
Layer-Wise Coordination &Xu等 &2018 &29.0 \\
Transformer-RPR &Shaw等 &2018 &29.2 \\
Transformer-DLCL &Wang等 &2019 &29.3 \\
ConvS2S \upcite{DBLP:journals/corr/GehringAGYD17} &Gehring等 &2017 &25.2 \\
Transformer-Base \upcite{vaswani2017attention} &Vaswani等 &2017 &27.3 \\
Transformer-Big \upcite{vaswani2017attention} &Vaswani等 &2017 &28.4 \\
RNMT+ \upcite{Chen2018TheBO} &Chen等 &2018 &28.5 \\
Layer-Wise Coordination \upcite{He2018LayerWiseCB} &He等 &2018 &29.0 \\
Transformer-RPR \upcite{Shaw2018SelfAttentionWR} &Shaw等 &2018 &29.2 \\
Transformer-DLCL \upcite{Wang2019LearningDT} &Wang等 &2019 &29.3 \\
Msc \upcite{Wei2020MultiscaleCD} &Wei等 &2020 &30.56 \\
\end{tabular}
\end{table}
%----------------------------------------------
......@@ -381,16 +383,16 @@ NMT & 21.7 & 18.7 & -13.7 \\
%----------------------------------------------
\begin{table}[htp]
\centering
\caption{2013-2015期间神经机器翻译方面的部分论文{\color{red} 论文要加引用}}
\caption{2013-2015期间神经机器翻译方面的部分论文}
\label{tab:10-6}
\begin{tabular}{l| l p{8cm}}
\rule{0pt}{16pt} 时间 & 作者 & 论文 \\ \hline
\rule{0pt}{0pt} 2013 & \begin{tabular}[c]{@{}l@{}l@{}}\\Kalchbrenner\\ 和Blunsom\end{tabular} & Recurrent Continuous Translation Models \\
\rule{0pt}{16pt} 2014 & Sutskever等 & Sequence to Sequence Learning with neural networks \\
\rule{0pt}{16pt} 2014 & Bahdanau等 & Neural Machine Translation by Jointly Learning to Align and Translate \\
\rule{0pt}{16pt} 2014 & Cho等 & On the Properties of Neural Machine Translation \\
\rule{0pt}{16pt} 2015 & Jean等 & On Using Very Large Target Vocabulary for Neural Machine Translation \\
\rule{0pt}{16pt} 2015 & Luong等 & Effective Approches to Attention-based Neural Machine Translation
\rule{0pt}{0pt} 2013 & \begin{tabular}[c]{@{}l@{}l@{}}\\Kalchbrenner\\ 和Blunsom\end{tabular} & Recurrent Continuous Translation Models \upcite{kalchbrenner-blunsom-2013-recurrent} \\
\rule{0pt}{16pt} 2014 & Sutskever等 & Sequence to Sequence Learning with Neural Networks \upcite{NIPS2014_5346} \\
\rule{0pt}{16pt} 2014 & Bahdanau等 & Neural Machine Translation by Jointly Learning to Align and Translate \upcite{bahdanau2014neural} \\
\rule{0pt}{16pt} 2014 & Cho等 & On the Properties of Neural Machine Translation \upcite{cho-etal-2014-properties} \\
\rule{0pt}{16pt} 2015 & Jean等 & On Using Very Large Target Vocabulary for Neural Machine Translation \upcite{DBLP:conf/acl/JeanCMB15} \\
\rule{0pt}{16pt} 2015 & Luong等 & Effective Approaches to Attention-based Neural Machine Translation \upcite{luong-etal-2015-effective}
\end{tabular}
\end{table}
%----------------------------------------------
......@@ -660,13 +662,13 @@ $\funp{P}({y_j | \vectorn{\emph{s}}_{j-1} ,y_{j-1},\vectorn{\emph{C}}})$由Softm
\noindent 之所以能想到在横线处填“吃饭”、“吃东西”,很有可能是因为看到了“没/吃饭”、“很/饿”等关键信息,也就是说,这些关键的片段对预测缺失的单词起着关键性作用;而预测“吃饭”与前文中的“中午”、“又”之间的联系似乎不那么紧密。换句话说,在形成“吃饭”这个判断时,潜意识里会更注意“没/吃饭”、“很/饿”等关键信息,即我们的关注度并不是均匀地分布在整个句子上的。
\parinterval 这个现象可以用注意力机制进行解释。注意力机制的概念来源于生物学的一些现象:当待接收的信息过多时,人类会选择性地关注部分信息而忽略其他信息。它在人类的视觉、听觉、嗅觉等方面均有体现,当我们在感受事物时,大脑会自动过滤或衰减部分信息,仅关注其中少数几个部分。例如,当看到图\ref{fig:12-20}时,往往不是“均匀地”看图像中的所有区域,可能最先注意到的是狗头上戴的帽子,然后才会关注图片中其他的部分。那注意力机制是如何解决神经机器翻译的问题呢?下面就一起来看一看。
%----------------------------------------------
\begin{figure}[htp]
\centering
\includegraphics[scale=0.2]{./Chapter12/Figures/dog-hat.jpg}
\caption{戴帽子的狗{\color{red} 这个图是不是也要换}}
\includegraphics[scale=0.05]{./Chapter10/Figures/dog-hat-new.jpg}
\caption{戴帽子的狗}
\label{fig:12-20}
\end{figure}
%----------------------------------------------
......@@ -691,7 +693,7 @@ $\funp{P}({y_j | \vectorn{\emph{s}}_{j-1} ,y_{j-1},\vectorn{\emph{C}}})$由Softm
%----------------------------------------------
\begin{figure}[htp]
\centering
\input{./Chapter12/Figures/figure-attention-of-source-and-target-words}
\input{./Chapter10/Figures/figure-attention-of-source-and-target-words}
\caption{源语言词和目标语言词的关注度}
\label{fig:12-21}
\end{figure}
......@@ -714,7 +716,7 @@ $\funp{P}({y_j | \vectorn{\emph{s}}_{j-1} ,y_{j-1},\vectorn{\emph{C}}})$由Softm
% NEW SUB-SECTION
%----------------------------------------------------------------------------------------
\subsection{上下文向量的计算}
\label{sec:12.1.3}
\label{sec:10.1.3}
\parinterval 神经机器翻译中,注意力机制的核心是:针对不同目标语言单词生成不同的上下文向量。这里,可以将注意力机制看作是一种对接收到的信息的加权处理:对于更重要的信息赋予更高的权重即更高的关注度,对于贡献度较低的信息分配较低的权重,弱化其对结果的影响。这样,$\vectorn{\emph{C}}_j$可以包含更多对当前目标语言位置有贡献的源语言片段的信息。
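\parinterval 例如,可以把 $\vectorn{\emph{C}}_j$ 示意性地写成源语言各位置表示的加权和。这里用 $\vectorn{\emph{h}}_i$ 表示源语言位置 $i$ 的表示,用 $\alpha_{i,j}$ 表示其对应的(归一化的)注意力权重,这些记号仅用于示意:
\begin{displaymath}
\vectorn{\emph{C}}_j = \sum_{i} \alpha_{i,j} \vectorn{\emph{h}}_i , \qquad \sum_{i} \alpha_{i,j} = 1
\end{displaymath}
其中,权重 $\alpha_{i,j}$ 越大,位置 $i$ 的源语言信息对 $\vectorn{\emph{C}}_j$ 的贡献就越大。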
......@@ -1253,11 +1255,9 @@ L(\vectorn{\emph{Y}},\widehat{\vectorn{\emph{Y}}}) = \sum_{j=1}^n L_{\textrm{ce}
\vspace{0.5em}
\item 循环神经网络有很多变种结构。比如,除了RNN、LSTM、GRU,还有其他改进的循环单元结构,如LRN\upcite{DBLP:journals/corr/abs-1905-13324}、SRU\upcite{Lei2017TrainingRA}、ATR\upcite{Zhang2018SimplifyingNM}
\vspace{0.5em}
\item 注意力机制的使用是机器翻译乃至整个自然语言处理近几年获得成功的重要因素之一\upcite{bahdanau2014neural,DBLP:journals/corr/LuongPM15}。早期,有研究者尝试将注意力机制和统计机器翻译的词对齐进行统一\upcite{WangNeural}{\color{red} 不止这一篇,和李北确认一下})。近两年,也有研究已经发现注意力模型可以捕捉一些语言现象\upcite{DBLP:journals/corr/abs-1905-09418}{\color{red} 这一部分应该放到Transformer那一章,如果前面的内容比较少(RNN+attention),可以说一下,attention在其它人任务中的一些应用}),比如,在Transformer 的多头注意力中,不同头往往会捕捉到不同的信息,比如,有些头对低频词更加敏感,有些头更适合词意消歧,甚至有些头可以捕捉句法信息。此外,由于注意力机制增加了模型的复杂性,而且随着网络层数的增多,神经机器翻译中也存在大量的冗余,因此研发轻量的注意力模型也是具有实践意义的方向\upcite{Xiao2019SharingAW}
\item 注意力机制的使用是机器翻译乃至整个自然语言处理近几年获得成功的重要因素之一\upcite{bahdanau2014neural,DBLP:journals/corr/LuongPM15}。早期,有研究者尝试将注意力机制和统计机器翻译的词对齐进行统一\upcite{WangNeural,He2016ImprovedNM,li-etal-2019-word}。
\vspace{0.5em}
\item 一般来说,神经机器翻译的计算过程是没有人工干预的,翻译流程也无法用人类的知识直接进行解释,因此一个有趣的方向是在神经机器翻译中引入先验知识,使得机器翻译的行为更“像”人。比如,可以使用句法树来引入人类的语言学知识\upcite{Yang2017TowardsBH,Wang2019TreeTI},基于句法的神经机器翻译也包含大量的树结构的神经网络建模\upcite{DBLP:journals/corr/abs-1809-01854,DBLP:journals/corr/abs-1808-09374}。此外,也可以把用户定义的词典或者翻译记忆加入到翻译过程来\upcite{DBLP:journals/corr/ZhangZ16c}{\color{red} 应该还有论文,基于先验知识的,一般都会描述词典,清华liuyang他们,还有liuqun老师组都发过相关的,基于先验知识或者词语约束的翻译}),使得用户的约束可以直接反映到机器翻译的结果上来。先验知识的种类还有很多,包括词对齐\upcite{li-etal-2019-word}、 篇章信息\upcite{Werlen2018DocumentLevelNM,DBLP:journals/corr/abs-1805-10163} 等等,都是神经机器翻译中能够使用的信息。
\vspace{0.5em}
\item{\color{red} 这部分感觉放到Transformer那章更加合适,因为很多都是在Transformer上做的})神经机器翻译依赖成本较高的GPU设备,因此对模型的裁剪和加速也是很多系统研发人员所感兴趣的方向。比如,从工程上,可以考虑减少运算强度,比如使用低精度浮点数\upcite{Ott2018ScalingNM} 或者整数\upcite{DBLP:journals/corr/abs-1906-00532,Lin2020TowardsF8}进行计算,或者引入缓存机制来加速模型的推断;也可以通过对模型参数矩阵的剪枝来减小整个模型的体积\upcite{DBLP:journals/corr/SeeLM16};另一种方法是知识精炼\upcite{Hinton2015Distilling,kim-rush-2016-sequence}。 利用大模型训练小模型,这样往往可以得到比单独训练小模型更好的效果\upcite{DBLP:journals/corr/ChenLCL17}
\item 一般来说,神经机器翻译的计算过程是没有人工干预的,翻译流程也无法用人类的知识直接进行解释,因此一个有趣的方向是在神经机器翻译中引入先验知识,使得机器翻译的行为更“像”人。比如,可以使用句法树来引入人类的语言学知识\upcite{Yang2017TowardsBH,Wang2019TreeTI},基于句法的神经机器翻译也包含大量的树结构的神经网络建模\upcite{DBLP:journals/corr/abs-1809-01854,DBLP:journals/corr/abs-1808-09374}。此外,也可以把用户定义的词典或者翻译记忆加入到翻译过程中来\upcite{DBLP:journals/corr/ZhangZ16c,zhang-etal-2017-prior,duan-etal-2020-bilingual,cao-xiong-2018-encoding},使得用户的约束可以直接反映到机器翻译的结果上来。先验知识的种类还有很多,包括词对齐\upcite{li-etal-2019-word}、篇章信息\upcite{Werlen2018DocumentLevelNM,DBLP:journals/corr/abs-1805-10163}等等,都是神经机器翻译中能够使用的信息。
\end{itemize}
......@@ -174,7 +174,7 @@
\end{figure}
%----------------------------------------------
\parinterval 此外,编码端和解码端都有输入的词序列。编码端的词序列输入是为了对其进行表示,进而解码端能从编码端访问到源语言句子的全部信息。解码端的词序列输入是为了进行目标语的生成,本质上它和语言模型是一样的,在得到前$n-1$个单词的情况下输出第$n$个单词。除了输入的词序列的词嵌入,Transformer中也引入了位置嵌入,以表示每个位置信息。原因是,自注意力机制没有显性地对位置进行表示,因此也无法考虑词序。在输入中引入位置信息可以让自注意力机制间接地感受到每个词的位置,进而保证对序列表示的合理性。最终,整个模型的输出由一个Softmax层完成,它和循环神经网络中的输出层是完全一样的\ref{sec:10.3.2}节)
\parinterval 此外,编码端和解码端都有输入的词序列。编码端的词序列输入是为了对其进行表示,进而解码端能从编码端访问到源语言句子的全部信息。解码端的词序列输入是为了进行目标语的生成,本质上它和语言模型是一样的,在得到前$n-1$个单词的情况下输出第$n$个单词。除了输入的词序列的词嵌入,Transformer中也引入了位置嵌入,以表示每个位置信息。原因是,自注意力机制没有显性地对位置进行表示,因此也无法考虑词序。在输入中引入位置信息可以让自注意力机制间接地感受到每个词的位置,进而保证对序列表示的合理性。最终,整个模型的输出由一个Softmax层完成,它和循环神经网络中的输出层是完全一样的。
\parinterval 在进行更详细的介绍前,先利用图\ref{fig:12-39}简单了解一下Transformer模型是如何进行翻译的。首先,Transformer将源语“我\ 很\ 好”的{\small\bfnew{词嵌入}}\index{词嵌入}(Word Embedding)\index{Word Embedding}融合{\small\bfnew{位置编码}}\index{位置编码}(Position Embedding)\index{Position Embedding}后作为输入。然后,编码器对输入的源语句子进行逐层抽象,得到包含丰富的上下文信息的源语表示并传递给解码器。解码器的每一层,使用自注意力子层对输入解码端的表示进行加工,之后再使用编码-解码注意力子层融合源语句子的表示信息,就这样逐词生成目标语译文单词序列。解码器每个位置的输入是当前单词(比如,“I”),而这个位置的输出是下一个单词(比如,“am”),这个设计和标准的神经语言模型是完全一样的。
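\parinterval 这里词嵌入与位置编码的“融合”,通常可以理解为按位置相加(此处仅作示意)。若用 $\textrm{Emb}(w_i)$ 表示位置 $i$ 上单词 $w_i$ 的词嵌入,用 $\textrm{PE}(i)$ 表示该位置的位置编码(这两个记号只是为了说明问题而引入),则模型第 $i$ 个位置的输入可以写作:
\begin{displaymath}
\vectorn{\emph{x}}_i = \textrm{Emb}(w_i) + \textrm{PE}(i)
\end{displaymath}
这样,即使自注意力机制本身不区分位置,输入中也携带了每个单词的位置信息。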
......@@ -270,7 +270,7 @@
\subsection{点乘注意力}
\parinterval\ref{sec:12.1.3}节中已经介绍,自注意力机制中至关重要的是获取相关性系数,也就是在融合不同位置的表示向量时各位置的权重。不同于\ref{sec:12.1.3}介绍的注意力机制的相关性系数计算方式,Transformer模型采用了一种基于点乘的方法来计算相关性系数。这种方法也称为{\small\bfnew{点乘注意力}}\index{点乘注意力}(Scaled Dot-Product Attention)\index{Scaled Dot-Product Attention}机制。它的运算并行度高,同时并不消耗太多的存储空间。
\parinterval\ref{sec:12.1}节中已经介绍,自注意力机制中至关重要的是获取相关性系数,也就是在融合不同位置的表示向量时各位置的权重。不同于第十章介绍的注意力机制的相关性系数计算方式,Transformer模型采用了一种基于点乘的方法来计算相关性系数。这种方法也称为{\small\bfnew{点乘注意力}}\index{点乘注意力}(Scaled Dot-Product Attention)\index{Scaled Dot-Product Attention}机制。它的运算并行度高,同时并不消耗太多的存储空间。
\parinterval 具体来看,在注意力机制的计算过程中,包含三个重要的参数,分别是Query、Key和Value。在下面的描述中,分别用$\vectorn{\emph{Q}}$、$\vectorn{\emph{K}}$、$\vectorn{\emph{V}}$对它们进行表示,其中$\vectorn{\emph{Q}}$和$\vectorn{\emph{K}}$的维度为$L\times d_k$,$\vectorn{\emph{V}}$的维度为$L\times d_v$。这里,$L$为序列的长度,$d_k$和$d_v$分别表示每个Key和Value的大小,通常设置为$d_k=d_v=d_{model}$。
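\parinterval 在上述记号下,点乘注意力的核心计算可以示意性地写作如下形式(这里省略了解码端自注意力所需的掩码操作):
\begin{displaymath}
\textrm{Attention}(\vectorn{\emph{Q}},\vectorn{\emph{K}},\vectorn{\emph{V}}) = \textrm{Softmax}\left(\frac{\vectorn{\emph{Q}}\vectorn{\emph{K}}^{\textrm{T}}}{\sqrt{d_k}}\right)\vectorn{\emph{V}}
\end{displaymath}
即先用$\vectorn{\emph{Q}}$和$\vectorn{\emph{K}}$的点乘计算相关性系数,再经过$\sqrt{d_k}$的缩放与Softmax归一化后,对$\vectorn{\emph{V}}$进行加权。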
......@@ -571,4 +571,17 @@ Transformer Deep(48层) & 30.2 & 43.1 & 194$\times 10^{6}$
%----------------------------------------------------------------------------------------
\section{小结及深入阅读}
\parinterval 编码器-解码器框架提供了一个非常灵活的机制,因为开发者只需要设计编码器和解码器的结构就能完成机器翻译。但是,架构的设计是深度学习中最具挑战的工作,优秀的架构往往需要长时间的探索和大量的实验验证,而且还需要一点点“灵感”。前面介绍的基于循环神经网络的翻译模型和注意力机制就是研究人员通过长期的实践发现的神经网络架构。本章介绍了一个全新的模型\ \dash \ Transformer,并对其中涉及的很多优秀技术进行了讨论。除了基础知识,关于自注意力机制和提高模型性能的技术还有很多可以讨论的地方:
\begin{itemize}
\vspace{0.5em}
\item 近两年,有研究已经发现注意力模型可以捕捉一些语言现象\upcite{DBLP:journals/corr/abs-1905-09418}。比如,在Transformer的多头注意力中,不同头往往会捕捉到不同的信息:有些头对低频词更加敏感,有些头更适合词义消歧,甚至有些头可以捕捉句法信息。此外,由于注意力机制增加了模型的复杂性,而且随着网络层数的增多,神经机器翻译中也存在大量的冗余,因此研发轻量的注意力模型也是具有实践意义的方向\upcite{Xiao2019SharingAW}。
\vspace{0.5em}
\item 神经机器翻译依赖成本较高的GPU设备,因此对模型的裁剪和加速也是很多系统研发人员所感兴趣的方向。比如,从工程角度可以考虑减少运算强度,例如使用低精度浮点数\upcite{Ott2018ScalingNM}或者整数\upcite{DBLP:journals/corr/abs-1906-00532,Lin2020TowardsF8}进行计算,或者引入缓存机制来加速模型的推断\upcite{Vaswani2018Tensor2TensorFN};也可以通过对模型参数矩阵的剪枝来减小整个模型的体积\upcite{DBLP:journals/corr/SeeLM16};另一种方法是知识精炼\upcite{Hinton2015Distilling,kim-rush-2016-sequence},即利用大模型训练小模型,这样往往可以得到比单独训练小模型更好的效果\upcite{DBLP:journals/corr/ChenLCL17}。
\vspace{0.5em}
\item 自注意力网络作为Transformer模型的重要组成部分,近年来受到研究人员的广泛关注,出现了很多尝试用更高效的操作来替代它的工作。比如,利用动态卷积网络来替换编码端与解码端的自注意力网络,在保证推断效率的同时取得了和Transformer相当甚至略好的翻译性能\upcite{Wu2019PayLA};为了提高Transformer处理较长输入文本的效率,利用局部敏感哈希替换自注意力机制的Reformer模型也受到了广泛关注\upcite{Kitaev2020ReformerTE}。此外,在自注意力网络中引入额外的编码信息能够进一步提高模型的表示能力。比如,引入固定窗口大小的相对位置编码信息\upcite{Shaw2018SelfAttentionWR,dai-etal-2019-transformer},或利用动态系统的思想从数据中学习特定的位置编码表示,以获得更好的泛化能力\upcite{Liu2020LearningTE}。通过对Transformer模型中各层输出进行可视化分析,研究人员发现Transformer自底向上各层网络依次聚焦于词级、语法级、语义级的表示\upcite{Jawahar2019WhatDB},因此在底层的自注意力网络中引入局部编码信息有助于模型对局部特征的抽象\upcite{Yang2018ModelingLF,DBLP:journals/corr/abs-1904-03107}。
\vspace{0.5em}
\item 除了针对Transformer中子层的优化,网络各层之间的连接方式在一定程度上也能影响模型的表示能力。近年来,针对网络连接优化的工作包括:在编码端顶部利用平均池化或权重累加等融合手段得到编码端各层的全局表示\upcite{Wang2018MultilayerRF,Bapna2018TrainingDN,Dou2018ExploitingDR,Wang2019ExploitingSC},以及利用之前各层的表示来生成当前层的输入表示\upcite{Wang2019LearningDT,Dou2019DynamicLA,Wei2020MultiscaleCD}。
\end{itemize}
......@@ -4188,34 +4188,24 @@ year = {2012}
John Makhoul},
title = {Fast and Robust Neural Network Joint Models for Statistical Machine
Translation},
publisher = {Proceedings of the 52nd Annual Meeting of the Association for Computational
Linguistics, {ACL} 2014, June 22-27, 2014, Baltimore, MD, USA, Volume
1: Long Papers},
pages = {1370--1380},
//publisher = {The Association for Computer Linguistics},
publisher = {The Association for Computer Linguistics},
year = {2014},
}
@inproceedings{Schwenk_continuousspace,
author = {Holger Schwenk},
title = {Continuous Space Translation Models for Phrase-Based Statistical Machine
Translation},
publisher = {{COLING} 2012, 24th International Conference on Computational Linguistics,
Proceedings of the Conference: Posters, 8-15 December 2012, Mumbai,
India},
pages = {1071--1080},
//publisher = {Indian Institute of Technology Bombay},
publisher = {Indian Institute of Technology Bombay},
year = {2012},
}
@inproceedings{kalchbrenner-blunsom-2013-recurrent,
author = {Nal Kalchbrenner and
Phil Blunsom},
title = {Recurrent Continuous Translation Models},
publisher = {Proceedings of the 2013 Conference on Empirical Methods in Natural
Language Processing, {EMNLP} 2013, 18-21 October 2013, Grand Hyatt
Seattle, Seattle, Washington, USA, {A} meeting of SIGDAT, a Special
Interest Group of the {ACL}},
pages = {1700--1709},
//publisher = {{ACL}},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2013},
}
@article{HochreiterThe,
......@@ -4249,8 +4239,7 @@ pages ={157-166},
Illia Polosukhin},
title = {Attention is All you Need},
publisher = {Advances in Neural Information Processing Systems 30: Annual Conference
on Neural Information Processing Systems 2017, 4-9 December 2017,
Long Beach, CA, {USA}},
on Neural Information Processing Systems},
pages = {5998--6008},
year = {2017},
}
......@@ -4267,11 +4256,8 @@ pages ={157-166},
Mauro Cettolo and
Marcello Federico},
title = {Neural versus Phrase-Based Machine Translation Quality: a Case Study},
publisher = {Proceedings of the 2016 Conference on Empirical Methods in Natural
Language Processing, {EMNLP} 2016, Austin, Texas, USA, November 1-4,
2016},
pages = {257--267},
//publisher = {The Association for Computational Linguistics},
publisher = {The Association for Computational Linguistics},
year = {2016},
}
@article{Hassan2018AchievingHP,
......@@ -4313,11 +4299,8 @@ pages ={157-166},
Derek F. Wong and
Lidia S. Chao},
title = {Learning Deep Transformer Models for Machine Translation},
publisher = {Proceedings of the 57th Conference of the Association for Computational
Linguistics, {ACL} 2019, Florence, Italy, July 28- August 2, 2019,
Volume 1: Long Papers},
pages = {1810--1822},
//publisher = {Association for Computational Linguistics},
publisher = {Association for Computational Linguistics},
year = {2019},
}
@article{Li2020NeuralMT,
......@@ -4757,6 +4740,117 @@ pages ={157-166},
year = {2015},
pages = {243--247},
}
@inproceedings{Chen2018TheBO,
author = {Mia Xu Chen and
Orhan Firat and
Ankur Bapna and
Melvin Johnson and
Wolfgang Macherey and
George F. Foster and
Llion Jones and
Mike Schuster and
Noam Shazeer and
Niki Parmar and
Ashish Vaswani and
Jakob Uszkoreit and
Lukasz Kaiser and
Zhifeng Chen and
Yonghui Wu and
Macduff Hughes},
title = {The Best of Both Worlds: Combining Recent Advances in Neural Machine
Translation},
pages = {76--86},
publisher = {Association for Computational Linguistics},
year = {2018}
}
@inproceedings{He2018LayerWiseCB,
title={Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation},
author={Tianyu He and Xu Tan and Yingce Xia and Di He and Tao Qin and Zhibo Chen and Tie-Yan Liu},
publisher={Conference and Workshop on Neural Information Processing Systems},
year={2018}
}
@inproceedings{cho-etal-2014-properties,
title = "On the Properties of Neural Machine Translation: Encoder--Decoder Approaches",
author = {Cho, Kyunghyun and
van Merri{\"e}nboer, Bart and
Bahdanau, Dzmitry and
Bengio, Yoshua},
month = oct,
year = "2014",
address = "Doha, Qatar",
publisher = "Association for Computational Linguistics",
pages = "103--111",
}
@inproceedings{DBLP:conf/acl/JeanCMB15,
author = {S{\'{e}}bastien Jean and
KyungHyun Cho and
Roland Memisevic and
Yoshua Bengio},
title = {On Using Very Large Target Vocabulary for Neural Machine Translation},
pages = {1--10},
publisher = {The Association for Computer Linguistics},
year = {2015}
}
@inproceedings{luong-etal-2015-effective,
title = "Effective Approaches to Attention-based Neural Machine Translation",
author = "Luong, Thang and
Pham, Hieu and
Manning, Christopher D.",
month = sep,
year = "2015",
address = "Lisbon, Portugal",
publisher = "Association for Computational Linguistics",
pages = "1412--1421",
}
@inproceedings{He2016ImprovedNM,
title={Improved Neural Machine Translation with SMT Features},
author={Wei He and Zhongjun He and Hua Wu and Haifeng Wang},
publisher={AAAI Conference on Artificial Intelligence},
year={2016}
}
@inproceedings{zhang-etal-2017-prior,
title = "Prior Knowledge Integration for Neural Machine Translation using Posterior Regularization",
author = "Zhang, Jiacheng and
Liu, Yang and
Luan, Huanbo and
Xu, Jingfang and
Sun, Maosong",
month = jul,
year = "2017",
address = "Vancouver, Canada",
publisher = "Association for Computational Linguistics",
pages = "1514--1523",
}
@inproceedings{duan-etal-2020-bilingual,
title = "Bilingual Dictionary Based Neural Machine Translation without Using Parallel Sentences",
author = "Duan, Xiangyu and
Ji, Baijun and
Jia, Hao and
Tan, Min and
Zhang, Min and
Chen, Boxing and
Luo, Weihua and
Zhang, Yue",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
pages = "1570--1579",
}
@inproceedings{cao-xiong-2018-encoding,
title = "Encoding Gated Translation Memory into Neural Machine Translation",
author = "Cao, Qian and
Xiong, Deyi",
month = oct # "-" # nov,
year = "2018",
address = "Brussels, Belgium",
publisher = "Association for Computational Linguistics",
pages = "3042--3047",
}
%%%%% chapter 10------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
......@@ -4778,9 +4872,7 @@ pages ={157-166},
Bowen Zhou and
Yoshua Bengio},
title = {A Structured Self-Attentive Sentence Embedding},
publisher = {5th International Conference on Learning Representations, {ICLR} 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings},
//publisher = {OpenReview.net},
publisher = {5th International Conference on Learning Representations},
year = {2017},
}
@inproceedings{Shaw2018SelfAttentionWR,
......@@ -4789,11 +4881,8 @@ pages ={157-166},
Ashish Vaswani},
title = {Self-Attention with Relative Position Representations},
publisher = {Proceedings of the 2018 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies,
NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short
Papers)},
the Association for Computational Linguistics: Human Language Technologies},
pages = {464--468},
//publisher = {Association for Computational Linguistics},
year = {2018},
}
@inproceedings{DBLP:journals/corr/HeZRS15,
......@@ -4802,10 +4891,8 @@ pages ={157-166},
Shaoqing Ren and
Jian Sun},
title = {Deep Residual Learning for Image Recognition},
publisher = {2016 {IEEE} Conference on Computer Vision and Pattern Recognition,
{CVPR} 2016, Las Vegas, NV, USA, June 27-30, 2016},
publisher = {{IEEE} Conference on Computer Vision and Pattern Recognition},
pages = {770--778},
//publisher = {{IEEE} Computer Society},
year = {2016},
}
@article{JMLR:v15:srivastava14a,
......@@ -4823,10 +4910,8 @@ pages ={157-166},
Jonathon Shlens and
Zbigniew Wojna},
title = {Rethinking the Inception Architecture for Computer Vision},
publisher = {2016 {IEEE} Conference on Computer Vision and Pattern Recognition,
{CVPR} 2016, Las Vegas, NV, USA, June 27-30, 2016},
publisher = {{IEEE} Conference on Computer Vision and Pattern Recognition},
pages = {2818--2826},
//publisher = {{IEEE} Computer Society},
year = {2016},
}
@inproceedings{DBLP:journals/corr/abs-1805-00631,
......@@ -4835,10 +4920,8 @@ pages ={157-166},
Jinsong Su},
title = {Accelerating Neural Transformer via an Average Attention Network},
publisher = {Proceedings of the 56th Annual Meeting of the Association for Computational
Linguistics, {ACL} 2018, Melbourne, Australia, July 15-20, 2018, Volume
1: Long Papers},
Linguistics},
pages = {1789--1798},
//publisher = {Association for Computational Linguistics},
year = {2018},
}
@article{DBLP:journals/corr/CourbariauxB16,
......@@ -4850,7 +4933,136 @@ pages ={157-166},
volume = {abs/1602.02830},
year = {2016},
}
@inproceedings{Wu2019PayLA,
author = {Felix Wu and
Angela Fan and
Alexei Baevski and
Yann N. Dauphin and
Michael Auli},
title = {Pay Less Attention with Lightweight and Dynamic Convolutions},
publisher = {7th International Conference on Learning Representations},
year = {2019},
}
@inproceedings{dai-etal-2019-transformer,
title = "Transformer-{XL}: Attentive Language Models beyond a Fixed-Length Context",
author = "Dai, Zihang and
Yang, Zhilin and
Yang, Yiming and
Carbonell, Jaime and
Le, Quoc and
Salakhutdinov, Ruslan",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
pages = "2978--2988",
}
@article{Liu2020LearningTE,
title={Learning to Encode Position for Transformer with Continuous Dynamical Model},
author={Xuanqing Liu and Hsiang-Fu Yu and Inderjit S. Dhillon and Cho-Jui Hsieh},
journal={ArXiv},
year={2020},
volume={abs/2003.09229}
}
@inproceedings{Jawahar2019WhatDB,
title={What Does BERT Learn about the Structure of Language?},
author={Ganesh Jawahar and Beno{\^{\i}}t Sagot and Djam{\'e} Seddah},
publisher={Annual Meeting of the Association for Computational Linguistics},
year={2019}
}
@inproceedings{Yang2018ModelingLF,
title={Modeling Localness for Self-Attention Networks},
author={Baosong Yang and Zhaopeng Tu and Derek F. Wong and Fandong Meng and Lidia S. Chao and Tong Zhang},
publisher={Conference on Empirical Methods in Natural Language Processing},
year={2018}
}
@inproceedings{DBLP:journals/corr/abs-1904-03107,
author = {Baosong Yang and
Longyue Wang and
Derek F. Wong and
Lidia S. Chao and
Zhaopeng Tu},
title = {Convolutional Self-Attention Networks},
pages = {4040--4045},
publisher = {Association for Computational Linguistics},
year = {2019},
}
@article{Wang2018MultilayerRF,
title={Multi-layer Representation Fusion for Neural Machine Translation},
author={Qiang Wang and Fuxue Li and Tong Xiao and Yanyang Li and Yinqiao Li and Jingbo Zhu},
journal={ArXiv},
year={2018},
volume={abs/2002.06714}
}
@inproceedings{Bapna2018TrainingDN,
title={Training Deeper Neural Machine Translation Models with Transparent Attention},
author={Ankur Bapna and Mia Xu Chen and Orhan Firat and Yuan Cao and Yonghui Wu},
publisher={Conference on Empirical Methods in Natural Language Processing},
year={2018}
}
@inproceedings{Dou2018ExploitingDR,
title={Exploiting Deep Representations for Neural Machine Translation},
author={Zi-Yi Dou and Zhaopeng Tu and Xing Wang and Shuming Shi and Tong Zhang},
publisher={Conference on Empirical Methods in Natural Language Processing},
year={2018}
}
@inproceedings{Wang2019ExploitingSC,
title={Exploiting Sentential Context for Neural Machine Translation},
author={Xing Wang and Zhaopeng Tu and Longyue Wang and Shuming Shi},
publisher={Annual Meeting of the Association for Computational Linguistics},
year={2019}
}
@inproceedings{Wang2019LearningDT,
title = "Learning Deep Transformer Models for Machine Translation",
author = "Wang, Qiang and
Li, Bei and
Xiao, Tong and
Zhu, Jingbo and
Li, Changliang and
Wong, Derek F. and
Chao, Lidia S.",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
pages = "1810--1822"
}
@inproceedings{Dou2019DynamicLA,
title={Dynamic Layer Aggregation for Neural Machine Translation},
author={Zi-Yi Dou and Zhaopeng Tu and Xing Wang and Longyue Wang and Shuming Shi and Tong Zhang},
publisher={AAAI Conference on Artificial Intelligence},
year={2019}
}
@inproceedings{Wei2020MultiscaleCD,
title={Multiscale Collaborative Deep Models for Neural Machine Translation},
author={Xiangpeng Wei and Heng Yu and Yue Hu and Yue Zhang and Rongxiang Weng and Weihua Luo},
publisher={Annual Meeting of the Association for Computational Linguistics},
year={2020}
}
@inproceedings{Vaswani2018Tensor2TensorFN,
title={Tensor2Tensor for Neural Machine Translation},
author={Ashish Vaswani and Samy Bengio and Eugene Brevdo and Fran{\c{c}}ois Chollet and Aidan N. Gomez and Stephan Gouws and Llion Jones and Lukasz Kaiser and Nal Kalchbrenner and Niki Parmar and Ryan Sepassi and Noam Shazeer and Jakob Uszkoreit},
publisher={Association for Machine Translation in the Americas},
year={2018}
}
@article{Kitaev2020ReformerTE,
title={Reformer: The Efficient Transformer},
author={Nikita Kitaev and Lukasz Kaiser and Anselm Levskaya},
journal={ArXiv},
year={2020},
volume={abs/2001.04451}
}
%%%%% chapter 12------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
......