Merge branch 'zengxin' into 'caorunzhe'

Zengxin See merge request !509

Commit 7d9dbc29 authored Nov 29, 2020 by zengxin
--- a/Chapter10/Figures/figure-attention-of-source-and-target-words.tex
+++ b/Chapter10/Figures/figure-attention-of-source-and-target-words.tex
@@ -47,7 +47,7 @@

 {
 \draw [<->,ublue,thick] ([xshift=0.3em]ws4.south) .. controls +(-60:1) and +(south:1) .. (wt4.south);
-\draw [<->,ublue,thick] (ws4.south) .. controls +(south:1.0) and +(south:1.5) .. (wt5.south);
+\draw [<->,ublue,thick] (ws4.south) .. controls +(south:1) and +(south:1.5) .. (wt5.south);
 }

 {

--- a/Chapter10/Figures/figure-encoder-decoder-with-attention.tex
+++ b/Chapter10/Figures/figure-encoder-decoder-with-attention.tex
@@ -80,9 +80,9 @@

 \draw[<-] ([yshift=0.1em,xshift=1em]t6.north) -- ([yshift=1.2em,xshift=1em]t6.north);

-\draw [->] ([yshift=3em]s6.north) -- ([yshift=4em]s6.north) -- ([yshift=4em]t1.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c1) {\scriptsize{表示$\vectorn{C}_1$}} -- ([yshift=3em]t1.north) ;
-\draw [->] ([yshift=3em]s5.north) -- ([yshift=5.3em]s5.north) -- ([yshift=5.3em]t2.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c2) {\scriptsize{表示$\vectorn{C}_2$}} -- ([yshift=3em]t2.north) ;
-\draw [->] ([yshift=3.5em]s3.north) -- ([yshift=6.6em]s3.north) -- ([yshift=6.6em]t4.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c3) {\scriptsize{表示$\vectorn{C}_i$}} -- ([yshift=3.5em]t4.north) ;
+\draw [->] ([yshift=3em]s6.north) -- ([yshift=4em]s6.north) -- ([yshift=4em]t1.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c1) {\scriptsize{表示$\mathbi{C}_1$}} -- ([yshift=3em]t1.north) ;
+\draw [->] ([yshift=3em]s5.north) -- ([yshift=5.3em]s5.north) -- ([yshift=5.3em]t2.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c2) {\scriptsize{表示$\mathbi{C}_2$}} -- ([yshift=3em]t2.north) ;
+\draw [->] ([yshift=3.5em]s3.north) -- ([yshift=6.6em]s3.north) -- ([yshift=6.6em]t4.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c3) {\scriptsize{表示$\mathbi{C}_i$}} -- ([yshift=3.5em]t4.north) ;
 \node [anchor=north] (smore) at ([yshift=3.5em]s3.north) {...};
 \node [anchor=north] (tmore) at ([yshift=3.5em]t4.north) {...};


--- a/Chapter10/chapter10.tex
+++ b/Chapter10/chapter10.tex
@@ -925,14 +925,14 @@ a (\mathbi{s},\mathbi{h}) =  \left\{ \begin{array}{ll}
 %----------------------------------------------------------------------------------------
 \subsection{训练}

-\parinterval 在基于梯度的方法中，模型参数可以通过损失函数$L$对于参数的梯度进行不断更新。对于第$\textrm{step}$步参数更新，首先进行神经网络的前向计算，之后进行反向计算，并得到所有参数的梯度信息，再使用下面的规则进行参数更新：
+\parinterval 在基于梯度的方法中，模型参数可以通过损失函数$L$对参数的梯度进行不断更新。对于第$\textrm{step}$步参数更新，首先进行神经网络的前向计算，之后进行反向计算，并得到所有参数的梯度信息，再使用下面的规则进行参数更新：

 \begin{eqnarray}
 \mathbi{w}_{\textrm{step}+1} = \mathbi{w}_{\textrm{step}} - \alpha \cdot \frac{ \partial L(\mathbi{w}_{\textrm{step}})} {\partial \mathbi{w}_{\textrm{step}} }
 \label{eq:10-30}
 \end{eqnarray}

-\noindent 其中，$\mathbi{w}_{\textrm{step}}$表示更新前的模型参数，$\mathbi{w}_{\textrm{step}+1}$表示更新后的模型参数，$L(\mathbi{w}_{\textrm{step}})$表示模型相对于$\mathbi{w}_{\textrm{step}}$ 的损失，$\frac{\partial L(\mathbi{w}_{\textrm{step}})} {\partial \mathbi{w}_{\textrm{step}} }$表示损失函数的梯度，$\alpha$是更新的步进值。也就是说，给定一定量的训练数据，不断执行公式\eqref{eq:10-30}的过程。反复使用训练数据，直至模型参数达到收敛或者损失函数不再变化。通常，把公式的一次执行称为“一步”更新/训练，把访问完所有样本的训练称为“一轮”训练。
+\noindent 其中，$\mathbi{w}_{\textrm{step}}$表示更新前的模型参数，$\mathbi{w}_{\textrm{step}+1}$表示更新后的模型参数，$L(\mathbi{w}_{\textrm{step}})$表示模型相对于$\mathbi{w}_{\textrm{step}}$ 的损失，$\frac{\partial L(\mathbi{w}_{\textrm{step}})} {\partial \mathbi{w}_{\textrm{step}} }$表示损失函数的梯度，$\alpha$是更新的步长。也就是说，给定一定量的训练数据，不断执行公式\eqref{eq:10-30}的过程。反复使用训练数据，直至模型参数达到收敛或者损失函数不再变化。通常，把公式的一次执行称为“一步”更新/训练，把访问完所有样本的训练称为“一轮”训练。

 \parinterval 将公式\eqref{eq:10-30}应用于神经机器翻译有几个基本问题需要考虑：1）损失函数的选择；2）参数初始化的策略，也就是如何设置$\mathbi{w}_0$；3）优化策略和学习率调整策略；4）训练加速。下面对这些问题进行讨论。
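The update rule in eq. 10-30 touched by this hunk, w_{step+1} = w_step - alpha * dL/dw_step, can be illustrated with a minimal Python sketch. The toy quadratic loss, the names `sgd_step`, `train` and `toy_grad`, and the values of `alpha` and `steps` are assumptions made for illustration; none of them come from the chapter.

```python
from typing import Callable, List

Vector = List[float]


def sgd_step(w: Vector, grad: Vector, alpha: float) -> Vector:
    """One application of w_{step+1} = w_step - alpha * dL/dw_step."""
    return [wi - alpha * gi for wi, gi in zip(w, grad)]


def train(w0: Vector,
          grad_loss: Callable[[Vector], Vector],
          alpha: float,
          steps: int) -> Vector:
    """Repeat the update; each loop iteration is "one step" of training."""
    w = list(w0)
    for _ in range(steps):
        w = sgd_step(w, grad_loss(w), alpha)
    return w


def toy_grad(w: Vector) -> Vector:
    # Hypothetical toy loss L(w) = sum((w_i - 3)^2); its gradient is 2 * (w_i - 3).
    return [2.0 * (wi - 3.0) for wi in w]


w_final = train([0.0, 10.0], toy_grad, alpha=0.1, steps=200)
```

With this toy loss every component moves toward 3, and the distance shrinks by a constant factor per step. A real translation model would obtain the gradient from back-propagation instead of a closed form.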

@@ -1190,7 +1190,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \subsubsection{2. 束搜索}
 \vspace{0.5em}

-\parinterval 束搜索是一种启发式图搜索算法。相比于全搜索，它可以减少搜索所占用的空间和时间，在每一步扩展的时候，剪掉一些质量比较差的结点，保留下一些质量较高的结点。具体到机器翻译任务，对于每一个目标语言位置，束搜索选择了概率最大的前$K$个单词进行扩展（其中$k$叫做束宽度，或简称为束宽）。如图\ref{fig:10-34}所示，假设\{$y_1, y_2,..., y_n$\}表示生成的目标语言序列，且$k=3$，则束搜索的具体过程为：在预测第一个位置时，可以通过模型得到$y_1$的概率分布，选取概率最大的前3个单词作为候选结果（假设分别为“have”, “has”, “it”）。在预测第二个位置的单词时，模型针对已经得到的三个候选结果（“have”, “has”, “it”）计算第二个单词的概率分布。因为$y_2$对应$|V|$种可能，总共可以得到$3 \times |V|$种结果。然后从中选取使序列概率$\funp{P}(y_2,y_1| \seq{{x}})$最大的前三个$y_2$作为新的输出结果，这样便得到了前两个位置的top-3译文。在预测其他位置时也是如此，不断重复此过程直到推断结束。可以看到，束搜索的搜索空间大小与束宽度有关，也就是：束宽度越大，搜索空间越大，更有可能搜索到质量更高的译文，但同时搜索会更慢。束宽度等于3，意味着每次只考虑三个最有可能的结果，贪婪搜索实际上便是束宽度为1的情况。在神经机器翻译系统实现中，一般束宽度设置在4～8之间。
+\parinterval 束搜索是一种启发式图搜索算法。相比于全搜索，它可以减少搜索所占用的空间和时间，在每一步扩展的时候，剪掉一些质量比较差的结点，保留下一些质量较高的结点。具体到机器翻译任务，对于每一个目标语言位置，束搜索选择了概率最大的前$k$个单词进行扩展（其中$k$叫做束宽度，或简称为束宽）。如图\ref{fig:10-34}所示，假设\{$y_1, y_2,..., y_n$\}表示生成的目标语言序列，且$k=3$，则束搜索的具体过程为：在预测第一个位置时，可以通过模型得到$y_1$的概率分布，选取概率最大的前3个单词作为候选结果（假设分别为“have”, “has”, “it”）。在预测第二个位置的单词时，模型针对已经得到的三个候选结果（“have”, “has”, “it”）计算第二个单词的概率分布。因为$y_2$对应$|V|$种可能，总共可以得到$3 \times |V|$种结果。然后从中选取使序列概率$\funp{P}(y_2,y_1| \seq{{x}})$最大的前三个$y_2$作为新的输出结果，这样便得到了前两个位置的top-3译文。在预测其他位置时也是如此，不断重复此过程直到推断结束。可以看到，束搜索的搜索空间大小与束宽度有关，也就是：束宽度越大，搜索空间越大，更有可能搜索到质量更高的译文，但同时搜索会更慢。束宽度等于3，意味着每次只考虑三个最有可能的结果，贪婪搜索实际上便是束宽度为1的情况。在神经机器翻译系统实现中，一般束宽度设置在4～8之间。
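The beam search procedure described in this paragraph (expand each of the k kept hypotheses over the vocabulary, then keep the k best of the k x |V| candidates) can be sketched as follows. This is a sketch under stated assumptions: `toy_model` and its probabilities are invented for illustration, and the end-of-sentence marker `</s>` is an assumed convention rather than something the chapter specifies.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[float, List[str]]  # (sequence log-probability, words so far)


def beam_search(next_log_probs: Callable[[List[str]], Dict[str, float]],
                k: int,
                max_len: int,
                eos: str = "</s>") -> List[Hypothesis]:
    """Keep the k highest-scoring partial translations at every position."""
    beam: List[Hypothesis] = [(0.0, [])]
    for _ in range(max_len):
        candidates: List[Hypothesis] = []
        for score, words in beam:
            if words and words[-1] == eos:
                candidates.append((score, words))  # finished, carried over as is
                continue
            # Expanding one hypothesis yields one candidate per vocabulary word.
            for word, logp in next_log_probs(words).items():
                candidates.append((score + logp, words + [word]))
        # Prune: keep only the k best candidates.
        candidates.sort(key=lambda h: h[0], reverse=True)
        beam = candidates[:k]
        if all(words[-1] == eos for _, words in beam):
            break
    return beam


def toy_model(prefix: List[str]) -> Dict[str, float]:
    # Hypothetical next-word distribution, for illustration only.
    if not prefix:
        probs = {"have": 0.4, "has": 0.35, "it": 0.2, "</s>": 0.05}
    elif prefix[-1] == "has":
        probs = {"it": 0.9, "</s>": 0.1}
    else:
        probs = {"it": 0.3, "has": 0.3, "</s>": 0.4}
    return {w: math.log(p) for w, p in probs.items()}


greedy = beam_search(toy_model, k=1, max_len=2)  # beam width 1 is greedy search
top3 = beam_search(toy_model, k=3, max_len=2)
```

In this toy setting greedy search commits to "have" at the first position, while the width-3 beam keeps "has" alive and ends with a more probable two-word sequence. That is the trade-off the paragraph describes: a wider beam explores more of the search space at a higher cost.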

 %----------------------------------------------
 \begin{figure}[htp]
@@ -1245,7 +1245,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 %    NEW SECTION
 %----------------------------------------------------------------------------------------
 \sectionnewpage
-\section{小节及拓展阅读}
+\section{小结及拓展阅读}

 \parinterval 神经机器翻译是近几年的热门方向。无论是前沿性的技术探索，还是面向应用落地的系统研发，神经机器翻译已经成为当下最好的选择之一。研究人员对神经机器翻译的热情使得这个领域得到了快速的发展。本章作为神经机器翻译的入门章节，对神经机器翻译的建模思想和基础框架进行了描述。同时，对常用的神经机器翻译架构\ \dash \ 循环神经网络进行了讨论与分析。


--- a/Chapter11/Figures/figure-single-glu.tex
+++ b/Chapter11/Figures/figure-single-glu.tex
@@ -64,8 +64,8 @@ $\otimes$： & 按位乘运算 \\
 	\draw[-latex,thick] (c2.east) -- ([xshift=0.4cm]c2.east); 
 	
 	\node[inner sep=0pt, font=\tiny] at (0.75cm, -0.4cm) {$\mathbi{x}$};
-	\node[inner sep=0pt, font=\tiny] at ([yshift=-0.8cm]a.south) {$\mathbi{B}=\mathbi{x} * \mathbi{V} + \mathbi{b}_{\mathbi{W}}$};
-	\node[inner sep=0pt, font=\tiny] at ([yshift=-0.8cm]b.south) {$\mathbi{A}=\mathbi{x} * \mathbi{W} + \mathbi{b}_{\mathbi{V}}$};
+	\node[inner sep=0pt, font=\tiny] at ([yshift=-0.8cm]a.south) {$\mathbi{A}=\mathbi{x} * \mathbi{W} + \mathbi{b}_{\mathbi{W}}$};
+	\node[inner sep=0pt, font=\tiny] at ([yshift=-0.8cm]b.south) {$\mathbi{B}=\mathbi{x} * \mathbi{V} + \mathbi{b}_{\mathbi{V}}$};
 	\node[inner sep=0pt, font=\tiny] at (8.2cm, -0.4cm) {$\mathbi{y}=\mathbi{A} \otimes \sigma(\mathbi{B})$};
 	
 \end{tikzpicture}
\ No newline at end of file
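The labels corrected in this figure define a gated linear unit: A = x * W + b_W, B = x * V + b_V, and y = A ⊗ σ(B). A minimal Python sketch of that computation follows. It assumes a single input position whose window has already been flattened into one row vector, so that `*` reduces to a matrix product; the helper names `affine`, `sigmoid` and `glu` and the example weights are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def affine(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute x * weight + bias for a row vector x (weight is len(x) x d_out)."""
    return [sum(xi * weight[i][j] for i, xi in enumerate(x)) + bias[j]
            for j in range(len(bias))]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def glu(x: Vector, W: Matrix, b_W: Vector, V: Matrix, b_V: Vector) -> Vector:
    """y = A (element-wise product) sigmoid(B), following the figure labels."""
    A = affine(x, W, b_W)  # candidate values
    B = affine(x, V, b_V)  # gate pre-activations
    return [a * sigmoid(b) for a, b in zip(A, B)]


# Identity W and an all-zero V: every gate equals sigmoid(0) = 0.5.
y = glu([1.0, 2.0],
        W=[[1.0, 0.0], [0.0, 1.0]], b_W=[0.0, 0.0],
        V=[[0.0, 0.0], [0.0, 0.0]], b_V=[0.0, 0.0])
```

Here each output is half of the corresponding input, which makes the roles of the two branches visible: A carries the content, and σ(B) decides how much of it passes through.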
--- a/Chapter11/chapter11.tex
+++ b/Chapter11/chapter11.tex
@@ -579,7 +579,7 @@
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------

-\section{小节及拓展阅读}
+\section{小结及拓展阅读}

 \parinterval 卷积是一种高效的神经网络结构，在图像、语音处理等领域取得了令人瞩目的成绩。本章介绍了卷积的概念及其特性，并对池化、填充等操作进行了讨论。本章介绍了具有高并行计算能力的机器翻译范式，即基于卷积神经网络的编码器-解码器框架。其在机器翻译任务上表现出色，并大幅度缩短了模型的训练周期。除了基础部分，本章还针对卷积计算进行了延伸，内容涉及逐通道卷积、逐点卷积、轻量卷积和动态卷积等。除了上述提及的内容，卷积神经网络及其变种在文本分类、命名实体识别、关系分类、事件抽取等其他自然语言处理任务上也有许多应用\upcite{Kim2014ConvolutionalNN,2011Natural,DBLP:conf/cncl/ZhouZXQBX17,DBLP:conf/acl/ChenXLZ015,DBLP:conf/coling/ZengLLZZ14}。


--- a/Chapter12/Figures/figure-different-regularization-methods.tex
+++ b/Chapter12/Figures/figure-different-regularization-methods.tex
@@ -7,15 +7,15 @@
 \tikzstyle{standard} = [rounded corners=3pt]

 \node [lnode,anchor=west] (l1) at (0,0) {\scriptsize{子层}};
-\node [lnode,anchor=west] (l2) at ([xshift=3em]l1.east) {\scriptsize{层正则化}};
-\node [lnode,anchor=west] (l3) at ([xshift=4em]l2.east) {\scriptsize{层正则化}};
+\node [lnode,anchor=west] (l2) at ([xshift=3em]l1.east) {\scriptsize{层标准化}};
+\node [lnode,anchor=west] (l3) at ([xshift=4em]l2.east) {\scriptsize{层标准化}};
 \node [lnode,anchor=west] (l4) at ([xshift=1.5em]l3.east) {\scriptsize{子层}};

 \node [anchor=west] (plus1) at ([xshift=0.9em]l1.east) {\scriptsize{$\mathbf{\oplus}$}};
 \node [anchor=west] (plus2) at ([xshift=0.9em]l4.east) {\scriptsize{$\mathbf{\oplus}$}};

-\node [anchor=north] (label1) at ([xshift=3em,yshift=-0.5em]l1.south) {\scriptsize{(a)后正则化}};
-\node [anchor=north] (label2) at ([xshift=3em,yshift=-0.5em]l3.south) {\scriptsize{(b)前正则化}};
+\node [anchor=north] (label1) at ([xshift=3em,yshift=-0.5em]l1.south) {\scriptsize{(a)后标准化}};
+\node [anchor=north] (label2) at ([xshift=3em,yshift=-0.5em]l3.south) {\scriptsize{(b)前标准化}};

 \draw [->,thick] ([xshift=-1.5em]l1.west) -- ([xshift=-0.1em]l1.west);
 \draw [->,thick] ([xshift=0.1em]l1.east) -- ([xshift=0.2em]plus1.west);

--- a/Chapter12/chapter12.tex
+++ b/Chapter12/chapter12.tex
@@ -163,13 +163,13 @@
 \vspace{0.5em}
 \item {\small\sffamily\bfseries{残差连接}}（标记为“Add”）：对于自注意力子层和前馈神经网络子层，都有一个从输入直接到输出的额外连接，也就是一个跨子层的直连。残差连接可以使深层网络的信息传递更为有效；
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{层正则化}}\index{层正则化}（Layer Normalization）：自注意力子层和前馈神经网络子层进行最终输出之前，会对输出的向量进行层正则化，规范结果向量取值范围，这样易于后面进一步的处理。
+\item {\small\sffamily\bfseries{层标准化}}\index{层标准化}（Layer Normalization）：自注意力子层和前馈神经网络子层进行最终输出之前，会对输出的向量进行层标准化，规范结果向量取值范围，这样易于后面进一步的处理。
 \vspace{0.5em}
 \end{itemize}

 \parinterval 以上操作就构成了Transformer的一层，各个模块执行的顺序可以简单描述为：Self-Attention $\to$ Residual Connection $\to$ Layer Normalization $\to$ Feed Forward Network $\to$ Residual Connection $\to$ Layer Normalization。编码器可以包含多个这样的层，比如，可以构建一个六层编码器，每层都执行上面的操作。最上层的结果作为整个编码的结果，会被传入解码器。

-\parinterval 解码器的结构与编码器十分类似。它也是由若干层组成，每一层包含编码器中的所有结构，即：自注意力子层、前馈神经网络子层、残差连接和层正则化模块。此外，为了捕捉源语言的信息，解码器又引入了一个额外的{\small\sffamily\bfseries{编码-解码注意力子层}}\index{编码-解码注意力子层}（Encoder-Decoder Attention Sub-layer）\index{Encoder-Decoder Attention Sub-layer}。这个新的子层，可以帮助模型使用源语言句子的表示信息生成目标语不同位置的表示。编码-解码注意力子层仍然基于自注意力机制，因此它和自注意力子层的结构是相同的，只是$\mathrm{query}$、$\mathrm{key}$、$\mathrm{value}$的定义不同。比如，在解码端，自注意力子层的$\mathrm{query}$、$\mathrm{key}$、$\mathrm{value}$是相同的，它们都等于解码端每个位置的表示。而在编码-解码注意力子层中，$\mathrm{query}$是解码端每个位置的表示，此时$\mathrm{key}$和$\mathrm{value}$是相同的，等于编码端每个位置的表示。图\ref{fig:12-40}给出了这两种不同注意力子层输入的区别。
+\parinterval 解码器的结构与编码器十分类似。它也是由若干层组成，每一层包含编码器中的所有结构，即：自注意力子层、前馈神经网络子层、残差连接和层标准化模块。此外，为了捕捉源语言的信息，解码器又引入了一个额外的{\small\sffamily\bfseries{编码-解码注意力子层}}\index{编码-解码注意力子层}（Encoder-Decoder Attention Sub-layer）\index{Encoder-Decoder Attention Sub-layer}。这个新的子层，可以帮助模型使用源语言句子的表示信息生成目标语不同位置的表示。编码-解码注意力子层仍然基于自注意力机制，因此它和自注意力子层的结构是相同的，只是$\mathrm{query}$、$\mathrm{key}$、$\mathrm{value}$的定义不同。比如，在解码端，自注意力子层的$\mathrm{query}$、$\mathrm{key}$、$\mathrm{value}$是相同的，它们都等于解码端每个位置的表示。而在编码-解码注意力子层中，$\mathrm{query}$是解码端每个位置的表示，此时$\mathrm{key}$和$\mathrm{value}$是相同的，等于编码端每个位置的表示。图\ref{fig:12-40}给出了这两种不同注意力子层输入的区别。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -184,7 +184,7 @@

 \parinterval 在进行更详细的介绍前，先利用图\ref{fig:12-39}简单了解一下Transformer模型是如何进行翻译的。首先，Transformer将源语言句子“我/很/好”的词嵌入融合位置编码后作为输入。然后，编码器对输入的源语句子进行逐层抽象，得到包含丰富的上下文信息的源语表示并传递给解码器。解码器的每一层，使用自注意力子层对输入解码端的表示进行加工，之后再使用编码-解码注意力子层融合源语句子的表示信息。就这样逐词生成目标语译文单词序列。解码器每个位置的输入是当前单词（比如，“I”），而这个位置的输出是下一个单词（比如，“am”），这个设计和标准的神经语言模型是完全一样的。

-\parinterval 当然，这里可能还有很多疑惑，比如，什么是位置编码？Transformer的自注意力机制具体是怎么进行计算的，其结构是怎样的？层正则化又是什么？等等。下面就一一展开介绍。
+\parinterval 当然，这里可能还有很多疑惑，比如，什么是位置编码？Transformer的自注意力机制具体是怎么进行计算的，其结构是怎样的？层标准化又是什么？等等。下面就一一展开介绍。

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -381,7 +381,7 @@
 %    NEW SECTION
 %----------------------------------------------------------------------------------------

-\section{残差网络和层正则化}
+\section{残差网络和层标准化}

 \parinterval Transformer编码器、解码器分别由多层网络组成（通常为6层），每层网络又包含多个子层（自注意力网络、前馈神经网络）。因此Transformer实际上是一个很深的网络结构。再加上点乘注意力机制中包含很多线性和非线性变换；且注意力函数Attention($\cdot$)的计算也涉及多层网络，整个网络的信息传递非常复杂。从反向传播的角度来看，每次回传的梯度都会经过若干步骤，容易产生梯度爆炸或者消失。解决这个问题的一种办法就是使用残差连接\upcite{DBLP:journals/corr/HeZRS15}，此部分内容已经在{\chapternine}进行了介绍，这里不再赘述。

@@ -408,7 +408,7 @@
 \begin{figure}[htp]
 \centering
 \input{./Chapter12/Figures/figure-position-of-difference-and-layer-regularization-in-the-model}
-\caption{残差和层正则化在模型中的位置}
+\caption{残差和层标准化在模型中的位置}
 \label{fig:12-50}
 \end{figure}
 %----------------------------------------------
@@ -420,7 +420,7 @@
 \label{eq:12-50}
 \end{eqnarray}

-\noindent 其中$\mathbi{x}^l$表示第$l$层网络的输入向量，$F (\mathbi{x}^l)$是子层运算，这样会导致不同层（或子层）的结果之间的差异性很大，造成训练过程不稳定、训练时间较长。为了避免这种情况，在每层中加入了层正则化操作\upcite{Ba2016LayerN}。图\ref{fig:12-50} 中的红色方框展示了Transformer中残差和层正则化的位置。层正则化的计算公式如下：
+\noindent 其中$\mathbi{x}^l$表示第$l$层网络的输入向量，$F (\mathbi{x}^l)$是子层运算，这样会导致不同层（或子层）的结果之间的差异性很大，造成训练过程不稳定、训练时间较长。为了避免这种情况，在每层中加入了层标准化操作\upcite{Ba2016LayerN}。图\ref{fig:12-50} 中的红色方框展示了Transformer中残差和层标准化的位置。层标准化的计算公式如下：
 \begin{eqnarray}
 \textrm{LN}(\mathbi{x}) = g \cdot \frac{\mathbi{x}- \mu} {\sigma} + b
 \label{eq:12-51}
@@ -428,13 +428,13 @@

 \noindent 该公式使用均值$\mu$和方差$\sigma$对样本进行平移缩放，将数据规范化为均值为0，方差为1的标准分布。$g$和$b$是可学习的参数。

-\parinterval 在Transformer中经常使用的层正则化操作有两种结构，分别是{\small\bfnew{后正则化}}\index{后正则化}（Post-norm）\index{Post-norm}和{\small\bfnew{前正则化}}\index{前正则化}（Pre-norm）\index{Pre-norm}，结构如图\ref{fig:12-51}所示。后正则化中先进行残差连接再进行层正则化，而前正则化则是在子层输入之前进行层正则化操作。在很多实践中已经发现，前正则化的方式更有利于信息传递，因此适合训练深层的Transformer模型\upcite{WangLearning}。
+\parinterval 在Transformer中经常使用的层标准化操作有两种结构，分别是{\small\bfnew{后标准化}}\index{后标准化}（Post-norm）\index{Post-norm}和{\small\bfnew{前标准化}}\index{前标准化}（Pre-norm）\index{Pre-norm}，结构如图\ref{fig:12-51}所示。后标准化中先进行残差连接再进行层标准化，而前标准化则是在子层输入之前进行层标准化操作。在很多实践中已经发现，前标准化的方式更有利于信息传递，因此适合训练深层的Transformer模型\upcite{WangLearning}。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter12/Figures/figure-different-regularization-methods}
-\caption{不同正则化方式 }
+\caption{不同标准化方式 }
 \label{fig:12-51}
 \end{figure}
 %----------------------------------------------
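The layer normalization formula in eq. 12-51, LN(x) = g * (x - mu) / sigma + b, and the two placements renamed in this hunk (post-norm and pre-norm) can be sketched in Python as below. This is an illustration under assumptions: sigma is computed as the standard deviation over the vector, `eps` is an added numerical stabiliser that does not appear in the formula, g and b are kept as scalars as written in the formula, and the toy `sublayer` stands in for a self-attention or feed-forward sublayer.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, g: float = 1.0, b: float = 0.0,
               eps: float = 1e-6) -> Vector:
    """LN(x) = g * (x - mu) / sigma + b, with statistics taken over the vector."""
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    sigma = math.sqrt(var + eps)  # eps is an assumed stabiliser
    return [g * (xi - mu) / sigma + b for xi in x]


def post_norm(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Post-norm: residual connection first, then layer normalization."""
    return layer_norm([xi + fi for xi, fi in zip(x, sublayer(x))])


def pre_norm(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Pre-norm: normalize the sublayer input, then add the residual."""
    return [xi + fi for xi, fi in zip(x, sublayer(layer_norm(x)))]


def sublayer(v: Vector) -> Vector:
    # Hypothetical stand-in for a real sublayer.
    return [2.0 * vi for vi in v]


x = [1.0, 2.0, 3.0, 4.0]
post = post_norm(x, sublayer)
pre = pre_norm(x, sublayer)
```

In the post-norm output the whole residual sum is renormalized, while in the pre-norm output the input x reaches the result through an unnormalized identity path. That direct path is one common explanation for why pre-norm is easier to train in deep stacks, in line with the remark in the text above.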
@@ -535,7 +535,7 @@ lrate = d_{\textrm{model}}^{-0.5} \cdot \textrm{min} (\textrm{step}^{-0.5} , \te
 \vspace{0.5em}
 \item  Transformer Big：为了提升网络的容量，使用更宽的网络。在Base的基础上增大隐层维度至1024，前馈神经网络的维度变为4096，多头注意力机制为16头，Dropout设为0.3。
 \vspace{0.5em}
-\item Transformer Deep：加深编码器网络层数可以进一步提升网络的性能，它的参数设置与Transformer Base基本一致，但是层数增加到48层，同时使用Pre-Norm作为层正则化的结构。
+\item Transformer Deep：加深编码器网络层数可以进一步提升网络的性能，它的参数设置与Transformer Base基本一致，但是层数增加到48层，同时使用Pre-Norm作为层标准化的结构。
 \vspace{0.5em}
 \end{itemize}


--- a/ChapterAppend/chapterappend.tex
+++ b/ChapterAppend/chapterappend.tex
@@ -26,6 +26,110 @@
 \begin{appendices}
 \chapter{附录A}
 \label{appendix-A}
+\parinterval  从实践的角度，机器翻译的发展主要可以归功于两方面的推动作用：开源系统和评测。开源系统通过代码共享的方式使得最新的研究成果可以快速传播，同时实验结果可以复现。而评测比赛，使得各个研究组织的成果可以进行科学的对比，共同推动机器翻译的发展与进步。此外，开源项目也促进了不同团队之间的协作，让研究人员在同一个平台上集中力量攻关。
+
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+
+\section{统计机器翻译开源系统}
+
+\begin{itemize}
+\vspace{0.5em}
+\item NiuTrans.SMT。NiuTrans\upcite{Tong2012NiuTrans}是由东北大学自然语言处理实验室自主研发的统计机器翻译系统，该系统可支持基于短语的模型、基于层次短语的模型以及基于句法的模型。由于使用C++ 语言开发，所以该系统运行时间快，所占存储空间少。系统中内嵌有$n$-gram语言模型，故无需使用其他的系统即可对完成语言建模。网址：\url{http://opensource.niutrans.com/smt/index.html}
+\vspace{0.5em}
+\item Moses。Moses\upcite{Koehn2007Moses}是统计机器翻译时代最著名的系统之一，（主要）由爱丁堡大学的机器翻译团队开发。最新的Moses系统支持很多的功能，例如，它既支持基于短语的模型，也支持基于句法的模型。Moses 提供因子化翻译模型（Factored Translation Model），因此该模型可以很容易地对不同层次的信息进行建模。此外，它允许将混淆网络和字格作为输入，可缓解系统的1-best输出中的错误。Moses还提供了很多有用的脚本和工具，被机器翻译研究者广泛使用。网址：\url{http://www.statmt.org/moses/}
+\vspace{0.5em}
+\item Joshua。Joshua\upcite{Li2010Joshua}是由约翰霍普金斯大学的语言和语音处理中心开发的层次短语翻译系统。由于Joshua是由Java语言开发，所以它在不同的平台上运行或开发时具有良好的可扩展性和可移植性。Joshua也是使用非常广泛的开源机器翻译系统之一。网址：\url{https://cwiki.apache.org/confluence/display/JOSHUA/}
+\vspace{0.5em}
+\item SilkRoad。SilkRoad是由五个国内机构（中科院计算所、中科院软件所、中科院自动化所、厦门大学和哈尔滨工业大学）联合开发的基于短语的统计机器翻译系统。该系统是中国乃至亚洲地区第一个开源的统计机器翻译系统。SilkRoad支持多种解码器和规则提取模块，这样可以组合成不同的系统，提供多样的选择。网址：\url{http://www.nlp.org.cn/project/project.php?projid=14}
+\vspace{0.5em}
+\item SAMT。SAMT\upcite{zollmann2007the}是由卡内基梅隆大学机器翻译团队开发的语法增强的统计机器翻译系统。SAMT在解码的时候使用目标树来生成翻译规则，而不严格遵守目标语言的语法。SAMT 的一个亮点是它提供了简单但高效的方式在机器翻译中使用句法信息。由于SAMT在hadoop中实现，它可受益于大数据集的分布式处理。网址：\url{http://www.cs.cmu.edu/zollmann/samt/}
+\vspace{0.5em}
+\item HiFST。HiFST\upcite{iglesias2009hierarchical}是剑桥大学开发的统计机器翻译系统。该系统完全基于有限状态自动机实现，因此非常适合对搜索空间进行有效的表示。网址：\url{http://ucam-smt.github.io/}
+\vspace{0.5em}
+\item cdec。cdec\upcite{dyer2010cdec}是一个强大的解码器，是由Chris Dyer 和他的合作者们一起开发。cdec的主要功能是它使用了翻译模型的一个统一的内部表示，并为结构预测问题的各种模型和算法提供了实现框架。所以，cdec也可以被用来做一个对齐系统或者一个更通用的学习框架。此外，由于使用C++语言编写，cdec的运行速度较快。网址：\url{http://cdec-decoder.org/index.php?title=MainPage}
+\vspace{0.5em}
+\item Phrasal。Phrasal\upcite{Cer2010Phrasal}是由斯坦福大学自然语言处理小组开发的系统。除了传统的基于短语的模型，Phrasal还支持基于非层次短语的模型，这种模型将基于短语的翻译延伸到非连续的短语翻译，增加了模型的泛化能力。网址：\url{http://nlp.stanford.edu/phrasal/}
+\vspace{0.5em}
+\item Jane。Jane\upcite{vilar2012jane}是一个基于短语和基于层次短语的机器翻译系统，由亚琛工业大学的人类语言技术与模式识别小组开发。Jane提供了系统融合模块，因此可以非常方便的对多个系统进行融合。网址：\url{https://www-i6.informatik.rwth-aachen.de/jane/}
+\vspace{0.5em}
+\item GIZA++。GIZA++\upcite{och2003systematic}是Franz Och研发的用于训练IBM模型1-5和HMM单词对齐模型的工具包。在早期，GIZA++是所有统计机器翻译系统中词对齐的标配工具。网址：\url{https://github.com/moses-smt/giza-pp}
+\vspace{0.5em}
+\item FastAlign。FastAlign\upcite{DBLP:conf/naacl/DyerCS13}是一个快速，无监督的词对齐工具，由卡内基梅隆大学开发。网址：\url{https://github.com/clab/fast\_align}
+\vspace{0.5em}
+\end{itemize}
+
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+\section{神经机器翻译开源系统}
+
+\begin{itemize}
+\vspace{0.5em}
+\item GroundHog。GroundHog\upcite{bahdanau2014neural}基于Theano\upcite{al2016theano}框架，由蒙特利尔大学LISA 实验室使用Python语言编写的一个框架，旨在提供灵活而高效的方式来实现复杂的循环神经网络模型。它提供了包括LSTM在内的多种模型。Bahdanau等人在此框架上又编写了GroundHog神经机器翻译系统。该系统也作为了很多论文的基线系统。网址：\url{https://github.com/lisa-groundhog/GroundHog}
+\vspace{0.5em}
+\item Nematus。Nematus\upcite{DBLP:journals/corr/SennrichFCBHHJL17}是英国爱丁堡大学开发的，基于Theano框架的神经机器翻译系统。该系统使用GRU作为隐层单元，支持多层网络。Nematus 编码端有正向和反向的编码方式，可以同时提取源语句子中的上下文信息。该系统的一个优点是，它可以支持输入端有多个特征的输入（例如词的词性等）。网址：\url{https://github.com/EdinburghNLP/nematus}
+\vspace{0.5em}
+\item ZophRNN。ZophRNN\upcite{zoph2016simple}是由南加州大学的Barret Zoph 等人使用C++语言开发的系统。Zoph既可以训练序列表示模型（如语言模型），也可以训练序列到序列的模型（如神经机器翻译模型）。当训练神经机器翻译系统时，ZophRNN也支持多源输入。网址：\url{https://github.com/isi-nlp/Zoph\_RNN}
+\vspace{0.5em}
+\item Fairseq。Fairseq\upcite{Ottfairseq}是由Facebook开发的，基于PyTorch框架的用以解决序列到序列问题的工具包，其中包括基于卷积神经网络、基于循环神经网络、基于Transformer的模型等。Fairseq是当今使用最广泛的神经机器翻译开源系统之一。网址：\url{https://github.com/facebookresearch/fairseq}
+\vspace{0.5em}
+\item Tensor2Tensor。Tensor2Tensor\upcite{Vaswani2018Tensor2TensorFN}是由谷歌推出的，基于TensorFlow框架的开源系统。该系统基于Transformer模型，因此可以支持大多数序列到序列任务。得益于Transformer 的网络结构，系统的训练速度较快。现在，Tensor2Tensor也是机器翻译领域广泛使用的开源系统之一。网址：\url{https://github.com/tensorflow/tensor2tensor}
+\vspace{0.5em}
+\item OpenNMT。OpenNMT\upcite{KleinOpenNMT}系统是由哈佛大学自然语言处理研究组开源的，基于Torch框架的神经机器翻译系统。OpenNMT系统的早期版本使用Lua 语言编写，现在也扩展到了TensorFlow和PyTorch，设计简单易用，易于扩展，同时保持效率和翻译精度。网址：\url{https://github.com/OpenNMT/OpenNMT}
+\vspace{0.5em}
+\item 斯坦福神经机器翻译开源代码库。斯坦福大学自然语言处理组（Stanford NLP）发布了一篇教程，介绍了该研究组在神经机器翻译上的研究信息，同时实现了多种翻译模型\upcite{luong2016acl_hybrid}。 网址：\url{https://nlp.stanford.edu/projects/nmt/}
+\vspace{0.5em}
+\item THUMT。清华大学NLP团队实现的神经机器翻译系统，支持Transformer等模型\upcite{ZhangTHUMT}。该系统主要基于TensorFlow和Theano实现，其中Theano版本包含了RNNsearch模型，训练方式包括MLE （Maximum Likelihood Estimate）, MRT（Minimum Risk Training）, SST（Semi-Supervised Training）。TensorFlow 版本实现了Seq2Seq, RNNsearch, Transformer三种基本模型。网址：\url{https://github.com/THUNLP-MT/THUMT}
+\vspace{0.5em}
+\item NiuTrans.NMT。由小牛翻译团队基于NiuTensor实现的神经机器翻译系统。支持循环神经网络、Transformer等结构，并支持语言建模、序列标注、机器翻译等任务。支持机器翻译GPU与CPU 训练及解码。其小巧易用，为开发人员提供快速二次开发基础。此外，NiuTrans.NMT已经得到了大规模应用，形成了支持304种语言翻译的小牛翻译系统。网址：\url{http://opensource.niutrans.com/niutensor/index.html}
+\vspace{0.5em}
+\item MARIAN。主要由微软翻译团队搭建\upcite{JunczysMarian}，其使用C++实现的用于GPU/CPU训练和解码的引擎，支持多GPU训练和批量解码，最小限度依赖第三方库，静态编译一次之后，复制其二进制文件就能在其他平台使用。网址：\url{https://marian-nmt.github.io/}
+\vspace{0.5em}
+\item Sockeye。由Awslabs开发的神经机器翻译框架\upcite{hieber2017sockeye}。其中支持RNNsearch、Transformer、CNN等翻译模型，同时提供了从图片翻译到文字的模块以及WMT 德英新闻翻译、领域适应任务、多语言零资源翻译任务的教程。网址：\url{https://awslabs.github.io/sockeye/}
+\vspace{0.5em}
+\item CytonMT。由NICT开发的一种用C++实现的神经机器翻译开源工具包\upcite{WangCytonMT}。主要支持Transformer模型，并支持一些常用的训练方法以及解码方法。网址：\url{https://github.com/arthurxlw/cytonMt}
+\vspace{0.5em}
+\item OpenSeq2Seq。由NVIDIA团队开发的\upcite{DBLP:journals/corr/abs-1805-10387}基于TensorFlow的模块化架构，用于序列到序列的模型，允许从可用组件中组装新模型，支持混合精度训练，利用NVIDIA Volta Turing GPU中的Tensor核心，基于Horovod的快速分布式训练，支持多GPU，多节点多模式。网址：\url{https://nvidia.github.io/OpenSeq2Seq/html/index.html}
+\vspace{0.5em}
+\item NMTPyTorch。由勒芒大学语言实验室发布的基于序列到序列框架的神经网络翻译系统\upcite{nmtpy2017}，NMTPyTorch的核心部分依赖于Numpy，PyTorch和tqdm。其允许训练各种端到端神经体系结构，包括但不限于神经机器翻译、图像字幕和自动语音识别系统。网址：\url{https://github.com/lium-lst/nmtpytorch}
+\vspace{0.5em}
+\end{itemize}
+
+
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+\section{公开评测任务}
+\parinterval 机器翻译相关评测主要有两种组织形式，一种是由政府及国家相关机构组织，权威性强。如由美国国家标准技术研究所组织的NIST评测、日本国家科学咨询系统中心主办的NACSIS Test Collections for IR（NTCIR）PatentMT、日本科学振兴机构（Japan Science and Technology Agency，简称JST）等组织联合举办的Workshop on Asian Translation（WAT）以及国内由中文信息学会主办的全国机器翻译大会（China Conference on Machine Translation，简称CCMT）；另一种是由相关学术机构组织，具有领域针对性的特点，如倾向新闻领域的Conference on Machine Translation（WMT）以及面向口语的International Workshop on Spoken Language Translation（IWSLT）。下面将针对上述评测进行简要介绍。
+
+\begin{itemize}
+\vspace{0.5em}
+\item CCMT（全国机器翻译大会），前身为CWMT（全国机器翻译研讨会）是国内机器翻译领域的旗舰会议，自2005年起已经组织多次机器翻译评测，对国内机器翻译相关技术的发展产生了深远影响。该评测主要针对汉语、英语以及国内的少数民族语言（蒙古语、藏语、维吾尔语等）进行评测，领域包括新闻、口语、政府文件等，不同语言方向对应的领域也有所不同。评价方式不同届略有不同，主要采用自动评价的方式，自CWMT\ 2013起则针对某些领域增设人工评价。自动评价的指标一般包括BLEU-SBP、BLEU-NIST、TER、METEOR、NIST、GTM、mWER、mPER 以及ICT 等，其中以BLEU-SBP 为主，汉语为目标语的翻译采用基于字符的评价方式，面向英语的翻译采用基于词的评价方式。每年该评测吸引国内外近数十家企业及科研机构参赛，业内认可度极高。关于CCMT的更多信息可参考中文信息学会机器翻译专业委员会相关页面：\url{http://sc.cipsc.org.cn/mt/index.php/CWMT.html}。
+\vspace{0.5em}
+\item WMT由Special Interest Group for Machine Translation（SIGMT）主办，会议自2006年起每年召开一次，是一个涉及机器翻译多种任务的综合性会议，包括多领域翻译评测任务、质量评价任务以及其他与机器翻译的相关任务（如文档对齐评测等）。现在WMT已经成为机器翻译领域的旗舰评测会议，很多研究工作都以WMT评测结果作为基准。WMT评测涉及的语言范围较广，包括英语、德语、芬兰语、捷克语、罗马尼亚语等十多种语言，翻译方向一般以英语为核心，探索英语与其他语言之间的翻译性能，领域包括新闻、信息技术、生物医学。最近，也增加了无指导机器翻译等热门问题。WMT在评价方面类似于CCMT，也采用人工评价与自动评价相结合的方式，自动评价的指标一般为BLEU、TER 等。此外，WMT公开了所有评测数据，因此也经常被机器翻译相关人员所使用。更多WMT的机器翻译评测相关信息可参考SIGMT官网：\url{http://www.sigmt.org/}。
+\vspace{0.5em}
+\item NIST机器翻译评测开始于2001年，是早期机器翻译公开评测中颇具代表性的任务，现在WMT和CCMT很多任务的设置也大量参考了当年NIST评测的内容。NIST评测由美国国家标准技术研究所主办，作为美国国防高级计划署（DARPA）中TIDES计划的重要组成部分。早期，NIST评测主要评价阿拉伯语和汉语等语言到英语的翻译效果，评价方法一般采用人工评价与自动评价相结合的方式。人工评价采用5分制评价。自动评价使用多种方式，包括BLEU，METEOR，TER以及HyTER。此外NIST从2016 年起开始对稀缺语言资源技术进行评估，其中机器翻译作为其重要组成部分共同参与评测，评测指标主要为BLEU。除对机器翻译系统进行评测之外，NIST在2008 和2010年对于机器翻译的自动评价方法（MetricsMaTr）也进行了评估，以鼓励更多研究人员对现有评价方法进行改进或提出更加贴合人工评价的方法。同时NIST评测所提供的数据集由于数据质量较高受到众多科研人员喜爱，如MT04，MT06等（汉英）平行语料经常被科研人员在实验中使用。不过，近几年NIST评测已经停止。更多NIST的机器翻译评测相关信息可参考官网：\url{https://www.nist.gov/programs-projects/machine-translation}。
+\vspace{0.5em}
+\item 从2004年开始举办的IWSLT也是颇具特色的机器翻译评测，它主要关注口语相关的机器翻译任务，测试数据包括TED talks的多语言字幕以及QED 教育讲座影片字幕等，语言涉及英语、法语、德语、捷克语、汉语、阿拉伯语等众多语言。此外在IWSLT 2016 中还加入了对于日常对话的翻译评测，尝试将微软Skype中一种语言的对话翻译成其他语言。评价方式采用自动评价的模式，评价标准和WMT类似，一般为BLEU 等指标。另外，IWSLT除了对文本到文本的翻译评测外，还有自动语音识别以及语音转另一种语言的文本的评测。更多IWSLT的机器翻译评测相关信息可参考IWSLT\ 2019官网：\url{https://workshop2019.iwslt.org/}。
+\vspace{0.5em}
+\item 日本举办的机器翻译评测WAT是亚洲范围内的重要评测之一，由日本科学振兴机构（JST）、情报通信研究机构（NICT）等多家机构共同组织，旨在为亚洲各国之间交流融合提供便宜之处。语言方向主要包括亚洲主流语言（汉语、韩语、印地语等）以及英语对日语的翻译，领域丰富多样，包括学术论文、专利、新闻、食谱等。评价方式包括自动评价（BLEU、RIBES以及AMFM 等）以及人工评价，其特点在于对于测试语料以段落为单位进行评价，考察其上下文关联的翻译效果。更多WAT的机器翻译评测相关信息可参考官网：\url{http://lotus.kuee.kyoto-u.ac.jp/WAT/}。
+\vspace{0.5em}
+\item NTCIR计划是由日本国家科学咨询系统中心策划主办的，旨在建立一个用在自然语言处理以及信息检索相关任务上的日文标准测试集。在NTCIR-9和NTCIR-10中开设的Patent Machine Translation（PatentMT）任务主要针对专利领域进行翻译测试，其目的在于促进机器翻译在专利领域的发展和应用。在NTCIR-9中，评测方式采取人工评价与自动评价相结合，以人工评价为主导。人工评价主要根据准确度和流畅度进行评估，自动评价采用BLEU、NIST等方式进行。NTCIR-10评价方式在此基础上增加了专利审查评估、时间评估以及多语种评估，分别考察机器翻译系统在专利领域翻译的实用性、耗时情况以及不同语种的翻译效果等。更多NTCIR评测相关信息可参考官网：\url{http://research.nii.ac.jp/ntcir/index-en.html}。
+\vspace{0.5em}
+\end{itemize}
+
+\parinterval 以上评测数据大多可以从评测网站上下载，此外部分数据也可以从LDC（Lingu-istic Data Consortium）上申请，网址为\url{https://www.ldc.upenn.edu/}。ELRA（Euro-pean Language Resources Association）上也有一些免费的语料库供研究使用，其官网为\url{http://www.elra.info/}。从机器翻译发展的角度看，这些评测任务给相关研究提供了基准数据集，使得不同的系统都可以在同一个环境下进行比较和分析，进而建立了机器翻译研究所需的实验基础。此外，公开评测也使得研究者可以第一时间了解机器翻译研究的最新成果，比如，有多篇ACL会议最佳论文的灵感就来自当年参加机器翻译评测任务的系统。
+
+\end{appendices}
+%----------------------------------------------------------------------------------------
+%	CHAPTER  APPENDIX B
+%----------------------------------------------------------------------------------------
+
+\begin{appendices}
+\chapter{附录B}
+\label{appendix-B}
 \parinterval 在构建机器翻译系统的过程中，数据是必不可少的，尤其是现在主流的神经机器翻译系统，系统的性能往往受限于语料库规模和质量。所幸的是，随着语料库语言学的发展，一些主流语种的相关语料资源已经十分丰富。

 \parinterval 为了方便读者进行相关研究，我们汇总了几个常用的基准数据集，这些数据集已经在机器翻译领域中被广泛使用，有很多之前的相关工作可以进行复现和对比。同时，我们收集了一下常用的平行语料，方便读者进行一些探索。
@@ -161,12 +265,12 @@
 \end{appendices}

 %----------------------------------------------------------------------------------------
-%	CHAPTER  APPENDIX B
+%	CHAPTER  APPENDIX C
 %----------------------------------------------------------------------------------------

 \begin{appendices}
-\chapter{附录B}
-\label{appendix-B}
+\chapter{附录C}
+\label{appendix-C}

 %----------------------------------------------------------------------------------------
 %    NEW SECTION

--- a/bibliography.bib
+++ b/bibliography.bib
@@ -3854,16 +3854,11 @@ year = {2012}
 %%%%% chapter 9------------------------------------------------------
 @article{brown1992class,
  title={Class-based n-gram models of natural language},
-  author={Brown and
-              Peter F and
-              Desouza and
-              Peter V and
-              Mercer amd
-              Robert L
-              and Pietra and
-              Vincent J Della
-              and Lai and
-              Jenifer C},
+  author={Peter F. Brown and
+               Vincent J. Della Pietra and
+               Peter V. De Souza and
+               Jennifer C. Lai and
+               Robert L. Mercer},
  journal={Computational linguistics},
  volume={18},
  number={4},
@@ -3873,10 +3868,8 @@ year = {2012}

 @inproceedings{mikolov2012context,
  title={Context dependent recurrent neural network language model},
-  author={Mikolov and
-            Tomas and
-            Zweig and
-            Geoffrey},
+  author={Tomas Mikolov and
+               Geoffrey Zweig},
  publisher={IEEE Spoken Language Technology Workshop},
  pages={234--239},
  year={2012}
@@ -3884,38 +3877,28 @@ year = {2012}

 @article{zaremba2014recurrent,
  title={Recurrent Neural Network Regularization},
-  author={Zaremba and
-             Wojciech and
-             Sutskever and
-             Ilya and
-             Vinyals and
-             Oriol},
+  author={Wojciech Zaremba and
+               Ilya Sutskever and
+               Oriol Vinyals},
  journal={arXiv: Neural and Evolutionary Computing},
  year={2014}
 }

 @article{zilly2016recurrent,
  title={Recurrent Highway Networks},
-  author={Zilly and
-            Julian and
-            Srivastava and
-            Rupesh Kumar and
-            Koutnik and
-            Jan and
-            Schmidhuber and
-            Jurgen},
+  author={Julian G. Zilly and
+               Rupesh Kumar Srivastava and
+               Jan Koutn{\'{\i}}k and
+               J{\"{u}}rgen Schmidhuber},
  journal={International Conference on Machine Learning},
  year={2016}
 }

 @article{merity2017regularizing,
  title={Regularizing and optimizing LSTM language models},
-  author={Merity and
-             tephen and
-             Keskar and
-             Nitish Shirish and
-             Socher and
-             Richard},
+  author={Stephen Merity and
+               Nitish Shirish Keskar and
+               Richard Socher},
  journal={International Conference on Learning Representations},
  year={2017}
 }
@@ -3993,7 +3976,7 @@ year = {2012}
 @article{Ba2016LayerN,
  author    = {Lei Jimmy Ba and
               Jamie Ryan Kiros and
-               Geoffrey E. Hinton},
+               Geoffrey Hinton},
  title     = {Layer Normalization},
  journal   = {CoRR},
  volume    = {abs/1607.06450},
@@ -4018,7 +4001,7 @@ year = {2012}
               Satoshi Nakamura},
  title     = {Incorporating Discrete Translation Lexicons into Neural Machine Translation},
  pages     = {1557--1567},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2016}
 }

@@ -4062,7 +4045,7 @@ year = {2012}
  year={2011}
 }
 @inproceedings{mccann2017learned,
-  author    = {Bryan McCann and
+  author    = {Bryan Mccann and
               James Bradbury and
               Caiming Xiong and
               Richard Socher},
@@ -4081,15 +4064,15 @@ year = {2012}
 		  Matt Gardner and 
 		  Christopher Clark and 
 		  Kenton Lee and 
-		  L. Zettlemoyer},
-  publisher={arXiv preprint arXiv:1802.05365},
+		  Luke Zettlemoyer},
+  publisher={Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
  year={2018}
 }


 @article{Graves2013HybridSR,
  title={Hybrid speech recognition with Deep Bidirectional LSTM},
-  author={A. Graves and 
+  author={Alex Graves and 
          Navdeep Jaitly and 
 		  Abdel-rahman Mohamed},
  publisher={IEEE Workshop on Automatic Speech Recognition and Understanding},
@@ -4101,7 +4084,7 @@ year = {2012}
  title={Character-Word LSTM Language Models},
  author={Lyan Verwimp and 
          Joris Pelemans and 
-		  H. V. Hamme and 
+		  Hugo Van Hamme and 
 		  Patrick Wambacq},
  publisher={European Association of Computational Linguistics},
  year={2017}
@@ -4112,7 +4095,7 @@ year = {2012}
               Kyunghyun Cho},
  title     = {Gated Word-Character Recurrent Language Model},
  pages     = {1992--1997},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2016}
 }
 @inproceedings{Hwang2017CharacterlevelLM,
@@ -4146,7 +4129,7 @@ year = {2012}
  title={Larger-Context Language Modelling},
  author={Tian Wang and 
          Kyunghyun Cho},
-  journal={arXiv preprint arXiv:1511.03729},
+  journal={Annual Meeting of the Association for Computational Linguistics},
  year={2015}
 }
 @article{Adel2015SyntacticAS,
@@ -4174,7 +4157,7 @@ year = {2012}
 }
 @inproceedings{Pham2016ConvolutionalNN,
  title={Convolutional Neural Network Language Models},
-  author={Ngoc-Quan Pham and 
+  author={Ngoc-quan Pham and 
          German Kruszewski and 
 		  Gemma Boleda},
  publisher={Conference on Empirical Methods in Natural Language Processing},
@@ -4268,9 +4251,9 @@ year = {2012}
 @inproceedings{Bastings2017GraphCE,
  title={Graph Convolutional Encoders for Syntax-aware Neural Machine Translation},
  author={Jasmijn Bastings and 
-          Ivan Titov and W. Aziz and 
+          Ivan Titov and Wilker Aziz and 
 		  Diego Marcheggiani and 
-		  K. Sima'an},
+		  Khalil Sima'an},
  publisher={Conference on Empirical Methods in Natural Language Processing},
  year={2017}
 }
@@ -4727,8 +4710,8 @@ author    = {Yoshua Bengio and
               Quoc V. Le and
               Ruslan Salakhutdinov},
  title     = {Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context},
-  journal   = {CoRR},
-  volume    = {abs/1901.02860},
+  journal   = {Annual Meeting of the Association for Computational Linguistics},
+  pages     = {2978--2988},
  year      = {2019}
 }
 @inproceedings{li-etal-2019-word,
@@ -4810,7 +4793,7 @@ author    = {Yoshua Bengio and
  year      = {2017}
 }
 @article{Hinton2015Distilling,
-  author    = {Geoffrey E. Hinton and
+  author    = {Geoffrey Hinton and
               Oriol Vinyals and
               Jeffrey Dean},
  title     = {Distilling the Knowledge in a Neural Network},
@@ -4821,7 +4804,7 @@ author    = {Yoshua Bengio and

 @inproceedings{Ott2018ScalingNM,
  title={Scaling Neural Machine Translation},
-  author={Myle Ott and Sergey Edunov and David Grangier and M. Auli},
+  author={Myle Ott and Sergey Edunov and David Grangier and Michael Auli},
  publisher={Annual Meeting of the Association for Computational Linguistics},
  year={2018}
 }
@@ -4842,7 +4825,7 @@ author    = {Yoshua Bengio and
               Alexander M. Rush},
  title     = {Sequence-Level Knowledge Distillation},
  pages     = {1317--1327},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2016}
 }
 @article{Akaike1969autoregressive,
@@ -4878,7 +4861,7 @@ author    = {Yoshua Bengio and
 }
 @inproceedings{He2018LayerWiseCB,
  title={Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation},
-  author={Tianyu He and X. Tan and Yingce Xia and D. He and T. Qin and Zhibo Chen and T. Liu},
+  author={Tianyu He and Xu Tan and Yingce Xia and Di He and Tao Qin and Zhibo Chen and Tie-Yan Liu},
  publisher={Conference on Neural Information Processing Systems},
  year={2018}
 }
@@ -4956,7 +4939,7 @@ author    = {Yoshua Bengio and
               Deyi Xiong},
  title     = {Encoding Gated Translation Memory into Neural Machine Translation},
  pages     = {3042--3047},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2018}
 }
 @inproceedings{yang-etal-2016-hierarchical,
@@ -4968,7 +4951,7 @@ author    = {Yoshua Bengio and
               Eduard H. Hovy},
  title     = {Hierarchical Attention Networks for Document Classification},
  pages     = {1480--1489},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
  year      = {2016}
 }
 %%%%% chapter 10------------------------------------------------------
@@ -4983,7 +4966,7 @@ author    = {Yoshua Bengio and
               Jian Sun},
  title     = {Faster {R-CNN:} Towards Real-Time Object Detection with Region Proposal
               Networks},
-  journal   = {Institute of Electrical and Electronics Engineers},
+  journal   = {{IEEE} Transactions on Pattern Analysis and Machine Intelligence},
  volume    = {39},
  number    = {6},
  pages     = {1137--1149},
@@ -5002,7 +4985,6 @@ author    = {Yoshua Bengio and
  publisher = {European Conference on Computer Vision},
  volume    = {9905},
  pages     = {21--37},
-  publisher = {Springer},
  year      = {2016}
 }

@@ -5028,7 +5010,7 @@ author    = {Yoshua Bengio and
               Qun Liu},
  title     = {genCNN: {A} Convolutional Architecture for Word Sequence Prediction},
  pages     = {1567--1576},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2015}
 }

@@ -5038,7 +5020,7 @@ author    = {Yoshua Bengio and
               Navdeep Jaitly},
  title     = {Very deep convolutional networks for end-to-end speech recognition},
  pages     = {4845--4849},
-  publisher = {Institute of Electrical and Electronics Engineers},
+  publisher = {International Conference on Acoustics, Speech and Signal Processing},
  year      = {2017}
 }

@@ -5049,7 +5031,7 @@ author    = {Yoshua Bengio and
  title     = {A deep convolutional neural network using heterogeneous pooling for
               trading acoustic invariance with phonetic confusion},
  pages     = {6669--6673},
-  publisher = {Institute of Electrical and Electronics Engineers},
+  publisher = {International Conference on Acoustics, Speech and Signal Processing},
  year      = {2013}
 }

@@ -5058,8 +5040,7 @@ author    = {Yoshua Bengio and
               Hieu Pham and
               Christopher D. Manning},
  title     = {Effective Approaches to Attention-based Neural Machine Translation},
-  publisher = {Conference on Empirical Methods in Natural
-               Language Processing},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  pages     = {1412--1421},
  year      = {2015}
 }
@@ -5083,7 +5064,7 @@ author    = {Yoshua Bengio and
  title     = {Leveraging Linguistic Structures for Named Entity Recognition with
               Bidirectional Recursive Neural Networks},
  pages     = {2664--2669},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2017}
 }

@@ -5099,10 +5080,10 @@ author    = {Yoshua Bengio and
  author    = {Emma Strubell and
               Patrick Verga and
               David Belanger and
-               Andrew McCallum},
+               Andrew Mccallum},
  title     = {Fast and Accurate Entity Recognition with Iterated Dilated Convolutions},
  pages     = {2670--2680},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2017}
 }

@@ -5168,7 +5149,7 @@ author    = {Yoshua Bengio and
               Tommi S. Jaakkola},
  title     = {Molding CNNs for text: non-linear, non-consecutive convolutions},
  pages     = {1565--1575},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2015}
 }

@@ -5178,7 +5159,7 @@ author    = {Yoshua Bengio and
  title     = {Effective Use of Word Order for Text Categorization with Convolutional
               Neural Networks},
  pages     = {103--112},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
  year      = {2015}
 }

@@ -5187,7 +5168,7 @@ author    = {Yoshua Bengio and
               Ralph Grishman},
  title     = {Relation Extraction: Perspective from Convolutional Neural Networks},
  pages     = {39--48},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
  year      = {2015}
 }

@@ -5205,7 +5186,7 @@ author    = {Yoshua Bengio and
               Barry Haddow and
               Alexandra Birch},
  title     = {Improving Neural Machine Translation Models with Monolingual Data},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2016}
 }

@@ -5220,7 +5201,7 @@ author    = {Yoshua Bengio and

 @article{Waibel1989PhonemeRU,
  title={Phoneme recognition using time-delay neural networks},
-  author={Alexander Waibel and Toshiyuki Hanazawa and Geoffrey Everest Hinton and Kiyohiro Shikano and K.J. Lang},
+  author={Alexander Waibel and Toshiyuki Hanazawa and Geoffrey Hinton and Kiyohiro Shikano and Kevin J. Lang},
  journal={IEEE Transactions on Acoustics, Speech, and Signal Processing},
  year={1989},
  volume={37},
@@ -5229,7 +5210,7 @@ author    = {Yoshua Bengio and

 @article{LeCun1989BackpropagationAT,
  title={Backpropagation Applied to Handwritten Zip Code Recognition},
-  author={Yann LeCun and Bernhard Boser and John Denker and Don Henderson and R.E.Howard and W.E. Hubbard and Larry Jackel},
+  author={Yann LeCun and Bernhard Boser and John Denker and Don Henderson and Richard E. Howard and Wayne E. Hubbard and Larry Jackel},
  journal={Neural Computation},
  year={1989},
  volume={1},
@@ -5243,7 +5224,7 @@ author    = {Yoshua Bengio and
  year={1998},
  volume={86},
  number={11},
-  pages={2278-2324},
+  pages={2278-2324}
 }

 @inproceedings{DBLP:journals/corr/HeZRS15,
@@ -5254,7 +5235,7 @@ author    = {Yoshua Bengio and
  title     = {Deep Residual Learning for Image Recognition},
  publisher = {{IEEE} Conference on Computer Vision and Pattern Recognition},
  pages     = {770--778},
-  year      = {2016},
+  year      = {2016}
 }

 @inproceedings{DBLP:conf/cvpr/HuangLMW17,
@@ -5279,10 +5260,9 @@ author    = {Yoshua Bengio and
 @article{He2020MaskR,
  title={Mask R-CNN},
  author={Kaiming He and Georgia Gkioxari and Piotr Doll{\'a}r and Ross B. Girshick},
-  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
-  year={2020},
-  volume={42},
-  pages={386-397}
+  journal={International Conference on Computer Vision},
+  pages={2961--2969},
+  year={2017}
 }

 @inproceedings{Kalchbrenner2014ACN,
@@ -5317,7 +5297,7 @@ author    = {Yoshua Bengio and
  author    = {C{\'{\i}}cero Nogueira dos Santos and
               Maira Gatti},
  pages     = {69--78},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {International Conference on Computational Linguistics},
  year={2014}
 }

@@ -5374,7 +5354,7 @@ author    = {Yoshua Bengio and
 		 Michael Auli},
 title = {Pay Less Attention with Lightweight and Dynamic Convolutions},
 publisher = {International Conference on Learning Representations},
- year = {2019},
+ year = {2019}
 }

 @inproceedings{kalchbrenner-blunsom-2013-recurrent,
@@ -5382,8 +5362,8 @@ author    = {Yoshua Bengio and
               Phil Blunsom},
  title     = {Recurrent Continuous Translation Models},
  pages     = {1700--1709},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
-  year      = {2013},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2013}
 }

 @article{Wu2016GooglesNM,
@@ -5459,7 +5439,7 @@ author    = {Yoshua Bengio and
  author    = {Ilya Sutskever and
               James Martens and
               George E. Dahl and
-               Geoffrey Everest Hinton},
+               Geoffrey Hinton},
  publisher = {International Conference on Machine Learning},
  pages     = {1139--1147},
  year={2013}
@@ -5474,7 +5454,7 @@ author    = {Yoshua Bengio and
 }

 @article{JMLR:v15:srivastava14a,
-  author  = {Nitish Srivastava and Geoffrey Everest Hinton and Alex Krizhevsky and Ilya Sutskever and Ruslan Salakhutdinov},
+  author  = {Nitish Srivastava and Geoffrey Hinton and Alex Krizhevsky and Ilya Sutskever and Ruslan Salakhutdinov},
  title   = {Dropout: A Simple Way to Prevent Neural Networks from Overfitting},
  journal = {Journal of Machine Learning Research},
  year    = {2014},
@@ -5508,7 +5488,7 @@ author    = {Yoshua Bengio and
  title={Rigid-motion scattering for image classification},
  author={Sifre, Laurent and Mallat, St{\'e}phane},
  year={2014},
-  publisher={Citeseer}
+  journal={Citeseer}
 }

 @article{Taigman2014DeepFaceCT,
@@ -5567,7 +5547,7 @@ author    = {Yoshua Bengio and
               Tong Zhang},
  title     = {Deep Pyramid Convolutional Neural Networks for Text Categorization},
  pages     = {562--570},
-  publisher = {Association for Computational Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }

@@ -5596,7 +5576,7 @@ author    = {Yoshua Bengio and
  title     = {Speech-Transformer: {A} No-Recurrence Sequence-to-Sequence Model for
               Speech Recognition},
  pages     = {5884--5888},
-  publisher = {Institute of Electrical and Electronics Engineers},
+  publisher = {International Conference on Acoustics, Speech and Signal Processing},
  year      = {2018}
 }

@@ -5774,14 +5754,14 @@ author    = {Yoshua Bengio and
 }
 @article{Liu2020LearningTE,
 	title={Learning to Encode Position for Transformer with Continuous Dynamical Model},
-	author={Xuanqing Liu and Hsiang-Fu Yu and I. Dhillon and Cho-Jui Hsieh},
+	author={Xuanqing Liu and Hsiang-Fu Yu and Inderjit Dhillon and Cho-Jui Hsieh},
 	journal={ArXiv},
 	year={2020},
 	volume={abs/2003.09229}
 }
 @inproceedings{Jawahar2019WhatDB,
 	title={What Does BERT Learn about the Structure of Language?},
-	author={Ganesh Jawahar and B. Sagot and Djam{\'e} Seddah},
+	author={Ganesh Jawahar and Beno{\^{\i}}t Sagot and Djam{\'e} Seddah},
 	publisher={Annual Meeting of the Association for Computational Linguistics},
 	year={2019}
 }
@@ -8136,7 +8116,7 @@ author    = {Zhuang Liu and
  author    = {Ivan Vulic and
               Anna Korhonen},
  title     = {On the Role of Seed Lexicons in Learning Bilingual Word Embeddings},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2016}
 }
 @inproceedings{DBLP:conf/iclr/SmithTHH17,
@@ -8782,7 +8762,7 @@ author    = {Zhuang Liu and
  title     = {Using Context Vectors in Improving a Machine Translation System with
               Bridge Language},
  pages     = {318--322},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2013}
 }
 @inproceedings{DBLP:conf/emnlp/ZhuHWZWZ14,
@@ -8806,7 +8786,7 @@ author    = {Zhuang Liu and
               Satoshi Nakamura},
  title     = {Improving Pivot Translation by Remembering the Pivot},
  pages     = {573--577},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2015}
 }
 @inproceedings{DBLP:conf/acl/CohnL07,
@@ -8832,7 +8812,7 @@ author    = {Zhuang Liu and
               Haifeng Wang},
  title     = {Revisiting Pivot Language Approach for Machine Translation},
  pages     = {154--162},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2009}
 }
 @article{DBLP:journals/corr/ChengLYSX16,
@@ -8877,7 +8857,7 @@ author    = {Zhuang Liu and
               Rafael E. Banchs},
  title     = {Enhancing scarce-resource language translation through pivot combinations},
  pages     = {1361--1365},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2011}
 }
 @article{DBLP:journals/corr/HintonVD15,
@@ -8924,7 +8904,7 @@ author    = {Zhuang Liu and
               Haifeng Wang},
  title     = {Multi-Task Learning for Multiple Language Translation},
  pages     = {1723--1732},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2015}
 }
 @article{DBLP:journals/tacl/LeeCH17,
@@ -9504,7 +9484,7 @@ author    = {Zhuang Liu and
               Xiaohua Liu and
               Hang Li},
  title     = {Modeling Coverage for Neural Machine Translation},
-  publisher = {The Association for Computer Linguistics},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2016}
 }
 @article{DBLP:journals/tacl/TuLLLL17,
@@ -10555,3 +10535,429 @@ author    = {Zhuang Liu and

 %%%%% chapter 18------------------------------------------------------
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%% chapter appendix-A------------------------------------------------------
+@inproceedings{Tong2012NiuTrans,
+  author    = {Tong Xiao and
+               Jingbo Zhu and
+               Hao Zhang and
+               Qiang Li},
+  title     = {NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based
+               Machine Translation},
+  pages     = {19--24},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2012}
+}
+
+@inproceedings{Li2010Joshua,
+  author    = {Zhifei Li and
+               Chris Callison-Burch and
+               Chris Dyer and
+               Sanjeev Khudanpur and
+               Lane Schwartz and
+               Wren N. G. Thornton and
+               Jonathan Weese and
+               Omar Zaidan},
+  title     = {Joshua: An Open Source Toolkit for Parsing-Based Machine Translation},
+  pages     = {135--139},
+  publisher = {Association for Computational Linguistics},
+  year      = {2009}
+}
+
+@inproceedings{iglesias2009hierarchical,
+  author    = {Gonzalo Iglesias and
+               Adri{\`{a}} de Gispert and
+               Eduardo Rodr{\'{\i}}guez Banga and
+               William J. Byrne},
+  title     = {Hierarchical Phrase-Based Translation with Weighted Finite State Transducers},
+  pages     = {433--441},
+  publisher = {The Association for Computational Linguistics},
+  year      = {2009}
+}
+
+@inproceedings{dyer2010cdec,
+  author    = {Chris Dyer and
+               Adam Lopez and
+               Juri Ganitkevitch and
+               Jonathan Weese and
+               Ferhan T{\"{u}}re and
+               Phil Blunsom and
+               Hendra Setiawan and
+               Vladimir Eidelman and
+               Philip Resnik},
+  title     = {cdec: {A} Decoder, Alignment, and Learning Framework for Finite-State
+               and Context-Free Translation Models},
+  pages     = {7--12},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2010}
+}
+
+@inproceedings{Cer2010Phrasal,
+  author    = {Daniel M. Cer and
+               Michel Galley and
+               Daniel Jurafsky and
+               Christopher D. Manning},
+  title     = {Phrasal: {A} Statistical Machine Translation Toolkit for Exploring
+               New Model Features},
+  pages     = {9--12},
+  publisher = {The Association for Computational Linguistics},
+  year      = {2010}
+}
+
+@article{vilar2012jane,
+  title={Jane: an advanced freely available hierarchical machine translation toolkit},
+  author={Vilar, David and Stein, Daniel and Huck, Matthias and Ney, Hermann},
+  journal={Machine Translation},
+  volume={26},
+  number={3},
+  pages={197--216},
+  year={2012}
+}
+
+@inproceedings{DBLP:conf/naacl/DyerCS13,
+  author    = {Chris Dyer and
+               Victor Chahuneau and
+               Noah A. Smith},
+  title     = {A Simple, Fast, and Effective Reparameterization of {IBM} Model 2},
+  pages     = {644--648},
+  publisher = {Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
+  year      = {2013}
+}
+
+@article{al2016theano,
+  author    = {Rami Al-Rfou and
+               Guillaume Alain and
+               Amjad Almahairi and
+               Christof Angerm{\"{u}}ller and
+               Dzmitry Bahdanau and
+               Nicolas Ballas and
+               Fr{\'{e}}d{\'{e}}ric Bastien and
+               Justin Bayer and
+               Anatoly Belikov and
+               Alexander Belopolsky and
+               Yoshua Bengio and
+               Arnaud Bergeron and
+               James Bergstra and
+               Valentin Bisson and
+               Josh Bleecher Snyder and
+               Nicolas Bouchard and
+               Nicolas Boulanger-Lewandowski and
+               Xavier Bouthillier and
+               Alexandre de Br{\'{e}}bisson and
+               Olivier Breuleux and
+               Pierre Luc Carrier and
+               Kyunghyun Cho and
+               Jan Chorowski and
+               Paul F. Christiano and
+               Tim Cooijmans and
+               Marc-Alexandre C{\^{o}}t{\'{e}} and
+               Myriam C{\^{o}}t{\'{e}} and
+               Aaron C. Courville and
+               Yann N. Dauphin and
+               Olivier Delalleau and
+               Julien Demouth and
+               Guillaume Desjardins and
+               Sander Dieleman and
+               Laurent Dinh and
+               Melanie Ducoffe and
+               Vincent Dumoulin and
+               Samira Ebrahimi Kahou and
+               Dumitru Erhan and
+               Ziye Fan and
+               Orhan Firat and
+               Mathieu Germain and
+               Xavier Glorot and
+               Ian J. Goodfellow and
+               Matthew Graham and
+               {\c{C}}aglar G{\"{u}}l{\c{c}}ehre and
+               Philippe Hamel and
+               Iban Harlouchet and
+               Jean-Philippe Heng and
+               Bal{\'{a}}zs Hidasi and
+               Sina Honari and
+               Arjun Jain and
+               S{\'{e}}bastien Jean and
+               Kai Jia and
+               Mikhail Korobov and
+               Vivek Kulkarni and
+               Alex Lamb and
+               Pascal Lamblin and
+               Eric Larsen and
+               C{\'{e}}sar Laurent and
+               Sean Lee and
+               Simon Lefran{\c{c}}ois and
+               Simon Lemieux and
+               Nicholas L{\'{e}}onard and
+               Zhouhan Lin and
+               Jesse A. Livezey and
+               Cory Lorenz and
+               Jeremiah Lowin and
+               Qianli Ma and
+               Pierre-Antoine Manzagol and
+               Olivier Mastropietro and
+               Robert McGibbon and
+               Roland Memisevic and
+               Bart van Merri{\"{e}}nboer and
+               Vincent Michalski and
+               Mehdi Mirza and
+               Alberto Orlandi and
+               Christopher Joseph Pal and
+               Razvan Pascanu and
+               Mohammad Pezeshki and
+               Colin Raffel and
+               Daniel Renshaw and
+               Matthew Rocklin and
+               Adriana Romero and
+               Markus Roth and
+               Peter Sadowski and
+               John Salvatier and
+               Fran{\c{c}}ois Savard and
+               Jan Schl{\"{u}}ter and
+               John Schulman and
+               Gabriel Schwartz and
+               Iulian Vlad Serban and
+               Dmitriy Serdyuk and
+               Samira Shabanian and
+               {\'{E}}tienne Simon and
+               Sigurd Spieckermann and
+               S. Ramana Subramanyam and
+               Jakub Sygnowski and
+               J{\'{e}}r{\'{e}}mie Tanguay and
+               Gijs van Tulder and
+               Joseph P. Turian and
+               Sebastian Urban and
+               Pascal Vincent and
+               Francesco Visin and
+               Harm de Vries and
+               David Warde-Farley and
+               Dustin J. Webb and
+               Matthew Willson and
+               Kelvin Xu and
+               Lijun Xue and
+               Li Yao and
+               Saizheng Zhang and
+               Ying Zhang},
+  title     = {Theano: {A} Python framework for fast computation of mathematical
+               expressions},
+  journal   = {CoRR},
+  volume    = {abs/1605.02688},
+  year      = {2016}
+}
+
+@inproceedings{DBLP:journals/corr/SennrichFCBHHJL17,
+  author    = {Rico Sennrich and
+               Orhan Firat and
+               Kyunghyun Cho and
+               Barry Haddow and
+               Alexandra Birch and
+               Julian Hitschler and
+               Marcin Junczys-Dowmunt and
+               Samuel L{\"{a}}ubli and
+               Antonio Valerio Miceli Barone and
+               Jozef Mokry and
+               Maria Nadejde},
+  title     = {Nematus: a Toolkit for Neural Machine Translation},
+  publisher = {Conference of the European Chapter of the Association for Computational Linguistics},
+  pages     = {65--68},
+  year      = {2017}
+}
+
+@inproceedings{Koehn2007Moses,
+  author    = {Philipp Koehn and
+               Hieu Hoang and
+               Alexandra Birch and
+               Chris Callison-Burch and
+               Marcello Federico and
+               Nicola Bertoldi and
+               Brooke Cowan and
+               Wade Shen and
+               Christine Moran and
+               Richard Zens and
+               Chris Dyer and
+               Ondrej Bojar and
+               Alexandra Constantin and
+               Evan Herbst},
+  title     = {Moses: Open Source Toolkit for Statistical Machine Translation},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2007}
+}
+
+@inproceedings{zollmann2007the,
+  author    = {Andreas Zollmann and
+               Ashish Venugopal and
+               Matthias Paulik and
+               Stephan Vogel},
+  title     = {The Syntax Augmented {MT} {(SAMT)} System at the Shared Task for the
+               2007 {ACL} Workshop on Statistical Machine Translation},
+  pages     = {216--219},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2007}
+}
+
+@article{och2003systematic,
+  author    = {Franz Josef Och and
+               Hermann Ney},
+  title     = {A Systematic Comparison of Various Statistical Alignment Models},
+  journal   = {Computational Linguistics},
+  volume    = {29},
+  number    = {1},
+  pages     = {19--51},
+  year      = {2003}
+}
+
+@inproceedings{zoph2016simple,
+  author    = {Barret Zoph and
+               Ashish Vaswani and
+               Jonathan May and
+               Kevin Knight},
+  title     = {Simple, Fast Noise-Contrastive Estimation for Large {RNN} Vocabularies},
+  pages     = {1217--1222},
+  publisher = {The Association for Computational Linguistics},
+  year      = {2016}
+}
+
+@inproceedings{Ottfairseq,
+  author    = {Myle Ott and
+               Sergey Edunov and
+               Alexei Baevski and
+               Angela Fan and
+               Sam Gross and
+               Nathan Ng and
+               David Grangier and
+               Michael Auli},
+  title     = {fairseq: {A} Fast, Extensible Toolkit for Sequence Modeling},
+  pages     = {48--53},
+  publisher = {Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
+  year      = {2019}
+}
+
+@inproceedings{Vaswani2018Tensor2TensorFN,
+  author    = {Ashish Vaswani and
+               Samy Bengio and
+               Eugene Brevdo and
+               Fran{\c{c}}ois Chollet and
+               Aidan N. Gomez and
+               Stephan Gouws and
+               Llion Jones and
+               Lukasz Kaiser and
+               Nal Kalchbrenner and
+               Niki Parmar and
+               Ryan Sepassi and
+               Noam Shazeer and
+               Jakob Uszkoreit},
+  title     = {Tensor2Tensor for Neural Machine Translation},
+  pages     = {193--199},
+  publisher = {Association for Machine Translation in the Americas},
+  year      = {2018}
+}
+
+@inproceedings{KleinOpenNMT,
+  author    = {Guillaume Klein and
+               Yoon Kim and
+               Yuntian Deng and
+               Jean Senellart and
+               Alexander M. Rush},
+  title     = {OpenNMT: Open-Source Toolkit for Neural Machine Translation},
+  pages     = {67--72},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2017}
+}
+
+@inproceedings{luong2016acl_hybrid,
+  author    = {Minh-Thang Luong and
+               Christopher D. Manning},
+  title     = {Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character
+               Models},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2016}
+}
+
+@article{ZhangTHUMT,
+  author    = {Jiacheng Zhang and
+               Yanzhuo Ding and
+               Shiqi Shen and
+               Yong Cheng and
+               Maosong Sun and
+               Huan-Bo Luan and
+               Yang Liu},
+  title     = {{THUMT:} An Open Source Toolkit for Neural Machine Translation},
+  journal   = {CoRR},
+  volume    = {abs/1706.06415},
+  year      = {2017}
+}
+
+@inproceedings{JunczysMarian,
+  author    = {Marcin Junczys-Dowmunt and
+               Roman Grundkiewicz and
+               Tomasz Dwojak and
+               Hieu Hoang and
+               Kenneth Heafield and
+               Tom Neckermann and
+               Frank Seide and
+               Ulrich Germann and
+               Alham Fikri Aji and
+               Nikolay Bogoychev and
+               Andr{\'{e}} F. T. Martins and
+               Alexandra Birch},
+  title     = {Marian: Fast Neural Machine Translation in {C++}},
+  pages     = {116--121},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2018}
+}
+
+@article{hieber2017sockeye,
+  author    = {Felix Hieber and
+               Tobias Domhan and
+               Michael Denkowski and
+               David Vilar and
+               Artem Sokolov and
+               Ann Clifton and
+               Matt Post},
+  title     = {Sockeye: {A} Toolkit for Neural Machine Translation},
+  journal   = {CoRR},
+  volume    = {abs/1712.05690},
+  year      = {2017}
+}
+
+@inproceedings{WangCytonMT,
+  author    = {Xiaolin Wang and
+               Masao Utiyama and
+               Eiichiro Sumita},
+  title     = {CytonMT: an Efficient Neural Machine Translation Open-source Toolkit
+               Implemented in {C++}},
+  pages     = {133--138},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2018}
+}
+
+@article{DBLP:journals/corr/abs-1805-10387,
+  author    = {Oleksii Kuchaiev and
+               Boris Ginsburg and
+               Igor Gitman and
+               Vitaly Lavrukhin and
+               Carl Case and
+               Paulius Micikevicius},
+  title     = {OpenSeq2Seq: extensible toolkit for distributed and mixed precision
+               training of sequence-to-sequence models},
+  journal   = {CoRR},
+  volume    = {abs/1805.10387},
+  year      = {2018}
+}
+
+@article{nmtpy2017,
+  author    = {Ozan Caglayan and
+               Mercedes Garc{\'{\i}}a-Mart{\'{\i}}nez and
+               Adrien Bardet and
+               Walid Aransa and
+               Fethi Bougares and
+               Lo{\"{\i}}c Barrault},
+  title     = {{NMTPY:} {A} Flexible Toolkit for Advanced Neural Machine Translation
+               Systems},
+  journal   = {Prague Bull. Math. Linguistics},
+  volume    = {109},
+  pages     = {15--28},
+  year      = {2017}
+}
+%%%%% chapter appendix-A------------------------------------------------------
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%