Commit 1e8e3be2 by zengxin

Merge branch 'caorunzhe' into 'zengxin'

Caorunzhe

See merge request !562
parents f74c2603 4b76ba54
@@ -23,6 +23,22 @@
\chapter{Inference in Neural Machine Translation Models}
\parinterval A key step in neural machine translation is training the neural network: optimizing the training objective on bilingual parallel data so that the model parameters are automatically adjusted to an ``optimal'' state in which new sentences can be translated. The training process is usually decisive for final model quality. Research on training methods is therefore an important direction in machine translation, and many of its findings carry over well to other natural language processing tasks.
\parinterval Training neural machine translation models poses several challenges, for example:
\begin{itemize}
\vspace{0.5em}
\item How can large-capacity models be trained effectively? For example, how can overfitting be avoided, models be made more robust, and larger vocabularies be handled efficiently?
\vspace{0.5em}
\item How can better training strategies be designed? For example, how can machine translation evaluation metrics be exploited more effectively during training, and how can the samples most valuable to translation be selected for parameter updates?
\vspace{0.5em}
\item How can the ``knowledge'' learned by one model be transferred to another? For example, how can the ability of a ``strong'' model be transferred to a ``weak'' model, when that ability may be unobtainable by training the ``weak'' model directly?
\vspace{0.5em}
\end{itemize}
\parinterval This chapter takes up these questions, covering open vocabularies, regularization, adversarial training, minimum risk training, knowledge distillation, and other topics. Note that the training of neural machine translation models is a very broad subject, and training problems are often tightly coupled with modeling problems; this chapter therefore focuses on training problems that are relatively independent. Later chapters return to model training for specific topics in machine translation, such as training extremely deep neural networks and unsupervised training.
%----------------------------------------------------------------------------------------
% NEW SECTION
%----------------------------------------------------------------------------------------
@@ -52,9 +68,11 @@
\parinterval To cover more translation phenomena, the vocabulary keeps growing, which brings two problems:
\begin{itemize}
\vspace{0.5em}
\item Data sparsity. The vocabulary contains many rare, low-frequency words, and the distributed representations of these words are hard to learn well;
\vspace{0.5em}
\item A larger word embedding matrix. This increases both the computational and the storage burden.
\vspace{0.5em}
\end{itemize}
\parinterval Ideally, machine translation should be an {\small\bfnew{open-vocabulary}}\index{Open Vocabulary} translation task: whatever words appear in the test data, the machine translation system should be able to translate them. In reality, however, no amount of vocabulary expansion can cover every possible word, and out-of-vocabulary (OOV) words appear. The problem becomes more severe when a restricted vocabulary is used, since both low-frequency words and unseen words are then treated as OOV and replaced by <UNK>. The number of <UNK> tokens in the data directly affects translation quality: too many of them cause under-translation, broken sentence structure, and similar problems. Neural machine translation therefore needs additional mechanismsardware to handle large vocabularies and OOV words.
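\parinterval The effect of a restricted vocabulary can be made concrete with a small sketch (the Python helper below is purely illustrative, not part of any toolkit): every token outside the chosen vocabulary is mapped to <UNK> before training or translation.
\begin{verbatim}
def apply_vocab(tokens, vocab, unk="<UNK>"):
    """Replace every token outside the restricted vocabulary with <UNK>."""
    return [tok if tok in vocab else unk for tok in tokens]

# Low-frequency and unseen words all collapse onto the same symbol:
print(apply_vocab(["we", "saw", "hypoglycemia"], {"we", "saw"}))
# -> ['we', 'saw', '<UNK>']
\end{verbatim}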
@@ -128,11 +146,13 @@
\parinterval Figure~\ref{fig:7-9} shows an example run of the BPE algorithm. When performing merge operations, several cases must be distinguished. Suppose the vocabulary already contains the subwords ``ab'' and ``cd'', and the subword ``abcd'' is to be added. The following cases may arise:
\begin{itemize}
\vspace{0.5em}
\item If ``ab'', ``cd'', and ``abcd'' are completely independent and their occurrences do not affect one another, ``abcd'' is added to the vocabulary and the vocabulary size increases by $1$;
\vspace{0.5em}
\item If ``ab'' and ``cd'' always occur together, ``abcd'' is added while ``ab'' and ``cd'' are removed, and the vocabulary size decreases by $1$. This operation reduces redundancy in the vocabulary;
\vspace{0.5em}
\item If ``ab'' is always followed by ``cd'' but ``cd'' can also occur as an independent subword, then ``abcd'' is added and ``ab'' is removed (and vice versa), leaving the vocabulary size unchanged.
\vspace{0.5em}
\end{itemize}
\parinterval Once the subword vocabulary has been obtained, words need to be segmented. BPE replaces longer subwords first: the subword vocabulary is sorted by character length in descending order, and then, for each word, the vocabulary is traversed and every subword that is a substring of the word is used to split it. After all matched substrings have been replaced by subwords, any remaining substring is replaced by <UNK>, as shown in Figure~\ref{fig:7-10}; a small sketch of this procedure follows below.
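\parinterval The following Python sketch mirrors the two steps just described, under simplifying assumptions (whitespace-pretokenized words with frequencies, no end-of-word markers; all names are illustrative): merge rules are learned by repeatedly merging the most frequent adjacent symbol pair, and a word is then segmented greedily, longest subwords first, with unmatched characters falling back to <UNK>.
\begin{verbatim}
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn a subword vocabulary: start from characters and repeatedly
    merge the most frequent adjacent symbol pair."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}  # words as symbol tuples
    subwords = {ch for w in word_freqs for ch in w}       # initial symbols
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:                                     # nothing left to merge
            break
        a, b = max(pairs, key=pairs.get)                  # most frequent pair
        subwords.add(a + b)
        new_vocab = {}
        for symbols, freq in vocab.items():               # apply the merge
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return subwords

def segment(word, subwords, unk="<UNK>"):
    """Greedy segmentation: try longer subwords first; leftovers become <UNK>."""
    by_length = sorted(subwords, key=len, reverse=True)
    pieces, i = [], 0
    while i < len(word):
        match = next((s for s in by_length if word.startswith(s, i)), None)
        pieces.append(match if match else unk)
        i += len(match) if match else 1
    return pieces

bpe_vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10)
print(segment("lowest", bpe_vocab))   # e.g. ['low', 'est']
\end{verbatim}
\parinterval Note that common BPE implementations also use end-of-word markers and apply the learned merge operations in the order they were learned; the longest-subword-first matching above follows the description in this section.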
@@ -177,7 +177,7 @@ a &=& \omega_{\textrm{low}}\cdot |\seq{x}| \label{eq:14-3}\\
b &=& \omega_{\textrm{high}}\cdot |\seq{x}| \label{eq:14-4}
\end{eqnarray}
\vspace{0.5em}
-\noindent where $\omega_{\textrm{low}}$ and $\omega_{\textrm{high}}$ are parameters specifying the lower and upper bounds on the translation length; for example, many systems set $\omega_{\textrm{low}}=1/2$ and $\omega_{\textrm{high}}=2$, meaning the translation is at least half and at most twice as long as the source sentence. The settings of $\omega_{\textrm{low}}$ and $\omega_{\textrm{high}}$ strongly affect inference efficiency. $\omega_{\textrm{high}}$ can be viewed as a termination condition for inference: in the ideal case, $\omega_{\textrm{high}} \cdot |\seq{x}|$ happens to equal the length of the best translation and no computation is wasted; in the opposite case, $\omega_{\textrm{high}} \cdot |\seq{x}|$ is far larger than the length of the best translation and much of the computation is useless. Finding a balance between the precision and recall of length prediction generally requires extensive experiments to finally determine $\omega_{\textrm{low}}$ and $\omega_{\textrm{high}}$. Naturally, predicting $\omega_{\textrm{low}}$ and $\omega_{\textrm{high}}$ with statistical models is also a direction well worth exploring, e.g., productivity-based models\upcite{Gu2017NonAutoregressiveNM,Feng2016ImprovingAM}.
+\noindent where $\omega_{\textrm{low}}$ and $\omega_{\textrm{high}}$ are parameters specifying the lower and upper bounds on the translation length; for example, many systems set $\omega_{\textrm{low}}=1/2$ and $\omega_{\textrm{high}}=2$, meaning the translation is at least half and at most twice as long as the source sentence. The settings of $\omega_{\textrm{low}}$ and $\omega_{\textrm{high}}$ strongly affect inference efficiency. $\omega_{\textrm{high}}$ can be viewed as a termination condition for inference: in the ideal case, $\omega_{\textrm{high}} \cdot |\seq{x}|$ happens to equal the length of the best translation and no computation is wasted; in the opposite case, $\omega_{\textrm{high}} \cdot |\seq{x}|$ is far larger than the length of the best translation and much of the computation is useless. Finding a balance between the precision and recall of length prediction generally requires extensive experiments to finally determine $\omega_{\textrm{low}}$ and $\omega_{\textrm{high}}$. Naturally, predicting $\omega_{\textrm{low}}$ and $\omega_{\textrm{high}}$ with statistical models is also a direction well worth exploring, e.g., fertility-based models\upcite{Gu2017NonAutoregressiveNM,Feng2016ImprovingAM}.
\vspace{0.5em}
\item Coverage models. Overly long or overly short translations essentially correspond to the problems of {\small\sffamily\bfseries{over-translation}}\index{Over Translation} and {\small\sffamily\bfseries{under-translation}}\index{Under Translation}\upcite{Yang2018OtemUtemOA}. The main reason these problems arise is that neural machine translation does not model over- and under-translation, i.e., the machine translation coverage problem\upcite{TuModeling}. The most common remedy is to introduce a model that measures coverage during inference, e.g., the GNMT coverage model\upcite{Wu2016GooglesNM}, in which the translation model score is defined as:
\begin{eqnarray}
@@ -485,11 +485,11 @@ b &=& \omega_{\textrm{high}}\cdot |\seq{x}| \label{eq:14-4}
\parinterval In addition, each decoder layer contains an extra positional attention module, which uses the same multi-head attention mechanism as the other parts of the Transformer model:
\begin{eqnarray}
-\textrm{Attention}(Q,K,V)&=&\textrm{Softmax}(\frac{QK^{T}}{\sqrt{d_k}})\cdot V
+\textrm{Attention}(\mathbi{Q},\mathbi{K},\mathbi{V})&=&\textrm{Softmax}(\frac{\mathbi{Q}{\mathbi{K}}^{T}}{\sqrt{d_k}})\cdot \mathbi{V}
\label{eq:14-10}
\end{eqnarray}
-\noindent where $d_k$ is the hidden size of the model; the positional encodings serve as $Q$ and $K$, and the output of the previous decoder layer serves as $V$. Injecting position information directly into the attention process provides a stronger positional signal than separate positional embeddings alone, and this extra information may also improve the decoder's ability to perform local reordering.
+\noindent where $d_k$ is the hidden size of the model; the positional encodings serve as $\mathbi{Q}$ and $\mathbi{K}$, and the output of the previous decoder layer serves as $\mathbi{V}$. Injecting position information directly into the attention process provides a stronger positional signal than separate positional embeddings alone, and this extra information may also improve the decoder's ability to perform local reordering.
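\parinterval As a sketch, this positional attention can be written as follows (single head, without the linear projections of the full multi-head module; an illustration under those simplifications rather than a complete implementation):
\begin{verbatim}
import numpy as np

def positional_attention(pos_enc, prev_layer_out, d_k):
    """Scaled dot-product attention in which the positional encodings act
    as queries and keys, and the previous decoder layer's output acts as
    values. Shapes: pos_enc [seq_len, d_k], prev_layer_out [seq_len, d]."""
    Q, K, V = pos_enc, pos_enc, prev_layer_out
    scores = Q @ K.T / np.sqrt(d_k)                # [seq_len, seq_len]
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over the keys
    return weights @ V                             # [seq_len, d]
\end{verbatim}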
%----------------------------------------------------------------------------------------
% NEW SUBSUB-SECTION
\begin{tikzpicture}
-\tikzstyle{bignode} = [line width=0.6pt,draw=black,minimum width=6.3em,minimum height=2.2em,fill=white]
-\tikzstyle{middlenode} = [line width=0.6pt,draw=black,minimum width=5.6em,minimum height=2.2em,fill=white]
+\tikzstyle{bignode} = [inner sep=0.3em,draw=black,line width=0.6pt,rounded corners=2pt,minimum width=3.0em]
-\node [anchor=center] (node1-1) at (0,0) {\scriptsize{Chinese}};
-\node [anchor=west] (node1-2) at ([xshift=0.8em]node1-1.east) {\scriptsize{English}};
-\node [anchor=north] (node1-3) at ([xshift=1.45em]node1-1.south) {\scriptsize{Backward translation model}};
-\draw [->,line width=0.6pt](node1-1.east)--(node1-2.west);
+\node [anchor=center] (node1-1) at (0,0) {{Chinese}};
+\node [anchor=west] (node1-2) at ([xshift=0.8em]node1-1.east) {{English}};
+\node [anchor=north] (node1-3) at ([xshift=1.75em]node1-1.south) {{Backward translation model}};
+\draw [->,thick](node1-1.east)--(node1-2.west);
\begin{pgfonlayer}{background}
{
-\node[fill=blue!20,inner sep=0.1em,draw=black,line width=0.6pt,minimum width=6.0em,drop shadow,rounded corners=2pt] [fit =(node1-1)(node1-2)(node1-3)] (remark1) {};
+\node[fill=blue!20,inner sep=0.3em,draw=black,line width=0.6pt,minimum width=6.0em,drop shadow,rounded corners=2pt] [fit =(node1-1)(node1-2)(node1-3)] (remark1) {};
}
\end{pgfonlayer}
-\node [anchor=north,fill=green!20,inner sep=0.1em,minimum width=3em,draw=black,line width=0.6pt,rounded corners=2pt](node2-1) at ([xshift=-1.5em,yshift=-1.95em]remark1.south){\scriptsize{Chinese}};
-\node [anchor=west,fill=green!20,inner sep=0.1em,minimum width=3em,draw=black,line width=0.6pt,rounded corners=2pt](node2-2) at (node2-1.east){\scriptsize{English}};
-\draw [->,line width=0.6pt]([yshift=-2.0em]remark1.south)--(remark1.south) node [pos=0.5,right] (pos1) {\scriptsize{Train}};
+\node [anchor=north,fill=green!20,bignode](node2-1) at ([yshift=-3em]node1-3.south){{Chinese}};
+\node [anchor=north,fill=green!20,bignode](node2-2) at (node2-1.south){{English}};
+\draw [->,thick](node2-1.north)--(remark1.south) node [pos=0.5,right] (pos1) {{Train}};
-\node [anchor=west,fill=yellow!20,inner sep=0.1em,minimum width=3em,draw=black,line width=0.6pt,rounded corners=2pt](node3-1) at ([xshift=5.0em,yshift=0.0em]node1-2.east){\scriptsize{Chinese}};
-\node [anchor=north,fill=red!20,inner sep=0.1em,minimum width=3em,draw=black,line width=0.6pt,rounded corners=2pt](node3-2) at ([yshift=-2.15em]node3-1.south){\scriptsize{English}};
+\node [anchor=west,fill=yellow!20,bignode](node3-1) at ([xshift=6.5em,yshift=0.0em]node1-2.east){{Chinese}};
+\node [anchor=north,fill=red!20,bignode](node3-2) at ([yshift=-2.5em]node3-1.south){{English}};
+\node [anchor=center](node3-3) at ([xshift=0.4em]node3-2.east){};
-\draw [->,line width=0.6pt](node3-1.south)--(node3-2.north) node [pos=0.5,right] (pos2) {\scriptsize{Translate}};
+\draw [->,thick](node3-1.south)--(node3-2.north) node [pos=0.5,right] (pos2) {{Translate}};
\begin{pgfonlayer}{background}
{
-\node[rounded corners=2pt,inner sep=0.3em,draw=black,line width=0.6pt,dotted] [fit =(node3-1)(node3-2)] (remark2) {};
+\node[rounded corners=2pt,inner sep=0.3em,draw=black,line width=0.6pt,dotted] [fit =(node3-1)(node3-2)(node3-3)] (remark2) {};
}
\end{pgfonlayer}
-\draw [->,line width=0.6pt](remark1.east)--([yshift=0.85em]remark2.west) node [pos=0.5,above] (pos2) {\scriptsize{backward model}};
-\node [anchor=south](pos2-2) at ([yshift=-0.5em]pos2.north){\scriptsize{Translate with the}};
+\draw [->,thick](remark1.east)--([xshift=5.5em]remark1.east) node [pos=0.5,above] (pos2) {{backward model}};
+\node [anchor=south](pos2-2) at ([yshift=-0.5em]pos2.north){{Translate with the}};
-\draw[decorate,thick,decoration={brace,amplitude=5pt}] ([yshift=1.3em,xshift=1.0em]node3-1.east) -- ([yshift=-5.2em,xshift=1.0em]node3-1.east) node [pos=0.1,right,xshift=0.0em,yshift=0.0em] (label1) {\scriptsize{{Mix}}};
+\draw[decorate,thick,decoration={brace,amplitude=5pt}] ([yshift=1.5em,xshift=1.5em]node3-1.east) -- ([yshift=-8.6em,xshift=1.5em]node3-1.east) node [pos=0.1,right,xshift=0.0em,yshift=0.0em] (label1) {{{Mix}}};
-\node [anchor=west,fill=red!20,inner sep=0.1em,minimum width=3em,draw=black,line width=0.6pt,rounded corners=2pt](node4-1) at ([xshift=2.0em,yshift=1.6em]node3-2.east){\scriptsize{English}};
-\node [anchor=north,fill=green!20,inner sep=0.1em,minimum width=3em,draw=black,line width=0.6pt,rounded corners=2pt](node4-2) at (node4-1.south){\scriptsize{}};
-\node [anchor=west,fill=yellow!20,inner sep=0.1em,minimum width=3em,draw=black,line width=0.6pt,rounded corners=2pt](node4-3) at (node4-1.east){\scriptsize{}};
-\node [anchor=north,fill=green!20,inner sep=0.1em,minimum width=3em,draw=black,line width=0.6pt,rounded corners=2pt](node4-4) at (node4-3.south){\scriptsize{Chinese}};
+\node [anchor=west,fill=red!20,bignode](node4-1) at ([xshift=2.5em,yshift=1.3em]node3-2.east){{English}};
+\node [anchor=north,fill=yellow!20,bignode](node4-2) at (node4-1.south){{}};
+\node [anchor=west,fill=green!20,bignode](node4-3) at (node4-1.east){{}};
+\node [anchor=north,fill=green!20,bignode](node4-4) at (node4-3.south){{Chinese}};
-\node [anchor=center] (node5-1) at ([xshift=3.4em,yshift=0.25em]node4-3.east) {\scriptsize{English}};
-\node [anchor=west] (node5-2) at ([xshift=0.8em]node5-1.east) {\scriptsize{Chinese}};
-\node [anchor=north] (node5-3) at ([xshift=1.65em]node5-1.south) {\scriptsize{Forward translation model}};
-\draw [->,line width=0.6pt](node5-1.east)--(node5-2.west);
+\node [anchor=center] (node5-1) at ([xshift=5em,yshift=0.02em]node4-3.east) {{English}};
+\node [anchor=west] (node5-2) at ([xshift=0.8em]node5-1.east) {{Chinese}};
+\node [anchor=north] (node5-3) at ([xshift=1.65em]node5-1.south) {{Forward translation model}};
+\draw [->,thick](node5-1.east)--(node5-2.west);
\begin{pgfonlayer}{background}
{
-\node[fill=blue!20,inner sep=0.1em,draw=black,line width=0.6pt,minimum width=6.0em,drop shadow,rounded corners=2pt] [fit =(node5-1)(node5-2)(node5-3)] (remark3) {};
+\node[fill=blue!20,inner sep=0.3em,draw=black,line width=0.6pt,minimum width=6.0em,drop shadow,rounded corners=2pt] [fit =(node5-1)(node5-2)(node5-3)] (remark3) {};
}
\end{pgfonlayer}
-\draw [->,line width=0.6pt]([xshift=-2em]remark3.west)--(remark3.west) node [pos=0.5,above] (pos3) {\scriptsize{Train}};
-\node [anchor=south](d1) at ([xshift=0.0em,yshift=2em]remark3.north){\scriptsize{Real data:}};
-\node [anchor=north](d2) at ([xshift=0.35em]d1.south){\scriptsize{Pseudo data:}};
-\node [anchor=south](d3) at ([xshift=0.0em,yshift=0em]d1.north){\scriptsize{Additional data:}};
-\node [anchor=west,fill=green!20,minimum width=1em](d1-1) at ([xshift=-0.0em]d1.east){};
-\node [anchor=west,fill=red!20,minimum width=1em](d2-1) at ([xshift=-0.0em]d2.east){};
-\node [anchor=west,fill=yellow!20,minimum width=1em](d3-1) at ([xshift=-0.0em]d3.east){};
+\draw [->,thick]([xshift=-3.2em]remark3.west)--(remark3.west) node [pos=0.5,above] (pos3) {{Train}};
+\node [anchor=south](d1) at ([xshift=-1.5em,yshift=1em]remark1.north){{Real data:}};
+\node [anchor=west](d2) at ([xshift=2.0em]d1.east){{Pseudo data:}};
+\node [anchor=west](d3) at ([xshift=2.0em]d2.east){{Additional data:}};
+\node [anchor=west,fill=green!20,minimum width=1.5em](d1-1) at ([xshift=-0.0em]d1.east){};
+\node [anchor=west,fill=red!20,minimum width=1.5em](d2-1) at ([xshift=-0.0em]d2.east){};
+\node [anchor=west,fill=yellow!20,minimum width=1.5em](d3-1) at ([xshift=-0.0em]d3.east){};
\end{tikzpicture}
\ No newline at end of file
\begin{tikzpicture}
\begin{scope}
-\node [anchor=center] (node1) at (9.6,1) {\small{Train:}};
-\node [anchor=center] (node11) at (10.2,1) {};
-\node [anchor=center] (node12) at (11.4,1) {};
-\node [anchor=center] (node2) at (9.6,0.5) {\small{Inference:}};
-\node [anchor=center] (node21) at (10.2,0.5) {};
-\node [anchor=center] (node22) at (11.4,0.5) {};
-\node [anchor=west,draw=black,line width=0.6pt,minimum width=5.6em,minimum height=2.2em,fill=blue!20,rounded corners=2pt] (node1-1) at (0,0) {\footnotesize{Bilingual data}};
-\node [anchor=south,draw=black,line width=0.6pt,minimum width=4.5em,minimum height=2.2em,fill=blue!20,rounded corners=2pt] (node1-2) at ([yshift=-5em]node1-1.south) {\footnotesize{Target-language pseudo data}};
-\node [anchor=west,draw=black,line width=0.6pt,minimum width=4.5em,minimum height=2.2em,fill=red!20,rounded corners=2pt] (node2-1) at ([xshift=-7.7em,yshift=-2.5em]node1-1.west) {\footnotesize{Forward NMT system}};
-\node [anchor=west,draw=black,line width=0.6pt,minimum width=4.5em,minimum height=2.2em,fill=red!20,rounded corners=2pt] (node3-1) at ([xshift=1.5em,yshift=-2.5em]node1-1.east) {\footnotesize{Backward NMT system}};
+\tikzstyle{rec} = [inner sep=0.3em,minimum width=4em,draw=black,line width=0.6pt,rounded corners=2pt]
+\node [anchor=north,fill=green!20,rec](node1-1) at (0,0){{Chinese}};
+\node [anchor=north,fill=green!20,rec](node1-2) at (node1-1.south){{English}};
+\node [anchor=north,fill=yellow!20,rec](node2-1) at ([yshift=-5.0em]node1-1.south){{Chinese}};
+\node [anchor=north,fill=red!20,rec](node2-2) at (node2-1.south){{English}};
+\node [anchor=east] (node3-1) at ([xshift=-4.0em,yshift=-3.5em]node1-1.west) {{Forward}};
+\node [anchor=north] (node3-2) at ([yshift=0.5em]node3-1.south) {{translation model}};
+\begin{pgfonlayer}{background}
+{
+\node[fill=blue!20,inner sep=0.3em,draw=black,line width=0.6pt,minimum width=3.0em,drop shadow,rounded corners=2pt] [fit =(node3-1)(node3-2)] (remark1) {};
+}
+\end{pgfonlayer}
+\draw [->,thick]([yshift=-0.75em]node1-1.west)--(remark1.north east);
+\draw [->,thick,dashed](remark1.south east)--([yshift=-0.75em]node2-1.west);
-\node [anchor=east,draw=black,line width=0.6pt,minimum width=5.6em,minimum height=2.2em,fill=blue!20,rounded corners=2pt] (node4-1) at ([xshift=18em]node1-1) {\footnotesize{Bilingual data}};
-\node [anchor=south,draw=black,line width=0.6pt,minimum width=4.5em,minimum height=2.2em,fill=blue!20,rounded corners=2pt] (node4-2) at ([yshift=-5em]node4-1.south) {\footnotesize{Target-language pseudo data}};
+\node [anchor=west] (node4-1) at ([xshift=4.0em,yshift=-3.5em]node1-1.east) {{Backward}};
+\node [anchor=north] (node4-2) at ([yshift=0.5em]node4-1.south) {{translation model}};
+\begin{pgfonlayer}{background}
+{
+\node[fill=blue!20,inner sep=0.3em,draw=black,line width=0.6pt,minimum width=3.0em,drop shadow,rounded corners=2pt] [fit =(node4-1)(node4-2)] (remark2) {};
+}
+\end{pgfonlayer}
+\draw [->,thick]([yshift=-0.75em]node1-1.east)--(remark2.north west);
+\draw [->,thick]([yshift=-0.75em]node2-1.east)--(remark2.south west);
-\node [anchor=east,draw=black,line width=0.6pt,minimum width=4.5em,minimum height=2.2em,fill=red!20,rounded corners=2pt] (node5-1) at ([xshift=15.2em]node3-1.east) {\footnotesize{Forward NMT system}};
+\node [anchor=west,fill=green!20,rec](node5-1) at ([xshift=4.0em,yshift=3.48em]node4-1.east){{English}};
+\node [anchor=north,fill=green!20,rec](node5-2) at (node5-1.south){{Chinese}};
+\node [anchor=north,fill=yellow!20,rec](node6-1) at ([yshift=-5.0em]node5-1.south){{English}};
+\node [anchor=north,fill=red!20,rec](node6-2) at (node6-1.south){{Chinese}};
+\draw [->,thick,dashed](remark2.south east)--([yshift=-0.75em]node6-1.west);
+\node [anchor=west] (node7-1) at ([xshift=4.0em,yshift=-3.5em]node5-1.east) {{Forward}};
+\node [anchor=north] (node7-2) at ([yshift=0.5em]node7-1.south) {{translation model}};
+\begin{pgfonlayer}{background}
+{
+\node[fill=blue!20,inner sep=0.3em,draw=black,line width=0.6pt,minimum width=3.0em,drop shadow,rounded corners=2pt] [fit =(node7-1)(node7-2)] (remark3) {};
+}
+\end{pgfonlayer}
+\draw [->,thick]([yshift=-0.75em]node5-1.east)--(remark3.north west);
+\draw [->,thick]([yshift=-0.75em]node6-1.east)--(remark3.south west);
+\node [anchor=south](d1) at ([xshift=-0.7em,yshift=4em]remark1.north){{Real data:}};
+\node [anchor=west](d2) at ([xshift=2.0em]d1.east){{Pseudo data:}};
+\node [anchor=west](d3) at ([xshift=2.0em]d2.east){{Additional data:}};
+\node [anchor=west,fill=green!20,minimum width=1.5em](d1-1) at ([xshift=-0.0em]d1.east){};
+\node [anchor=west,fill=red!20,minimum width=1.5em](d2-1) at ([xshift=-0.0em]d2.east){};
+\node [anchor=west,fill=yellow!20,minimum width=1.5em](d3-1) at ([xshift=-0.0em]d3.east){};
+\node [anchor=south] (d4) at ([xshift=1em]d1.north) {{Train:}};
+\node [anchor=south] (d5) at ([xshift=0.5em]d2.north) {{Inference:}};
+\draw [->,thick] ([xshift=0em]d4.east)--([xshift=1.5em]d4.east);
+\draw [->,thick,dashed] ([xshift=0em]d5.east)--([xshift=1.5em]d5.east);
-\draw [->,line width=1pt](node1-1.west)--([xshift=3em]node2-1.north);
-\draw [->,line width=1pt](node1-1.east)--([xshift=-3em]node3-1.north);
-\draw [->,line width=1pt](node1-2.east)--([xshift=-3em]node3-1.south);
-\draw [->,line width=1pt](node11.east)--(node12.west);
-\draw [->,line width=1pt,dashed](node21.east)--(node22.west);
-\draw [->,line width=1pt,dashed]([xshift=3em]node2-1.south)--([xshift=-0.1em]node1-2.west);
-\draw [->,line width=1pt,dashed]([xshift=3em]node3-1.south)--([xshift=-0.1em]node4-2.west);
-\draw [->,line width=1pt](node4-1.east)--([xshift=-3em]node5-1.north);
-\draw [->,line width=1pt](node4-2.east)--([xshift=-3em]node5-1.south);
\end{scope}
\end{tikzpicture}
\ No newline at end of file
\definecolor{color1}{rgb}{1,0.725,0.058}
\tikzstyle{data} = [rectangle,very thick,rounded corners,minimum width=2.3cm,minimum height=0.83cm,text centered,draw=black!70,fill=color1!25]
\tikzstyle{data_shadow} = [rectangle,very thick,rounded corners,minimum width=2.3cm,minimum height=0.83cm,text centered,draw=black!70,fill=black!70]
\tikzstyle{process} = [rectangle,thick,rounded corners,minimum width=2cm,minimum height=0.7cm,text centered,draw=black!80,fill=gray!25]
\tikzstyle{state} = [rectangle,thick,rounded corners,minimum width=3cm,minimum height=0.7cm,text centered,draw=black!80,fill=gray!25]
\begin{tikzpicture}[node distance = 0,scale = 1]
\tikzstyle{every node}=[scale=1]
\node(monolingual_X_shadow)[data_shadow]{};
\node(bilingual_D_shadow)[data_shadow, right of = monolingual_X_shadow, xshift=5cm]{};
\node(monolingual_Y_shadow)[data_shadow, right of = bilingual_D_shadow, xshift=5cm]{};
\node(monolingual_X)[data,right of = monolingual_X_shadow,xshift=-0.08cm,yshift=0.08cm]{Monolingual corpus X};
\node(bilingual_D)[data, right of = monolingual_X, xshift=5cm, fill=ugreen!25]{Bilingual corpus D};
\node(monolingual_Y)[data, right of = bilingual_D, xshift=5cm, fill=blue!25]{Monolingual corpus Y};
\node(process_1_1)[process, right of = monolingual_X, xshift=2.5cm, yshift=-1.5cm]{\textbf{$M^0_{x\to y}$}};
\node(process_1_2)[process, right of = process_1_1, xshift=5cm, fill=red!25]{$M^0_{y\to x}$};
\node(process_2_1)[process, below of = process_1_1, yshift=-1.2cm]{Decoding};
\node(process_2_2)[process, below of = process_1_2, yshift=-1.2cm, fill=red!25]{Decoding};
\node(process_3_1)[state, below of = process_2_1, yshift=-1.2cm, fill=color1!25]{\{$x_i,\hat{y}^0_i$\}};
\node(process_3_2)[state, below of = process_2_2, yshift=-1.2cm, fill=blue!25]{\{$\hat{x}^0_i,{y_i}$\}};
\node(process_4_1)[process, below of = process_3_1, yshift=-1.2cm]{\textbf{$M^1_{x\to y}$}};
\node(process_4_2)[process, below of = process_3_2, yshift=-1.2cm, fill=red!25]{$M^1_{y\to x}$};
\node(process_5_1)[process, below of = process_4_1, yshift=-1.2cm]{Decoding};
\node(process_5_2)[process, below of = process_4_2, yshift=-1.2cm, fill=red!25]{Decoding};
\node(process_6_1)[state, below of = process_5_1, yshift=-1.2cm, fill=color1!25]{\{$x_i,\hat{y}^1_i$\}};
\node(process_6_2)[state, below of = process_5_2, yshift=-1.2cm, fill=blue!25]{\{$\hat{x}^1_i,{y_i}$\}};
\node(process_7_1)[process, below of = process_6_1, yshift=-1.2cm]{\textbf{$M^2_{x\to y}$}};
\node(process_7_2)[process, below of = process_6_2, yshift=-1.2cm, fill=red!25]{$M^2_{y\to x}$};
\node(ellipsis_1)[below of = monolingual_X, yshift=-9.9cm,scale=1.5]{$...$};
\node(ellipsis_2)[below of = process_7_1, yshift=-1.2cm,scale=1.5]{$...$};
\node(ellipsis_3)[below of = bilingual_D, yshift=-9.9cm,scale=1.5]{$...$};
\node(ellipsis_4)[below of = process_7_2, yshift=-1.2cm,scale=1.5]{$...$};
\node(ellipsis_5)[below of = monolingual_Y, yshift=-9.9cm,scale=1.5]{$...$};
\node(text_1)[left of = process_2_1, xshift=-4cm,scale=0.8]{Iteration 0};
\node(text_2)[left of = process_5_1, xshift=-4cm,scale=0.8]{Iteration 1};
\node(text_3)[left of = ellipsis_2, xshift=-4cm, scale=0.8]{Iteration 2};
\draw[->, very thick, color=color1!40](monolingual_X.south)--(ellipsis_1.north);
\draw[->, very thick, color=ugreen!55](bilingual_D.south)--(ellipsis_3.north);
\draw[->, very thick, color=blue!55](monolingual_Y.south)--(ellipsis_5.north);
\draw[->, very thick, color=color1!40]([xshift=-1.5cm]process_2_1.west)--(process_2_1.west);
\draw[->, very thick, color=color1!40]([xshift=-1.5cm]process_5_1.west)--(process_5_1.west);
\draw[->, very thick, color=blue!55]([xshift=1.5cm]process_2_2.east)--(process_2_2.east);
\draw[->, very thick, color=blue!55]([xshift=1.5cm]process_5_2.east)--(process_5_2.east);
\draw[->, thick](process_1_1.south)--(process_2_1.north);
\draw[->, thick](process_1_2.south)--(process_2_2.north);
\draw[->, thick](process_2_1.south)--(process_3_1.north);
\draw[->, thick](process_2_2.south)--(process_3_2.north);
\draw[->, thick](process_4_1.south)--(process_5_1.north);
\draw[->, thick](process_4_2.south)--(process_5_2.north);
\draw[->, thick](process_5_1.south)--(process_6_1.north);
\draw[->, thick](process_5_2.south)--(process_6_2.north);
\draw[->, thick](process_7_1.south)--(ellipsis_2.north);
\draw[->, thick](process_7_2.south)--(ellipsis_4.north);
\draw[->, very thick, color=color1!40](process_3_1.east)--([yshift=0.35cm]process_4_2.west);
\draw[->, very thick, color=color1!40](process_3_2.west)--([yshift=0.35cm]process_4_1.east);
\draw[->, very thick, color=color1!40](process_6_1.east)--([yshift=0.35cm]process_7_2.west);
\draw[->, very thick, color=color1!40](process_6_2.west)--([yshift=0.35cm]process_7_1.east);
\draw[->, very thick, color=ugreen!55,in=0,out=270]([xshift=-0.3cm]bilingual_D.south)to(process_1_1.east);
\draw[->, very thick, color=ugreen!55,in=180,out=270]([xshift=0.3cm]bilingual_D.south)to(process_1_2.west);
\draw[->, very thick, color=ugreen!55,in=0,out=270]([yshift=-3.7cm]bilingual_D.south)to(process_4_1.east);
\draw[->, very thick, color=ugreen!55,in=180,out=270]([yshift=-3.7cm]bilingual_D.south)to(process_4_2.west);
\draw[->, very thick, color=ugreen!55,in=0,out=270]([yshift=-7.3cm]bilingual_D.south)to(process_7_1.east);
\draw[->, very thick, color=ugreen!55,in=180,out=270]([yshift=-7.3cm]bilingual_D.south)to(process_7_2.west);
\draw[-, very thick, dashed, color=blue!55]([xshift=-1cm,yshift=-0.35cm]text_1.south)--([xshift=12.7cm,yshift=-0.35cm]text_1.south);
\draw[-, very thick, dashed, color=blue!55]([xshift=-1cm,yshift=-0.35cm]text_2.south)--([xshift=12.7cm,yshift=-0.35cm]text_2.south);
\draw[-, very thick, dashed, color=blue!55]([xshift=-1cm,yshift=-0.35cm]text_3.south)--([xshift=12.7cm,yshift=-0.35cm]text_3.south);
\end{tikzpicture}
\ No newline at end of file
@@ -66,7 +66,7 @@
\begin{figure}[htp]
\centering
\input{./Chapter16/Figures/figure-example-of-iterative-back-translation}
-\caption{\red{Flow of the iterative back-translation method, not yet revised} {\color{blue} I think the logic of this figure is fine; mainly the lines and steps need to be made clearer. Come and discuss it with me!}}
+\caption{Flow of the iterative back-translation method}
\label{fig:16-2-xc}
\end{figure}
%----------------------------------------------
@@ -288,7 +288,7 @@ whether $\funp{P}(\seq{y}|\seq{x})$ and $\funp{P}(\seq{x}|\seq{y})$ are really unrelated
%----------------------------------------------
\begin{figure}[h]
\centering
-\includegraphics[scale=0.7]{Chapter16/Figures/figure-the-iterative-process-of-bidirectional-training.png}
+\input{Chapter16/Figures/figure-the-iterative-process-of-bidirectional-training}
\caption{The iterative process of bidirectional training}
\label{fig:16-1-fk}
\end{figure}
@@ -315,7 +315,7 @@ whether $\funp{P}(\seq{y}|\seq{x})$ and $\funp{P}(\seq{x}|\seq{y})$ are really unrelated
\label{eq:16-7-xc}
\end{eqnarray}
-\parinterval Equation~\ref{eq:16-7-xc} naturally links the two translation models $\funp{P}(\seq{y}|\seq{x})$ and $\funp{P}(\seq{x}|\seq{y})$ with the two language models $\funp{P}(\seq{x})$ and $\funp{P}(\seq{y})$: $\funp{P}(\seq{x})\funp{P}(\seq{y}|\seq{x})$ should be close to $\funp{P}(\seq{y})\funp{P}(\seq{x}|\seq{y})$, since both express the same joint distribution $\funp{P}(\seq{x},\seq{y})$. Therefore, when constructing the training objectives for the two translation directions, besides the maximum likelihood objectives each model uses when trained separately, an extra term can be added to encourage agreement between the two directions:
+\parinterval Equation~\eqref{eq:16-7-xc} naturally links the two translation models $\funp{P}(\seq{y}|\seq{x})$ and $\funp{P}(\seq{x}|\seq{y})$ with the two language models $\funp{P}(\seq{x})$ and $\funp{P}(\seq{y})$: $\funp{P}(\seq{x})\funp{P}(\seq{y}|\seq{x})$ should be close to $\funp{P}(\seq{y})\funp{P}(\seq{x}|\seq{y})$, since both express the same joint distribution $\funp{P}(\seq{x},\seq{y})$. Therefore, when constructing the training objectives for the two translation directions, besides the maximum likelihood objectives each model uses when trained separately, an extra term can be added to encourage agreement between the two directions:
\begin{eqnarray}
{L}_{\rm{dual}} & = & {(\log{\funp{P}(\seq{x})} + \log{\funp{P}(\seq{y}|\seq{x})} - \log{\funp{P}(\seq{y})} - \log{\funp{P}(\seq{x}|\seq{y})})}^{2}
\label{eq:16-8-xc}
@@ -323,7 +323,7 @@ whether $\funp{P}(\seq{y}|\seq{x})$ and $\funp{P}(\seq{x}|\seq{y})$ are really unrelated
\parinterval With this regularization term, the two mutually dual tasks are learned together, and the duality between the tasks strengthens the supervised learning process; this is supervised dual learning\upcite{DBLP:conf/icml/XiaQCBYL17,qin2020dual}. Here, the two language models $\funp{P}(\seq{x})$ and $\funp{P}(\seq{y})$ are pre-trained and do not participate in the training of the translation models. Note that, for a single model, the objective function now contains an extra term tied to the model of the other direction. This form is very similar to L1/L2 regularization (see {\chapternine}), so the method can be seen as a task-specific regularizer, inspired by the nature of the translation task itself. Supervised dual learning actually optimizes the following loss function:
\begin{eqnarray}
-{L} & = & \log{\funp{P}(\seq{y}|\seq{x})}+\log{\funp{P}(\seq{x}|\seq{y})}+\mathcal{L}_{\rm{dual}}
+{L} & = & \log{\funp{P}(\seq{y}|\seq{x})}+\log{\funp{P}(\seq{x}|\seq{y})}+{L}_{\rm{dual}}
\label{eq:16-2-fk}
\end{eqnarray}
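\parinterval Both terms can be written down directly. In the sketch below the four scores are treated as plain sentence-level log-probabilities; in a real system they would be tensors produced by the two translation models and the two fixed language models, and the regularizer usually carries a tunable weight:
\begin{verbatim}
def dual_regularizer(log_px, log_py_given_x, log_py, log_px_given_y):
    """L_dual: the two factorizations of log P(x,y) should agree, so the
    squared difference between them is penalized."""
    return (log_px + log_py_given_x - log_py - log_px_given_y) ** 2

def supervised_dual_objective(log_px, log_py_given_x, log_py, log_px_given_y):
    """L: the likelihoods of both translation directions plus the dual
    regularization term, following the formulas above. log P(x) and
    log P(y) come from fixed, pre-trained language models; the sign and
    weight given to the regularizer depend on whether an implementation
    minimizes or maximizes the overall objective."""
    return (log_py_given_x + log_px_given_y
            + dual_regularizer(log_px, log_py_given_x, log_py, log_px_given_y))
\end{verbatim}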
@@ -346,7 +346,7 @@ whether $\funp{P}(\seq{y}|\seq{x})$ and $\funp{P}(\seq{x}|\seq{y})$ are really unrelated
\label{eq:16-9-xc}
\end{eqnarray}
-\parinterval Equation~\ref{eq:16-9-xc} assumes that $\funp{P}(\seq{x}|\seq{y})=\funp{P}(\seq{x}|\seq{x},\seq{y})$. This assumption clearly holds, because when the translation of a sentence is known, the source sentence is not needed in order to translate it back. If the right-hand side of Equation~\ref{eq:16-9-xc} is optimized (maximized) directly, this amounts to imposing a {\small\sffamily\bfnew{circle consistency}}\index{Circle Consistency} constraint\upcite{DBLP:conf/iccv/ZhuPIE17} on $\funp{P}(\seq{x}|\seq{y})$ and $\funp{P}(\seq{y}|\seq{x})$ in this equation: for a sentence $\seq{x}$, after translating it into $\seq{y}$ with $\funp{P}(\seq{y}|\seq{x})$, it should be possible to translate back to $\seq{x}$ with $\funp{P}(\seq{x}|\seq{y})$, as shown in Figure~\ref{fig:16-10-xc}. Equation~\ref{eq:16-9-xc} gives one form of objective that optimizes $\funp{P}(\seq{x}|\seq{y})$ and $\funp{P}(\seq{y}|\seq{x})$ at the same time. An extra benefit of this objective is that it essentially learns a language model $\funp{P}(\seq{x})$ composed of $\funp{P}(\seq{x}|\seq{y})$ and $\funp{P}(\seq{y}|\seq{x})$, and since learning $\funp{P}(\seq{x})$ relies only on monolingual data, the objective can naturally use large amounts of monolingual data to train both translation models at the same time. The same conclusion carries over to $\funp{P}(\seq{y})$\upcite{DBLP:conf/nips/HeXQWYLM16}.
+\parinterval Equation~\eqref{eq:16-9-xc} assumes that $\funp{P}(\seq{x}|\seq{y})=\funp{P}(\seq{x}|\seq{x},\seq{y})$. This assumption clearly holds, because when the translation of a sentence is known, the source sentence is not needed in order to translate it back. If the right-hand side of Equation~\eqref{eq:16-9-xc} is optimized (maximized) directly, this amounts to imposing a {\small\sffamily\bfnew{circle consistency}}\index{Circle Consistency} constraint\upcite{DBLP:conf/iccv/ZhuPIE17} on $\funp{P}(\seq{x}|\seq{y})$ and $\funp{P}(\seq{y}|\seq{x})$ in this equation: for a sentence $\seq{x}$, after translating it into $\seq{y}$ with $\funp{P}(\seq{y}|\seq{x})$, it should be possible to translate back to $\seq{x}$ with $\funp{P}(\seq{x}|\seq{y})$, as shown in Figure~\ref{fig:16-10-xc}. Equation~\eqref{eq:16-9-xc} gives one form of objective that optimizes $\funp{P}(\seq{x}|\seq{y})$ and $\funp{P}(\seq{y}|\seq{x})$ at the same time. An extra benefit of this objective is that it essentially learns a language model $\funp{P}(\seq{x})$ composed of $\funp{P}(\seq{x}|\seq{y})$ and $\funp{P}(\seq{y}|\seq{x})$, and since learning $\funp{P}(\seq{x})$ relies only on monolingual data, the objective can naturally use large amounts of monolingual data to train both translation models at the same time. The same conclusion carries over to $\funp{P}(\seq{y})$\upcite{DBLP:conf/nips/HeXQWYLM16}.
%----------------------------------------------
\begin{figure}[htp]
@@ -357,14 +357,14 @@ whether $\funp{P}(\seq{y}|\seq{x})$ and $\funp{P}(\seq{x}|\seq{y})$ are really unrelated
\end{figure}
%----------------------------------------------
-\parinterval However, directly using Equation~\ref{eq:16-9-xc} as the objective requires solving two problems:
+\parinterval However, directly using Equation~\eqref{eq:16-9-xc} as the objective requires solving two problems:
\begin{itemize}
\vspace{0.5em}
-\item Computing Equation~\ref{eq:16-9-xc} requires enumerating all possible values of the latent variable $\seq{y}$, i.e., all target-language sentences that could be produced, which is impossible; the true objective value is therefore usually approximated by averaging the losses of several randomly sampled $\seq{y}$;
+\item Computing Equation~\eqref{eq:16-9-xc} requires enumerating all possible values of the latent variable $\seq{y}$, i.e., all target-language sentences that could be produced, which is impossible; the true objective value is therefore usually approximated by averaging the losses of several randomly sampled $\seq{y}$;
\vspace{0.5em}
-\item As can be seen from Equation~\ref{eq:16-9-xc}, after the objective is evaluated on $\funp{P}(\seq{x})$, the resulting gradient is first passed to $\funp{P}(\seq{x}|\seq{y})$ and then, through $\funp{P}(\seq{x}|\seq{y})$, to $\funp{P}(\seq{y}|\seq{x})$. Since the input $\seq{y}$ of $\funp{P}(\seq{x}|\seq{y})$ is sampled from $\funp{P}(\seq{y}|\seq{x})$ and the sampling operation is not differentiable, gradient propagation is cut off at the output of $\funp{P}(\seq{y}|\seq{x})$, so $\funp{P}(\seq{y}|\seq{x})$ receives no gradient with which to update itself. The common solution is the policy gradient method\upcite{DBLP:conf/nips/SuttonMSM99}. Its basic idea is as follows: if a good reward is obtained after taking some action, the policy is adjusted to increase the probability of taking that action in that state; conversely, if a negative reward follows an action, the policy is adjusted to decrease that probability. In implementation, gradients are first computed for the two translation models, and during policy adjustment the gradient is either added to the model (positive reward) or subtracted from it (negative reward).
+\item As can be seen from Equation~\eqref{eq:16-9-xc}, after the objective is evaluated on $\funp{P}(\seq{x})$, the resulting gradient is first passed to $\funp{P}(\seq{x}|\seq{y})$ and then, through $\funp{P}(\seq{x}|\seq{y})$, to $\funp{P}(\seq{y}|\seq{x})$. Since the input $\seq{y}$ of $\funp{P}(\seq{x}|\seq{y})$ is sampled from $\funp{P}(\seq{y}|\seq{x})$ and the sampling operation is not differentiable, gradient propagation is cut off at the output of $\funp{P}(\seq{y}|\seq{x})$, so $\funp{P}(\seq{y}|\seq{x})$ receives no gradient with which to update itself. The common solution is the policy gradient method\upcite{DBLP:conf/nips/SuttonMSM99}. Its basic idea is as follows: if a good reward is obtained after taking some action, the policy is adjusted to increase the probability of taking that action in that state; conversely, if a negative reward follows an action, the policy is adjusted to decrease that probability. In implementation, gradients are first computed for the two translation models, and during policy adjustment the gradient is either added to the model (positive reward) or subtracted from it (negative reward); a sketch of such an update is given after this list.
\vspace{0.5em}
\end{itemize}
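\parinterval A minimal sketch of such a sampling-plus-policy-gradient update follows (PyTorch-style; the model interfaces sample() and log\_prob() are assumptions made for illustration, not a real library API):
\begin{verbatim}
import torch

def dual_unsupervised_step(p_y_given_x, p_x_given_y, x, optimizer, n_samples=4):
    """One REINFORCE-style update on a monolingual sentence x: P(y|x)
    proposes translations by sampling (a non-differentiable step), and the
    reconstruction score log P(x|y) serves as the reward."""
    loss = 0.0
    for _ in range(n_samples):
        y = p_y_given_x.sample(x)                  # y ~ P(y|x), no gradient path
        reward = p_x_given_y.log_prob(x, given=y)  # log P(x|y): recover x from y?
        # Score-function (policy-gradient) surrogate for P(y|x), plus the
        # directly differentiable reconstruction term for P(x|y):
        loss = loss - reward.detach() * p_y_given_x.log_prob(y, given=x) - reward
    optimizer.zero_grad()
    (loss / n_samples).backward()                  # averaged over the samples
    optimizer.step()
\end{verbatim}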
@@ -8821,6 +8821,252 @@ author = {Zhuang Liu and
year = {2020}
}
@article{DBLP:journals/corr/abs-1806-01261,
author = {Peter W. Battaglia and
Jessica B. Hamrick and
Victor Bapst and
Alvaro Sanchez-Gonzalez and
Vin{\'{\i}}cius Flores Zambaldi and
Mateusz Malinowski and
Andrea Tacchetti and
David Raposo and
Adam Santoro and
Ryan Faulkner and
{\c{C}}aglar G{\"{u}}l{\c{c}}ehre and
H. Francis Song and
Andrew J. Ballard and
Justin Gilmer and
George E. Dahl and
Ashish Vaswani and
Kelsey R. Allen and
Charles Nash and
Victoria Langston and
Chris Dyer and
Nicolas Heess and
Daan Wierstra and
Pushmeet Kohli and
Matthew Botvinick and
Oriol Vinyals and
Yujia Li and
Razvan Pascanu},
title = {Relational inductive biases, deep learning, and graph networks},
journal = {CoRR},
volume = {abs/1806.01261},
year = {2018}
}
@inproceedings{Shaw2018SelfAttentionWR,
author = {Peter Shaw and
Jakob Uszkoreit and
Ashish Vaswani},
title = {Self-Attention with Relative Position Representations},
publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
pages = {464--468},
year = {2018},
}
@inproceedings{Dai2019TransformerXLAL,
author = {Zihang Dai and
Zhilin Yang and
Yiming Yang and
Jaime G. Carbonell and
Quoc V. Le and
Ruslan Salakhutdinov},
title = {Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context},
publisher = {Annual Meeting of the Association for Computational Linguistics},
pages = {2978--2988},
year = {2019}
}
@inproceedings{vaswani2017attention,
title={Attention is All You Need},
author={Ashish {Vaswani} and Noam {Shazeer} and Niki {Parmar} and Jakob {Uszkoreit} and Llion {Jones} and Aidan N. {Gomez} and Lukasz {Kaiser} and Illia {Polosukhin}},
publisher={Conference on Neural Information Processing Systems},
pages={5998--6008},
year={2017}
}
@inproceedings{DBLP:conf/acl/LiXTZZZ17,
author = {Junhui Li and
Deyi Xiong and
Zhaopeng Tu and
Muhua Zhu and
Min Zhang and
Guodong Zhou},
title = {Modeling Source Syntax for Neural Machine Translation},
pages = {688--697},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:conf/acl/EriguchiHT16,
author = {Akiko Eriguchi and
Kazuma Hashimoto and
Yoshimasa Tsuruoka},
title = {Tree-to-Sequence Attentional Neural Machine Translation},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{Yang2017TowardsBH,
author = {Baosong Yang and
Derek F. Wong and
Tong Xiao and
Lidia S. Chao and
Jingbo Zhu},
title = {Towards Bidirectional Hierarchical Representations for Attention-based
Neural Machine Translation},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {1432--1441},
year = {2017}
}
@inproceedings{DBLP:conf/acl/ChenHCC17,
author = {Huadong Chen and
Shujian Huang and
David Chiang and
Jiajun Chen},
title = {Improved Neural Machine Translation with a Syntax-Aware Encoder and
Decoder},
pages = {1936--1945},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{TuModeling,
author = {Zhaopeng Tu and
Zhengdong Lu and
Yang Liu and
Xiaohua Liu and
Hang Li},
title = {Modeling Coverage for Neural Machine Translation},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{DBLP:conf/wmt/SennrichH16,
author = {Rico Sennrich and
Barry Haddow},
title = {Linguistic Input Features Improve Neural Machine Translation},
pages = {83--91},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{DBLP:conf/emnlp/ShiPK16,
author = {Xing Shi and
Inkit Padhi and
Kevin Knight},
title = {Does String-Based Neural {MT} Learn Source Syntax?},
pages = {1526--1534},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{DBLP:conf/acl/BugliarelloO20,
author = {Emanuele Bugliarello and
Naoaki Okazaki},
title = {Enhancing Machine Translation with Dependency-Aware Self-Attention},
pages = {1618--1627},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
@inproceedings{Aharoni2017TowardsSN,
title={Towards String-To-Tree Neural Machine Translation},
author={Roee Aharoni and
Yoav Goldberg},
publisher={Annual Meeting of the Association for Computational Linguistics},
year={2017}
}
@inproceedings{DBLP:conf/iclr/Alvarez-MelisJ17,
author = {David Alvarez-Melis and
Tommi S. Jaakkola},
title = {Tree-structured decoding with doubly-recurrent neural networks},
publisher = {International Conference on Learning Representations},
year = {2017}
}
@inproceedings{DBLP:conf/naacl/DyerKBS16,
author = {Chris Dyer and
Adhiguna Kuncoro and
Miguel Ballesteros and
Noah A. Smith},
title = {Recurrent Neural Network Grammars},
pages = {199--209},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@book{aho1972theory,
author = {Aho, Alfred V and
Ullman, Jeffrey D},
title = {The theory of parsing, translation, and compiling},
publisher = {Prentice-Hall Englewood Cliffs, NJ},
year = {1973},
}
@inproceedings{DBLP:journals/corr/LuongLSVK15,
author = {Minh-Thang Luong and
Quoc V. Le and
Ilya Sutskever and
Oriol Vinyals and
Lukasz Kaiser},
title = {Multi-task Sequence to Sequence Learning},
publisher = {International Conference on Learning Representations},
year = {2016}
}
@inproceedings{DBLP:conf/wmt/NadejdeRSDJKB17,
author = {Maria Nadejde and
Siva Reddy and
Rico Sennrich and
Tomasz Dwojak and
Marcin Junczys-Dowmunt and
Philipp Koehn and
Alexandra Birch},
title = {Predicting Target Language {CCG} Supertags Improves Neural Machine
Translation},
pages = {68--79},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:conf/acl/WuZYLZ17,
author = {Shuangzhi Wu and
Dongdong Zhang and
Nan Yang and
Mu Li and
Ming Zhou},
title = {Sequence-to-Dependency Neural Machine Translation},
pages = {698--707},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:journals/corr/abs-1808-09374,
author = {Xinyi Wang and
Hieu Pham and
Pengcheng Yin and
Graham Neubig},
title = {A Tree-based Decoder for Neural Machine Translation},
publisher = {Conference on Empirical Methods in Natural Language Processing},
pages = {4772--4777},
year = {2018}
}
@inproceedings{Tong2016Syntactic,
author = {Tong Xiao and
Jingbo Zhu and
Chunliang Zhang and
Tongran Liu},
title = {Syntactic Skeleton-Based Translation},
pages = {2856--2862},
publisher = {AAAI Conference on Artificial Intelligence},
year = {2016},
}
%%%%% chapter 15------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%