Commit 1fbbf4c1 by 曹润柘

add 16.5

parent 530a9dda
......@@ -138,7 +138,7 @@
\node [anchor=south,inner sep=2pt,minimum height=1.5em,minimum width=3.0em] (c10) at (c11.north) {\scriptsize{源语言}};
\node [anchor=south,inner sep=2pt,minimum height=1.5em,minimum width=3.0em] (c30) at (c31.north) {\small{$n$=3}};
\node [anchor=south,inner sep=2pt,minimum height=1.5em,minimum width=3.0em] (c50) at (c51.north) {\small{$\mathbi{S}$}};
\node [anchor=south,inner sep=2pt,minimum height=1.5em,minimum width=3.0em] (c50) at (c51.north) {\small{$\seq{S}$}};
\node [anchor=south,inner sep=2pt] (c60) at (c61.north) {\scriptsize{进行排序}};
\node [anchor=south,inner sep=2pt] (c60-2) at (c60.north) {\scriptsize{由小到大}};
......
......@@ -31,7 +31,7 @@
% NEW SECTION
%----------------------------------------------------------------------------------------
\section{数据的有效使用}
\section{数据的有效使用}\label{effective-use-of-data}
\parinterval 数据稀缺是低资源机器翻译所面临的主要问题。因此,充分使用既有数据是一种解决问题的思路。比如,在双语数据不充分的时候,可以简单地对双语数据中的部分单词用近义词进行替换,达到丰富双语数据的目的\upcite{DBLP:conf/acl/FadaeeBM17a,DBLP:conf/emnlp/WangPDN18},也可以考虑用转述等方式生成更多的双语训练数据\upcite{DBLP:conf/emnlp/MartonCR09,DBLP:conf/eacl/LapataSM17}。
......@@ -97,7 +97,7 @@
\vspace{0.5em}
\end{itemize}
\parinterval\ref{fig:16-4-xc}展示了三种加噪方法的示例。这里,$\funp{P}_{\rm{Drop}}$$\funp{P}_{\rm{Mask}}$均设置为0.1,表示每个词有$10\%$的概率被丢弃或掩码。打乱顺序的操作略微复杂,一种实现方法是,通过一个数字来表示每个词在句子中的位置,如“我”是第一个词,“你”是第三个词,然后,在每个位置生成一个$1$$n$的随机数,$n$一般设置为3,然后将每个词的位置数和对应的随机数相加,即图中的$\mathbi{S}${\color{blue} S为啥要加粗???})。 对$\mathbi{S}$ 按照从小到大排序,根据排序后每个位置的索引从原始句子中选择对应的词,从而得到最终打乱顺序后的结果。比如,在排序后,$S_2$的值小于$S_1$,其余词则保持递增顺序,则将原始句子中的第零个词和第一个词的顺序进行交换,其他词保持不变。
\parinterval 图\ref{fig:16-4-xc}展示了三种加噪方法的示例。这里,$\funp{P}_{\rm{Drop}}$和$\funp{P}_{\rm{Mask}}$均设置为0.1,表示每个词有$10\%$的概率被丢弃或掩码。打乱顺序的操作略微复杂,一种实现方法是:先用一个数字表示每个词在句子中的位置,如“我”是第一个词,“你”是第三个词;然后在每个位置生成一个$1$到$n$的随机数,$n$一般设置为3;再将每个词的位置数和对应的随机数相加,得到图中的$\seq{S}$。对$\seq{S}$按照从小到大排序,根据排序后每个位置的索引从原始句子中选择对应的词,从而得到最终打乱顺序后的结果。比如,在排序后,$S_2$的值小于$S_1$,其余词则保持递增顺序,则将原始句子中的第一个词和第二个词的顺序进行交换,其他词保持不变。
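\parinterval 为了更直观地理解上述三种加噪方法,下面给出一段示意性的Python代码草图。其中的函数名、超参数取值均为示意,并不代表唯一或标准的实现:
\begin{verbatim}
import random

def add_noise(words, p_drop=0.1, p_mask=0.1, n=3):
    """依次对词序列进行丢弃、掩码和打乱顺序三种加噪操作(示意实现)"""
    # 丢弃:每个词以 p_drop 的概率被删除
    out = [w for w in words if random.random() > p_drop]
    # 掩码:每个词以 p_mask 的概率被替换为 [Mask]
    out = [w if random.random() > p_mask else "[Mask]" for w in out]
    # 打乱顺序:位置编号加上 1~n 的随机数得到 S,再按 S 从小到大排序
    S = [i + random.randint(1, n) for i in range(1, len(out) + 1)]
    out = [w for _, w in sorted(zip(S, out), key=lambda x: x[0])]
    return out

print(add_noise(["我", "喜欢", "你"]))
\end{verbatim}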
%----------------------------------------------
\begin{figure}[htp]
......@@ -170,7 +170,7 @@
\parinterval 融合目标语言端的语言模型是一种最直接的使用单语数据的方法\upcite{2015OnGulcehre,DBLP:journals/csl/GulcehreFXCB17,DBLP:conf/wmt/StahlbergCS18}。实际上,神经机器翻译模型本身也具备语言模型的作用,因为其解码器本质上就是一个语言模型,用于描述生成译文词串的规律。类似于语言模型,神经机器翻译模型可以自回归地生成翻译结果。对于一个双语句对$(\mathbi{x}, \mathbi{y})$,神经机器翻译模型根据源语言句子$\mathbi{x}$和前面生成的词来预测当前位置词的概率分布:
\begin{eqnarray}
\log{P(\mathbi{y} | \mathbi{x}; \theta)} = \sum_{t}{\log{P(y_t | \mathbi{x}, {\mathbi{y}}_{<t}; \theta)}}
\log{P(\mathbi{y} | \mathbi{x}; \theta)} & = & \sum_{t}{\log{P(y_t | \mathbi{x}, {\mathbi{y}}_{<t}; \theta)}}
\label{eq:16-1-xc}
\end{eqnarray}
......@@ -187,7 +187,7 @@
\parinterval 浅融合通过对神经机器翻译模型和语言模型的预测概率进行插值来得到最终的预测概率:
\begin{eqnarray}
\log{\funp{P}(y_t | \mathbi{x}, \mathbi{y}_{<t})} = \log{\funp{P}(y_t | \mathbi{x}, \mathbi{y}_{<t}; \theta_{TM})} + \beta \log{\funp{P}(y_t | \mathbi{y}_{<t}; \theta_{LM})}
\log{\funp{P}(y_t | \mathbi{x}, \mathbi{y}_{<t})}& = & \log{\funp{P}(y_t | \mathbi{x}, \mathbi{y}_{<t}; \theta_{TM})} + \beta \log{\funp{P}(y_t | \mathbi{y}_{<t}; \theta_{LM})}
\label{eq:16-2-xc}
\end{eqnarray}
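\parinterval 以解码过程中的某一时刻为例,浅融合的计算可以用下面的Python草图来示意。这里假设翻译模型和语言模型都能给出当前位置上所有候选词的对数概率,变量名与$\beta$的取值均为示意:
\begin{verbatim}
import numpy as np

def shallow_fusion(logp_tm, logp_lm, beta=0.3):
    """浅融合:log P(y_t|x,y_<t) = log P_TM + beta * log P_LM"""
    return logp_tm + beta * logp_lm

# 示意:词表大小为4时,两个模型给出的对数概率,融合后选择得分最高的词
logp_tm = np.log(np.array([0.5, 0.2, 0.2, 0.1]))   # 翻译模型
logp_lm = np.log(np.array([0.1, 0.6, 0.2, 0.1]))   # 语言模型
best_word = int(np.argmax(shallow_fusion(logp_tm, logp_lm)))
\end{verbatim}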
......@@ -197,19 +197,19 @@
\parinterval 深融合的预测方式为:
\begin{eqnarray}
\log{\funp{P}(y_t | \mathbi{x}, \mathbi{y}_{<t})}= \log{\funp{P}(y_t | \mathbi{x}, \mathbi{y}_{<t}; s_{t})}
\log{\funp{P}(y_t | \mathbi{x}, \mathbi{y}_{<t})}& = & \log{\funp{P}(y_t | \mathbi{x}, \mathbi{y}_{<t}; s_{t})}
\label{eq:16-3-xc}
\end{eqnarray}
\noindent 其中,$s_{t}$表示当前时刻$t$的隐藏层表示。
\begin{eqnarray}
s_{t} = s_{t}^{TM} + g_{t} s_{t}^{LM}
s_{t}& = & s_{t}^{TM} + g_{t} s_{t}^{LM}
\label{eq:16-4-xc}
\end{eqnarray}
\parinterval 这里,$s_{t}^{TM}$$s_{t}^{LM}$分别表示翻译模型和语言模型在时刻$t$的隐藏层表示,$g_{t}$用来控制语言模型隐藏层表示的权重,通过下面的计算得到:
\begin{eqnarray}
g_{t} = \sigma (w^{T}s_{t}^{TM} + b)
g_{t}& = & \sigma (w^{T}s_{t}^{TM} + b)
\label{eq:16-5-xc}
\end{eqnarray}
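\parinterval 相应地,深融合中隐藏层表示的门控组合可以用如下草图示意。这里用NumPy向量代替真实的隐藏层表示,$w$、$b$等参数的取值均为示意:
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_fusion_state(s_tm, s_lm, w, b):
    """深融合:g_t = sigma(w^T s_t^TM + b),s_t = s_t^TM + g_t * s_t^LM"""
    g_t = sigmoid(np.dot(w, s_tm) + b)    # 门控,控制语言模型隐藏层表示的权重
    return s_tm + g_t * s_lm

# 示意:用随机向量代替翻译模型和语言模型在时刻 t 的隐藏层表示
rng = np.random.default_rng(0)
s_tm, s_lm = rng.normal(size=4), rng.normal(size=4)
w, b = rng.normal(size=4), 0.0
s_t = deep_fusion_state(s_tm, s_lm, w, b)
\end{verbatim}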
......@@ -314,7 +314,7 @@ g_{t} = \sigma (w^{T}s_{t}^{TM} + b)
\parinterval 回顾神经机器翻译系统的建模过程,给定一个互译的句对$(\mathbi{x},\mathbi{y})$,一个从源语言句子$\mathbi{x}$到目标语言句子$\mathbi{y}$的翻译被表示为求条件概率$\funp{P}(\mathbi{y}|\mathbi{x})$的问题。类似地,一个从目标语言句子$\mathbi{y}$到源语言句子$\mathbi{x}$的翻译可以表示为$\funp{P}(\mathbi{x}|\mathbi{y})$。通常来说,神经机器翻译的训练一次只得到一个方向的模型,也就是$\funp{P}(\mathbi{y}|\mathbi{x})$或者$\funp{P}(\mathbi{x}|\mathbi{y})$。这意味着$\funp{P}(\mathbi{y}|\mathbi{x})$和$\funp{P}(\mathbi{x}|\mathbi{y})$之间是互相独立的。$\funp{P}(\mathbi{y}|\mathbi{x})$和$\funp{P}(\mathbi{x}|\mathbi{y})$是否真的没有关系呢?比如,$\mathbi{x}$和$\mathbi{y}$是相同大小的向量,且$\mathbi{x}$到$\mathbi{y}$的变换是一个线性变换,也就是与一个方阵$\mathbi{W}$做矩阵乘法:
\begin{eqnarray}
\mathbi{y} = \mathbi{x} \cdot \mathbi{W}
\mathbi{y} & = & \mathbi{x} \cdot \mathbi{W}
\label{eq:16-6-xc}
\end{eqnarray}
......@@ -347,7 +347,7 @@ Joint training for neural machine translation models with monolingual data
\parinterval 公式\ref{eq:16-7-xc}很自然地把两个方向的翻译模型$\funp{P}(\mathbi{y}|\mathbi{x})$和$\funp{P}(\mathbi{x}|\mathbi{y})$以及两个语言模型$\funp{P}(\mathbi{x})$和$\funp{P}(\mathbi{y})$联系起来:$\funp{P}(\mathbi{x})\funp{P}(\mathbi{y}|\mathbi{x})$应该与$\funp{P}(\mathbi{y})\funp{P}(\mathbi{x}|\mathbi{y})$接近,因为它们都表达了同一个联合分布$\funp{P}(\mathbi{x},\mathbi{y})$。因此,在构建训练两个方向的翻译模型的目标函数时,除了它们单独训练时各自使用的极大似然估计目标函数,可以额外增加一个目标项来鼓励两个方向的翻译模型满足上述关系。这种方法也被看做是一种{\small\bfnew{有监督对偶学习}}\index{有监督对偶学习}(Supervised Dual Learning\index{Supervised Dual Learning}):
\begin{eqnarray}
\mathcal{L} = (\textrm{log P}(\mathbi{x}) + \textrm{log P}(\mathbi{y}|\mathbi{x}) - \textrm{log P}(\mathbi{y}) - \textrm{log P}(\mathbi{x}|\mathbi{y}))^{2}
\mathcal{L} & = & (\log{\funp{P}(\mathbi{x})} + \log{\funp{P}(\mathbi{y}|\mathbi{x})} - \log{\funp{P}(\mathbi{y})} - \log{\funp{P}(\mathbi{x}|\mathbi{y})})^{2}
\label{eq:16-8-xc}
\end{eqnarray}
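\parinterval 下面的Python片段示意了这个正则化目标项的计算方式。这里直接用常数代替语言模型和两个方向翻译模型给出的对数概率,仅用于说明损失的形式:
\begin{verbatim}
import math

def dual_learning_loss(logp_x, logp_y_given_x, logp_y, logp_x_given_y):
    """有监督对偶学习的正则项:两种分解得到的联合概率应当尽量接近"""
    return (logp_x + logp_y_given_x - logp_y - logp_x_given_y) ** 2

# 示意:用常数代替语言模型和两个方向翻译模型给出的对数概率
loss = dual_learning_loss(math.log(0.01), math.log(0.2),
                          math.log(0.02), math.log(0.1))
\end{verbatim}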
......@@ -391,7 +391,7 @@ Joint training for neural machine translation models with monolingual data
\parinterval 重新回顾公式\ref{eq:16-9-xc}对应的目标函数,无监督对偶学习跟回译(假设现在只在一个句对$(\mathbi{x},\mathbi{y})$上做回译)之间有着很深的内在联系:给定一个句子$\mathbi{x}$,无监督对偶学习和回译都首先用$\funp{P}(\mathbi{y}|\mathbi{x})$$\mathbi{x}$翻译成$\mathbi{y}$,然后无监督对偶学习最大化$\funp{P}(\mathbi{x}|\mathbi{y})\funp{P}(\mathbi{y}|\mathbi{x})$,而回译则是最大化$\funp{P}(\mathbi{x}|\mathbi{y})$。可以看到,当无监督对偶学习假设$\funp{P}(\mathbi{y}|\mathbi{x})$是一个完美的翻译模型的时候,它与回译是等价的。此外,在共享两个方向的模型参数$\theta$的情况下,可以看到无监督对偶学习的梯度为
\begin{equation}
\frac{\partial \funp{P}(\mathbi{x})}{\partial \theta} =\funp{P}(\mathbi{y}|\mathbi{x}) \frac{\partial \funp{P}(\mathbi{x}|\mathbi{y})}{\partial \theta}+\funp{P}(\mathbi{x}|\mathbi{y}) \frac{\partial \funp{P}(\mathbi{y}|\mathbi{x})}{\partial \theta}
\frac{\partial \funp{P}(\mathbi{x})}{\partial \theta} = \funp{P}(\mathbi{y}|\mathbi{x}) \frac{\partial \funp{P}(\mathbi{x}|\mathbi{y})}{\partial \theta}+\funp{P}(\mathbi{x}|\mathbi{y}) \frac{\partial \funp{P}(\mathbi{y}|\mathbi{x})}{\partial \theta}
\end{equation}
\noindent 而回译的梯度为$\frac{\partial \funp{P}(\mathbi{x}|\mathbi{y})}{\partial \theta}$。从这个角度出发,无监督对偶学习与回译都在优化语言模型$\funp{P}(\mathbi{x})$这个目标函数,只不过回译使用对$\theta$有偏的梯度估计。
......@@ -442,7 +442,7 @@ Joint training for neural machine translation models with monolingual data
\end{figure}
\begin{equation}
\funp{P}(\mathbi{y}|\mathbi{x}) =\sum_{\mathbi{p}}{\funp{P}(\mathbi{p}|\mathbi{x})\funp{P}(\mathbi{y}|\mathbi{p})}
\funp{P}(\mathbi{y}|\mathbi{x}) = \sum_{\mathbi{p}}{\funp{P}(\mathbi{p}|\mathbi{x})\funp{P}(\mathbi{y}|\mathbi{p})}
\label{eq:ll-1}
\end{equation}
......@@ -689,7 +689,7 @@ Joint training for neural machine translation models with monolingual data
\parinterval 尽管已经得到了短语的翻译,短语表的另外一个重要组成部分,也就是短语对的得分(概率),却无法由词典归纳方法直接给出,而这些得分在统计机器翻译模型中非常重要。在无监督词典归纳中,推断词典时会为一对源语言单词和目标语言单词打分(词嵌入之间的相似度),然后根据打分来决定哪一个目标语言单词更有可能是当前源语言单词的翻译。在无监督短语归纳中,这样一个打分已经提供了对短语对质量的度量,因此经过适当的归一化处理后就可以得到短语对的得分:
\begin{eqnarray}
P(t|s)=\frac{\mathrm{cos}(\mathbi{x},\mathbi{y})/\tau}{\sum_{\mathbi{y}^{'}}\mathrm{cos}(\mathbi{x},\mathbi{y}^{'})\tau}
\funp{P}(\mathbi{y}|\mathbi{x}) & = & \frac{\exp(\mathrm{cos}(\mathbi{x},\mathbi{y})/\tau)}{\sum_{\mathbi{y}^{'}}{\exp(\mathrm{cos}(\mathbi{x},\mathbi{y}^{'})/\tau)}}
\label{eq:16-2}
\end{eqnarray}
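\parinterval 短语对打分的过程可以用如下Python草图来示意。这里假设已经得到了源语言短语和各候选目标语言短语的嵌入向量,变量名与$\tau$的取值均为示意:
\begin{verbatim}
import numpy as np

def phrase_scores(src_vec, tgt_vecs, tau=0.1):
    """对候选目标语言短语打分:对余弦相似度做带温度的归一化"""
    src_vec = src_vec / np.linalg.norm(src_vec)
    tgt_vecs = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    cos = tgt_vecs @ src_vec              # 每个候选短语与源语言短语的余弦相似度
    score = np.exp(cos / tau)
    return score / score.sum()            # 归一化后即为 P(y|x)

# 示意:用随机向量代替真实的短语嵌入
rng = np.random.default_rng(0)
probs = phrase_scores(rng.normal(size=8), rng.normal(size=(5, 8)))
\end{verbatim}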
......@@ -809,7 +809,7 @@ P(t|s)=\frac{\mathrm{cos}(\mathbi{x},\mathbi{y})/\tau}{\sum_{\mathbi{y}^{'}}\mat
\vspace{0.5em}
\item {\small\bfnew{语言模型的使用}}。无监督神经机器翻译的一个重要部分就是来自语言模型的目标函数。因为翻译模型本质上是在完成文本生成任务,所以只有文本生成类型的语言模型建模方法才可以应用到无监督神经机器翻译里。比如,经典的给定前文预测下一词就是一个典型的自回归生成任务(见{\chaptertwo}),因此可以运用到无监督神经机器翻译里。但是,目前在预训练里流行的BERT等模型是掩码语言模型\upcite{devlin2019bert},就不能直接在无监督神经翻译里使用。
\parinterval 另外一个在无监督神经机器翻译中比较常见的语言模型目标函数则是{\small\bfnew{降噪自编码器}}\index{降噪自编码器}(Denoising Autoencoder\index{降噪自编码器})。它也是文本生成类型的语言模型建模方法。对于一个句子$\mathbi{x}$,首先使用一个噪声函数$\mathbi{x}^{'}=\mathrm{noise}(\mathbi{x})$ 来对$x$注入噪声,产生一个质量较差的句子$\mathbi{x}^{'}$。然后,让模型学习如何从$\mathbi{x}^{'}$还原出$\mathbi{x}$。这样一个目标函数比预测下一词更贴近翻译任务的本质,因为它是一个序列到序列的映射,并且输入输出两个序列在语义上是等价的。我们之所以采用$\mathbi{x}^{'}$而不是$\mathbi{x}$自己来预测$\mathbi{x}^{'}$,是因为模型可以通过简单的复制输入作为输出来完成从$\mathbi{x}$预测$\mathbi{x}$的任务,并且在输入中注入噪声会让模型更加鲁棒,因为模型可以通过训练集数据学会如何利用句子中噪声以外的信息来处理其中噪声并得到正确的输出。通常来说,噪声函数$\mathrm{noise}$有三种形式,如表\ref{tab:16-1}所示。
\parinterval 另外一个在无监督神经机器翻译中比较常见的语言模型目标函数则是{\small\bfnew{降噪自编码器}}\index{降噪自编码器}(Denoising Autoencoder\index{Denoising Autoencoder})。它也是文本生成类型的语言模型建模方法。对于一个句子$\mathbi{x}$,首先使用一个噪声函数$\mathbi{x}^{'}=\mathrm{noise}(\mathbi{x})$ 来对$\mathbi{x}$注入噪声,产生一个质量较差的句子$\mathbi{x}^{'}$。然后,让模型学习如何从$\mathbi{x}^{'}$还原出$\mathbi{x}$。这样一个目标函数比预测下一词更贴近翻译任务的本质,因为它是一个序列到序列的映射,并且输入输出两个序列在语义上是等价的。我们之所以采用$\mathbi{x}^{'}$而不是$\mathbi{x}$自己来预测$\mathbi{x}$,是因为模型可以通过简单的复制输入作为输出来完成从$\mathbi{x}$预测$\mathbi{x}$的任务,并且在输入中注入噪声会让模型更加鲁棒,因为模型可以通过训练数据学会如何利用句子中噪声以外的信息来处理噪声并得到正确的输出。通常来说,噪声函数$\mathrm{noise}$有三种形式,如表\ref{tab:16-1}所示。
\begin{table}[h]
\centering
......@@ -829,7 +829,253 @@ P(t|s)=\frac{\mathrm{cos}(\mathbi{x},\mathbi{y})/\tau}{\sum_{\mathbi{y}^{'}}\mat
\parinterval 实际当中三种形式的噪声函数都会被使用到。其中在交换方法中,越相近的词越容易被交换,并且被交换的词对数量是有限的;而删除和空白方法里,词被删除和替换的概率通常都非常低,如$0.1$等。
\vspace{0.5em}
\end{itemize}
%----------------------------------------------------------------------------------------
% NEW SECTION
%----------------------------------------------------------------------------------------
\section{领域适应}
\parinterval 机器翻译经常面临训练数据所属领域与待翻译文本领域不一致的问题。不同领域的句子存在着很大的区别,比如口语领域的常用词和句子结构较为简单,而体育、化学等专业领域的单词和句子结构较为复杂。此外,不同领域之间还存在着较为严重的一词多义问题,即同一个词在不同领域中经常会有不同的含义,如图\ref{fig:16-1-wbh}所示。
\begin{figure}[h]
\centering
\includegraphics[scale=3]{Chapter16/Figures/figure-the-meaning-of-pitch-in-different-fields.jpg}
\caption{单词pitch(图里标红)在不同领域的不同词义实例}
\label{fig:16-1-wbh}
\end{figure}
\parinterval 在机器翻译任务中,某些领域的双语数据相对容易获取,如新闻、口语等领域,所以机器翻译在这些领域上表现较佳。然而,即使是富资源语种,在化学、医学等专业领域的双语数据也十分有限。如果直接使用低资源领域的数据训练机器翻译模型,数据稀缺问题会导致模型的性能较差\upcite{DBLP:conf/iccv/SunSSG17}。而混合多个领域的数据进行训练时,不同领域的数据量不平衡又会导致数据较少的领域训练不充分,模型容易忽略低资源领域的知识,使得在低资源领域上的翻译结果不尽如人意\upcite{DBLP:conf/acl/DuhNST13}。
\parinterval {\small\bfnew{领域适应}}(Domain Adaptation)方法是利用{\small\bfnew{其他领域}}(Source domain, 又称源领域)知识来改进{\small\bfnew{特定领域}}(Target domain, 又称目标领域)翻译效果的方法,可以有效地减少模型对目标领域数据的依赖。领域适应主要从两个方向出发:
\begin{itemize}
\vspace{0.5em}
\item 基于数据的方法。利用其它领域的双语数据或领域内单语数据进行数据选择或数据增强,来提升数据量。
\vspace{0.5em}
\item 基于模型的方法。针对领域适应问题开发特定的模型结构、训练策略或推断方法。
\vspace{0.5em}
\end{itemize}
\parinterval 在统计机器翻译时代,从数据或模型角度提升机器翻译模型在特定领域上的翻译性能就已经备受关注,这些技术和思想也为神经机器翻译中的领域适应技术提供了参考。
%----------------------------------------------------------------------------------------
% NEW SUB-SECTION
%----------------------------------------------------------------------------------------
\subsection{统计机器翻译中的领域适应}
\parinterval 统计机器翻译的领域适应通过增加目标领域的数据量或设计特定的模型,使得模型生成更符合目标领域风格的翻译结果。方法可以分为基于混合模型的方法\upcite{DBLP:conf/wmt/FosterK07,DBLP:conf/iwslt/BisazzaRF11,niehues2012detailed,DBLP:conf/acl/SennrichSA13,joty2015using,imamura2016multi}、基于数据加权的方法\upcite{DBLP:conf/emnlp/MatsoukasRZ09,DBLP:conf/emnlp/FosterGK10,shah2012general,DBLP:conf/iwslt/MansourN12,DBLP:conf/cncl/ZhouCZ15}、基于数据选择的方法\upcite{DBLP:conf/lrec/EckVW04,DBLP:conf/coling/ZhaoEV04,DBLP:conf/acl/MooreL10,DBLP:conf/acl/DuhNST13,DBLP:conf/coling/HoangS14,joty2015using,chen2016bilingual}和基于伪数据的方法\upcite{DBLP:conf/iwslt/Ueffing06,DBLP:conf/coling/WuWZ08,DBLP:conf/iwslt/Schwenk08,DBLP:conf/wmt/BertoldiF09,DBLP:conf/wmt/LambertSSA11}
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
%----------------------------------------------------------------------------------------
\subsubsection{1. 基于混合模型的方法}
\parinterval 不同领域的数据存在着共性,但又各有风格,使用多领域数据训练出多个模型、分情况处理问题可能会带来更好的效果,例如对疑问句和陈述句分别使用两个模型进行推断,效果会更好\upcite{DBLP:conf/eacl/Sennrich12}。混合模型是统计机器学习理论中的传统实现技术之一,其思路是为每个语料库分别训练语言模型、翻译模型和重排序模型等统计机器翻译子模型,然后将它们组合起来以实现最佳性能\upcite{DBLP:conf/wmt/FosterK07}。混合模型方法的步骤如下:
\begin{itemize}
\vspace{0.5em}
\item 将训练数据根据所在的领域分为几个不同的部分。
\vspace{0.5em}
\item 利用每一部分数据训练一个子模型。
\vspace{0.5em}
\item 根据测试数据的上下文信息适当为每个子模型调整权重。
\vspace{0.5em}
\item 对不同的领域使用调整出的最佳权重,加权整合多个子模型的输出。
\vspace{0.5em}
\end{itemize}
\parinterval 混合模型方法为统计机器翻译的每个模块分别训练多个子模型,并有权重地组合多个子模型的结果,这对之后神经机器翻译中的模型融合、加权推断等方法有一定的启发意义。
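\parinterval 下面用一段高度简化的Python代码来示意子模型结果的加权组合。这里只考虑对若干子模型给出的概率做线性插值,其中的概率和权重均为虚构的示例数值:
\begin{verbatim}
def mixture_score(sub_model_probs, weights):
    """混合模型:按照领域权重对各个子模型给出的概率做线性插值"""
    assert abs(sum(weights) - 1.0) < 1e-6
    return sum(w * p for w, p in zip(weights, sub_model_probs))

# 三个领域子模型对同一个翻译候选给出的概率(示意数值)
probs = [0.30, 0.05, 0.10]
# 根据测试句子的上下文为各个子模型调整出的权重(示意数值)
weights = [0.6, 0.1, 0.3]
score = mixture_score(probs, weights)
\end{verbatim}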
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
%----------------------------------------------------------------------------------------
\subsubsection{2. 基于数据加权的方法}
\parinterval 在真实场景中,针对每个领域单独训练一个机器翻译模型是不现实的。所以,通常的策略是训练单个机器翻译模型来支持多领域的翻译。虽然混合多个领域的数据可以有效增加训练数据规模,但正如前面所说,由于各个领域样本的数据量不平衡,在训练数据稀缺的领域上,模型表现往往不尽如人意。一种思路认为,数据量较少的领域数据具有更大的学习价值,可以提供较为珍贵的知识和信息,应该在训练过程中获得更大的权重。数据加权方法使用规则或统计方法对每个实例或领域进行评分,并以此作为权重来训练统计机器翻译模型。一般来说,数据量较少的领域数据权重更高,从而使这些更有价值的数据发挥出更大的作用\upcite{DBLP:conf/emnlp/MatsoukasRZ09,DBLP:conf/emnlp/FosterGK10}。
\parinterval 另一种加权方法是通过数据重新采样对语料库进行加权\upcite{DBLP:conf/wmt/ShahBS10,rousseau2011lium}。语料库加权方法通过构建目标领域语言模型,比较各个语料库与目标领域的相似性,赋予相似的语料库更高的权重。
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
%----------------------------------------------------------------------------------------
\subsubsection{3. 基于数据选择的方法}
\parinterval 源领域大规模的双语数据中通常会包含一些和目标领域相似的句子,基于数据选择的基本思想是通过相似度函数选择出与目标领域相似的源领域数据,再把选择出的源领域数据与原始目标领域数据混合训练模型,增加目标领域的双语数据规模,从而提高模型在目标领域的性能。相似度可以通过语言模型\upcite{DBLP:conf/lrec/EckVW04,DBLP:conf/coling/ZhaoEV04,moore2010intelligent,DBLP:conf/acl/DuhNST13}、联合模型\upcite{DBLP:conf/coling/HoangS14,joty2015using}、卷积神经网络\upcite{chen2016bilingual}等方法进行计算。
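\parinterval 以基于语言模型的相似度计算为例,数据选择的过程可以用下面的Python草图来示意。这里假设已经有了领域内语言模型和通用(领域外)语言模型的打分函数,示例中用简单的函数代替真实的语言模型:
\begin{verbatim}
def select_by_lm(candidates, lm_in, lm_out, top_k):
    """按领域内与领域外语言模型得分之差为句子打分,选出与目标领域最相似的句子"""
    scored = [(lm_in(s) - lm_out(s), s) for s in candidates]
    scored.sort(reverse=True)                  # 得分越高,越接近目标领域
    return [s for _, s in scored[:top_k]]

# 示意:用简单的打分函数代替真实的语言模型
lm_in = lambda s: -0.1 * len(s.split())
lm_out = lambda s: -0.2 * len(s.split())
selected = select_by_lm(["how are you", "the reaction of sodium"], lm_in, lm_out, 1)
\end{verbatim}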
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
%----------------------------------------------------------------------------------------
\subsubsection{4. 基于伪数据的方法}
\parinterval 数据选择方法可以从源领域中选择出和目标领域相似的训练数据,但可用的数据量较为有限。因此,另外一种思路是对现有的双语数据进行修改,或通过单语数据生成伪数据来增加数据量。常用的做法包括:通过信息检索\upcite{DBLP:conf/acl/UtiyamaI03}、平行词嵌入\upcite{DBLP:conf/acl/MarieF17}等方法生成伪平行句对,或者生成单语$n$元模型\upcite{DBLP:conf/emnlp/WangZLUS14}、平行短语对\upcite{DBLP:conf/coling/WangZLUS16,chu2015integrated}来增加可使用的伪双语数据。统计机器翻译可以有效地利用$n$元模型或平行短语对来提高模型性能。
\parinterval 其中,{\small\bfnew{自学习}}(self-training)是一种具有代表性的生成伪数据的方法\upcite{DBLP:journals/tit/Scudder65a}。自学习通过源领域的双语训练数据训练一个基准翻译系统,然后对目标领域的单语数据进行翻译,再从翻译候选集合中选择高质量的译文和源语言句子组合成为双语句对,之后将其加入到训练数据中重新训练翻译系统,该过程将一直迭代到翻译性能稳定为止\upcite{DBLP:conf/iwslt/Ueffing06}。基于自学习的统计机器翻译可以从一个性能较好的源领域机器翻译模型出发,逐步引入目标领域的知识,可以显著提高目标领域的翻译性能。
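\parinterval 自学习的迭代流程可以用如下的Python草图来概括。其中train\_fn、score\_fn等接口均为假设的示意性接口,并非某个具体工具的真实函数:
\begin{verbatim}
def self_training(train_fn, score_fn, bitext, mono_src, rounds=3, threshold=0.5):
    """自学习(示意流程):迭代地翻译目标领域单语数据,筛选高质量伪句对并重新训练。
    train_fn(data) 返回一个带 translate(s) 方法的翻译模型,
    score_fn(s, t) 返回译文质量得分,二者均为假设的接口。"""
    data = list(bitext)
    model = None
    for _ in range(rounds):
        model = train_fn(data)                                # (重新)训练翻译系统
        pseudo = [(s, model.translate(s)) for s in mono_src]  # 翻译目标领域单语数据
        good = [(s, t) for s, t in pseudo if score_fn(s, t) > threshold]
        data = list(bitext) + good                            # 用高质量伪句对扩充训练数据
    return model
\end{verbatim}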
\parinterval 随着神经机器翻译的不断发展,许多基于神经网络的领域适应方法也被提出\upcite{DBLP:conf/coling/ChuW18},其核心思想仍和统计机器翻译时期相似,也是从数据角度和模型角度两方面进行改进。不同之处在于,统计机器翻译模型是多个独立模型的组合,而神经机器翻译是一个整体的模型,因此统计机器翻译的模型方法不能直接应用于神经机器翻译。
%----------------------------------------------------------------------------------------
% NEW SUB-SECTION
%----------------------------------------------------------------------------------------
\subsection{神经机器翻译中的数据方法}
\parinterval 增加训练数据是解决低资源领域性能较差问题的一个主要思路。受统计机器翻译时代技术的启发,可以通过类似的思想来增加神经机器翻译的训练数据量。下面分别介绍基于多领域数据的方法、基于数据选择的方法和基于单语数据的方法。
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
%----------------------------------------------------------------------------------------
\subsubsection{1. 基于多领域数据的方法}
\parinterval 一种简单的思路是利用其他领域的数据来辅助低资源领域训练,即基于多领域数据的领域适应\upcite{chu2017empirical,DBLP:journals/corr/abs-1708-08712,DBLP:conf/acl/WangTNYCP20,DBLP:conf/acl/JiangLWZ20}。实际上,多领域建模也可以看作是一种多语言建模问题(\ref{multilingual-translation-model}节)。通过借助其他相似领域的训练数据,可以缓解目标领域数据缺乏的问题。比如,想要训练网络小说领域的翻译系统,可以通过加入文学、字幕等领域的翻译数据来增加数据量。
\parinterval 但在神经机器翻译中,多领域数据不平衡问题也会导致数据量较少的领域训练不充分,从而翻译性能较差。目前比较常用的多领域数据方法是在数据中加入标签来提高神经机器翻译模型对不同领域的辨别能力\upcite{DBLP:conf/acl/ChuDK17},该方法主要通过以下两种策略来缓解领域不平衡问题:
\begin{itemize}
\vspace{0.5em}
\item 在不同领域的源语言句子句首加上相应的标签来指定该句子所对应的领域,使神经机器翻译模型在预测过程中可以生成更符合领域特性的句子。
\vspace{0.5em}
\item 把数据量较小的领域数据复制数倍,来平衡各个领域的数据量,使模型对各个领域给予平等的关注。
\vspace{0.5em}
\end{itemize}
\parinterval 这种方法的优点在于思路简单、易于实现,可以大大提高可用的数据量,但不同领域的数据不可避免地会产生一些干扰,在实际使用中可能会产生词汇预测错误的问题。
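\parinterval 上述两种策略可以用下面的Python草图来示意,即在源语言句子句首加领域标签,并把小领域的数据复制数倍。其中的语料和标签格式均为示意:
\begin{verbatim}
def build_multi_domain_data(corpora):
    """多领域数据处理:句首加领域标签,并对数据量较小的领域进行复制(过采样)"""
    max_size = max(len(sents) for sents in corpora.values())
    data = []
    for domain, sents in corpora.items():
        ratio = max(1, round(max_size / len(sents)))      # 小领域复制的倍数
        for src, tgt in sents * ratio:
            data.append(("<" + domain + "> " + src, tgt)) # 句首加上领域标签
    return data

corpora = {"novel": [("他 转身 离开", "he turned and left")],
           "news":  [("经济 增长", "economic growth"), ("股市 上涨", "stocks rose")]}
mixed = build_multi_domain_data(corpora)
\end{verbatim}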
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
%----------------------------------------------------------------------------------------
\subsubsection{2. 基于数据选择的方法}
\parinterval 既然多领域数据的方法会对模型产生一些干扰,那么能否从其他领域数据中选择与目标领域比较相近的数据来提升数据量呢?与统计机器翻译相似,可以通过使用困惑度\upcite{DBLP:conf/emnlp/AxelrodHG11}或JS散度\upcite{DBLP:conf/icdm/Remus12}等方法计算源领域数据与目标领域数据的相似度,从而选择较为相似的数据以增加目标领域的数据量,提升模型能力。
\parinterval 一种相似度衡量方法是根据句子的词嵌入来计算\upcite{DBLP:conf/acl/WangFUS17},即把其他领域的双语数据按照与目标领域句子的词嵌入相似度进行排序,设定一个阈值来选择与目标领域比较相似的句子,将其与目标领域的数据混合用于模型训练。除了直接混合两部分数据,还可以改变训练中使用数据的方式,从而更有效地利用训练数据,具体方法可以参考\ref{modeling-methods-in-neural-machine-translation}节的内容。
\parinterval 数据选择充分地利用了其他领域的双语数据,使低资源问题得到了一定程度的缓解,模型可以学到更多目标领域相关的语言学知识,是目前比较常用的一个方法。
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
%----------------------------------------------------------------------------------------
\subsubsection{3. 基于单语数据的方法}
\parinterval 尽管目标领域的双语数据十分有限,但通常存在大量可用的单语数据。例如在网络小说翻译任务中,只有少量的双语数据可用,而网络上存在着丰富的单语小说数据。\ref{effective-use-of-data}节中提到了很多在低资源场景下利用单语数据的方法,比如进行数据增强或利用语言模型等。这些方法均可以直接应用在领域适应任务上,有效地利用领域内的单语数据可以显著提高机器翻译性能。
\parinterval 此外,如果目标领域的双语数据极度稀缺,甚至完全没有任何双语数据,这时可以使用\ref{unsupervised-dictionary-induction}节中提到的无监督词典归纳方法从目标领域中归纳出双语词典,然后将目标领域的目标端单语数据通过逐词翻译的方法生成伪数据\upcite{DBLP:conf/acl/HuXNC19},即对每个单词根据双语词典进行对应翻译,构建伪平行语料,用来训练目标领域的神经机器翻译模型。
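\parinterval 这种基于双语词典的伪数据构造过程可以用如下Python草图来示意。这里的词典和句子都是虚构的小例子:
\begin{verbatim}
def word_by_word_translate(sentence, lexicon):
    """利用归纳出的双语词典对句子进行逐词翻译,词典中没有的词保留原词"""
    return " ".join(lexicon.get(w, w) for w in sentence.split())

def build_pseudo_parallel(tgt_mono, tgt2src_lexicon):
    """对目标领域的目标端单语数据逐词翻译,构造(伪源语言句子, 目标语言句子)句对"""
    return [(word_by_word_translate(t, tgt2src_lexicon), t) for t in tgt_mono]

# 虚构的小词典与单语句子
lexicon = {"reaction": "反应", "sodium": "钠"}
pseudo = build_pseudo_parallel(["the reaction of sodium"], lexicon)
\end{verbatim}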
%----------------------------------------------------------------------------------------
% NEW SUB-SECTION
%----------------------------------------------------------------------------------------
\subsection{神经机器翻译中的模型方法}\label{modeling-methods-in-neural-machine-translation}
\parinterval 一个神经机器翻译模型的实现通常分为三个部分:设计神经网络模型结构、制定训练策略来训练模型、通过模型推断生成翻译结果。因此,一个自然的想法是从这三个步骤出发研究领域适应问题:
\begin{itemize}
\vspace{0.5em}
\item 基于模型结构的方法,设计一个更适用于领域适应问题的模型结构\upcite{2015OnGulcehre,2019Non,britz2017effective,DBLP:conf/ranlp/KobusCS17}
\vspace{0.5em}
\item 基于训练策略的方法,制定能够更好利用多领域数据的训练方法\upcite{DBLP:conf/emnlp/WangULCS17,DBLP:conf/aclnmt/ChenCFL17,DBLP:journals/corr/abs-1906-03129,dakwale2017finetuning,chu2015integrated,DBLP:conf/emnlp/ZengLSGLYL19,barone2017regularization,DBLP:conf/acl/SaundersB20,DBLP:conf/acl/LewisLGGMLSZ20}
\vspace{0.5em}
\item 基于模型推断的方法,生成更符合目标领域的翻译结果\upcite{DBLP:conf/emnlp/DouWHN19,khayrallah2017neural,DBLP:journals/corr/FreitagA16,DBLP:conf/acl/SaundersSGB19}
\vspace{0.5em}
\end{itemize}
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
%----------------------------------------------------------------------------------------
\subsubsection{1. 基于模型结构的方法}
\parinterval 统计机器翻译时代基于模型结构的方法集中在提升词对齐模型、语言模型等子模型的领域适应性,从而提升统计机器翻译模型在特定领域的表现。而神经机器翻译模型是一个端到端的框架,无法像统计机器翻译一样从增强子模型适应性的角度开展工作。
\parinterval 在使用多领域数据时,神经机器翻译的结构无法有效地利用多领域语料库的信息多样性,混合多个相差较大的领域数据进行训练会使单个领域的翻译性能下降。为了解决这一问题,一个比较典型的做法是在使用多领域数据训练时,如图\ref{fig:16-2-wbh}所示,在神经机器翻译模型的编码器中添加一个判别器,使用判别器预测输入句子的领域\upcite{DBLP:conf/wmt/BritzLP17}。具体的做法为:在编码器的顶部添加一个判别器网络,这个判别器使用源语言句子$\mathbi{x}$的编码器表示$H$作为输入,预测句子所属的领域标签$d$。为了使预测领域标签$d$的正确概率$\funp{P}(d|H)$最大,模型在训练过程中最小化如式\ref{eq:16-1-wbh}所示的判别器损失$\funp{L}_{\rm{disc}}$。
\begin{figure}[h]
\centering
\includegraphics[scale=1]{Chapter16/Figures/figure-schematic-of-the-domain-discriminator.jpg}
\caption{领域判别器示意图}
\label{fig:16-2-wbh}
\end{figure}
\begin{eqnarray}
\funp{L}_{\rm{disc}} & = & -\log{\funp{P}(d|H)}
\label{eq:16-1-wbh}
\end{eqnarray}
\noindent 其中,$H$是编码器的隐藏状态。神经机器翻译模型同时优化对领域标签的预测和对翻译结果的预测,如公式\ref{eq:16-2-wbh}和公式\ref{eq:16-3-wbh}所示。这种方法使模型具备了识别不同领域数据的能力,从而可以针对不同领域进行特定的建模,生成符合领域特征的翻译结果。
\begin{eqnarray}
\funp{L}_{\rm{gen}} & = & -\log{\funp{P}(\mathbi{y}|\mathbi{x})}\label{eq:16-2-wbh} \\
\funp{L} & = & \funp{L}_{\rm{disc}}+\funp{L}_{\rm{gen}}\label{eq:16-3-wbh}
\end{eqnarray}
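\parinterval 上述联合训练目标的组合方式可以用下面的Python片段来示意。这里直接用数值代替判别器和翻译模型给出的概率,仅用于说明损失是如何相加的:
\begin{verbatim}
import numpy as np

def joint_loss(p_domain, p_translation):
    """联合损失:判别器损失与翻译损失之和,即 L = L_disc + L_gen"""
    l_disc = -np.log(p_domain)       # -log P(d|H):判别器对正确领域标签的概率
    l_gen = -np.log(p_translation)   # -log P(y|x):翻译模型对参考译文的概率
    return l_disc + l_gen

# 示意:判别器以0.8的概率预测出正确领域,翻译模型给参考译文0.6的概率
loss = joint_loss(0.8, 0.6)
\end{verbatim}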
\parinterval 此外,还可以利用词频-逆文档频率方法\upcite{DBLP:journals/ibmrd/Luhn58}预测输入句子的领域标签,并在词嵌入层加上领域特定的嵌入表示。还有一些工作则通过修改模型结构,将语言模型融合到神经机器翻译模型中\upcite{2015OnGulcehre,DBLP:conf/emnlp/DomhanH17}。
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
%----------------------------------------------------------------------------------------
\subsubsection{2. 基于训练策略的方法}
\parinterval 因为特定领域的训练数据通常十分稀缺,所以如何更充分地利用数据是一个重要问题。基于训练策略的领域适应方法是神经机器翻译中较为广泛使用的方法,具有易于实现、作用显著的优点\upcite{DBLP:conf/naacl/SimianerWD19}。基于训练策略的领域适应方法是指在模型的训练阶段改变获得训练目标的过程,通过加权或者微调等方法更加高效地利用数据。
\parinterval 受到统计机器翻译时代数据加权方法的启发,一个简单的思路是给神经机器翻译的训练数据分配不同的训练权重,从而使和目标领域更相似的数据发挥更大的作用,并减少不相似数据的干扰。一种常用的做法是通过目标领域内和领域外语言模型计算样本的权重\upcite{DBLP:conf/emnlp/WangULCS17}。领域内语言模型对句子的打分越高,表示该句子与目标领域越相似;反之,领域外语言模型对句子的打分越高,表示该句子可能与目标领域无关,会对训练过程造成一些干扰。
\parinterval 与句子级别的加权方法相似,加权思想还可以应用在词级别,即对每个词进行权重评分,对目标领域中的词赋予更高的权重,以使模型倾向于生成更多目标领域的词\upcite{DBLP:journals/corr/abs-1906-03129}。
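\parinterval 无论是句子级还是词级别的加权,其核心都是把权重乘到训练损失上。下面给出一个句子级加权损失的Python草图,其中的对数似然和权重均为示意数值:
\begin{verbatim}
def weighted_nll(sentence_logps, weights):
    """句子级加权的负对数似然:与目标领域越相似的句子,权重越大"""
    return sum(-w * lp for w, lp in zip(weights, sentence_logps)) / len(weights)

# 两个训练句子的对数似然,以及由领域内/领域外语言模型打分差得到的权重(示意数值)
loss = weighted_nll([-3.2, -4.0], [0.9, 0.3])
\end{verbatim}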
\parinterval 数据选择方法会降低训练数据的数据量,而数据量较少时模型表现可能较差。针对这个问题,一种方法是在不同的训练轮次动态地改变训练数据子集。动态数据选择使得每轮的训练数据均小于全部数据量,但每轮中的训练数据均不相同,前一轮没有使用的数据可能在当前轮被包括进来,由此可以缓解训练数据减少的问题。一种做法是先将完整的数据送入模型,再根据相似度逐次减少每轮的数据量,最后得到在目标领域上效果最好的领域适应模型\upcite{DBLP:conf/emnlp/WeesBM17};或者将最相似的句子最先送入模型,使模型最先学到跟目标领域最相关的知识,为之后的训练打下良好的基础\upcite{DBLP:conf/naacl/ZhangSKMCD19}。
\parinterval 还有一种方法是不从随机状态开始训练网络,而是使用翻译性能较好的源领域模型作为初始状态,再在目标领域数据上进行微调,因为源领域模型中包含着一些可以被目标领域借鉴的通用知识。比如,想获得口语领域的机器翻译模型,可以使用新闻领域模型作为初始状态来训练口语领域模型。同时,微调方法也经常与预训练配合使用。微调方法的具体过程如下:
\begin{itemize}
\vspace{0.5em}
\item 在资源丰富的源领域语料库上对系统进行训练直至收敛。
\vspace{0.5em}
\item 在资源贫乏的目标领域语料库上进行训练,对系统参数进行微调使其更适应目标领域。
\vspace{0.5em}
\end{itemize}
\parinterval 然而微调的方法会带来严重的灾难性遗忘问题,即在目标领域上过拟合,导致在源领域上的翻译性能大幅度下降。如果想要保证模型在目标领域上具有较好翻译性能的同时,在源领域的翻译性能也不下降,一个比较常用的方法是进行混合微调\upcite{DBLP:conf/acl/ChuDK17}。具体做法是先在源领域数据上训练一个神经机器翻译模型,然后将目标领域数据复制数倍,使其和源领域数据量相当,之后将两部分数据混合,对神经机器翻译模型进行微调。混合微调方法既降低了目标领域数据量小导致的过拟合问题,又带来了更好的微调性能。除了混合微调外,还有研究人员使用\ref{multilingual-translation-model}节的知识蒸馏方法缓解灾难性遗忘问题,即对源领域和目标领域进行多次循环知识蒸馏,迭代学习对方领域的知识,可以保证在源领域和目标领域上的翻译性能共同逐步上升\upcite{DBLP:conf/emnlp/ZengLSGLYL19}。此外,过拟合导致的灾难性遗忘也可以使用{\red{13.2节的正则化}}方法来缓解\upcite{barone2017regularization}。
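\parinterval 混合微调中数据的构造方式可以用下面的Python草图来示意。其中train、finetune等仅在注释中出现,代表假设的训练与微调过程:
\begin{verbatim}
def mixed_finetune_data(out_domain, in_domain):
    """混合微调的数据构造:把目标领域数据复制数倍,使其与源领域数据量相当,再混合"""
    ratio = max(1, len(out_domain) // max(1, len(in_domain)))
    return out_domain + in_domain * ratio

# 整体流程(示意):先在源领域数据上训练,再用混合数据微调
#   model = train(out_domain_data)
#   model = finetune(model, mixed_finetune_data(out_domain_data, in_domain_data))
out_domain_data = [("新闻 句子 %d" % i, "news sentence %d" % i) for i in range(1000)]
in_domain_data = [("口语 句子", "spoken sentence")]
mixed = mixed_finetune_data(out_domain_data, in_domain_data)
\end{verbatim}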
%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
%----------------------------------------------------------------------------------------
\subsubsection{3. 基于模型推断的方法}
\parinterval 推断是指模型根据输入序列的表示生成输出序列的过程,是机器翻译模型产生翻译结果的最后一步。目前只有很少的工作集中在优化推断算法本身\upcite{khayrallah2017neural},即通过更适用于领域适应任务的推断算法来选择最佳的结果;大部分基于模型推断的方法把重点放在把多个模型融合成一个模型,以增强模型的推断能力,从而获得更适应目标领域的译文。目前比较常用的方法是集成推断,即把多个领域的模型使用{\red{14.5节(多模型集成)}}的多模型集成方法集成为一个模型用于推断\upcite{DBLP:journals/corr/FreitagA16},具体过程如下:
\begin{itemize}
\vspace{0.5em}
\item 在目标领域和多个源领域数据上分别训练多个独立的机器翻译模型。
\vspace{0.5em}
\item 把多个模型集成为一个模型。
\vspace{0.5em}
\item 使用目标领域数据对集成模型进行微调。
\vspace{0.5em}
\item 使用微调后的模型进行推断。
\vspace{0.5em}
\end{itemize}
\parinterval 集成推断方法相较于直接使用大量数据训练成一个模型的主要优势在于速度快,多个领域的模型可以并行地独立训练,使训练时间大大缩短。集成推断也可以结合加权思想,对不同领域的句子,赋予每个模型不同的先验权重进行推断,来获得最佳的推断结果\upcite{DBLP:conf/acl/SaundersSGB19}。还有部分工作研究在推断过程中融入语言模型\upcite{2015OnGulcehre,DBLP:conf/emnlp/DomhanH17}或目标领域的罕见词\upcite{DBLP:conf/naacl/BapnaF19}
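\parinterval 集成推断中对多个模型预测结果的组合可以用如下Python草图来示意。这里以某一解码时刻的词表概率分布为例,模型数量、先验权重均为示意:
\begin{verbatim}
import numpy as np

def ensemble_predict(model_probs, prior_weights=None):
    """集成推断:对多个领域模型在当前位置给出的词表概率分布做(加权)平均"""
    probs = np.array(model_probs)
    if prior_weights is None:
        prior_weights = np.ones(len(probs)) / len(probs)
    avg = np.average(probs, axis=0, weights=prior_weights)
    return int(np.argmax(avg))       # 返回得分最高的词的下标

# 两个领域模型对大小为3的词表给出的概率分布,以及它们的先验权重(示意数值)
best = ensemble_predict([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]], [0.7, 0.3])
\end{verbatim}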
%----------------------------------------------------------------------------------------
% NEW SECTION
......
......@@ -9827,7 +9827,678 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:conf/iccv/SunSSG17,
author = {Chen Sun and
Abhinav Shrivastava and
Saurabh Singh and
Abhinav Gupta},
title = {Revisiting Unreasonable Effectiveness of Data in Deep Learning Era},
pages = {843--852},
publisher = {{IEEE} Computer Society},
year = {2017}
}
@inproceedings{DBLP:conf/acl/DuhNST13,
author = {Kevin Duh and
Graham Neubig and
Katsuhito Sudoh and
Hajime Tsukada},
title = {Adaptation Data Selection using Neural Language Models: Experiments
in Machine Translation},
pages = {678--683},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2013}
}
@inproceedings{DBLP:conf/wmt/FosterK07,
author = {George F. Foster and
Roland Kuhn},
title = {Mixture-Model Adaptation for {SMT}},
pages = {128--135},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2007}
}
@inproceedings{DBLP:conf/iwslt/BisazzaRF11,
author = {Arianna Bisazza and
Nick Ruiz and
Marcello Federico},
title = {Fill-up versus interpolation methods for phrase-based {SMT} adaptation},
pages = {136--143},
publisher = {International Workshop on Spoken Language Translation},
year = {2011}
}
@inproceedings{niehues2012detailed,
title={Detailed analysis of different strategies for phrase table adaptation in SMT},
author={Niehues, Jan and Waibel, Alex},
publisher={Association for Machine Translation in the Americas},
year={2012}
}
@inproceedings{DBLP:conf/acl/SennrichSA13,
author = {Rico Sennrich and
Holger Schwenk and
Walid Aransa},
title = {A Multi-Domain Translation Model Framework for Statistical Machine
Translation},
pages = {832--840},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2013}
}
@article{joty2015using,
title={Using joint models for domain adaptation in statistical machine translation},
author={Durrani, Nadir and Sajjad, Hassan and Joty, Shafiq and Abdelali, Ahmed and Vogel, Stephan},
journal={Proceedings of MT Summit XV},
pages={117},
year={2015}
}
@article{imamura2016multi,
title={Multi-domain adaptation for statistical machine translation based on feature augmentation},
author={Imamura, Kenji and Sumita, Eiichiro},
journal={Association for Machine Translation in the Americas},
pages={79},
year={2016}
}
@inproceedings{DBLP:conf/emnlp/MatsoukasRZ09,
author = {Spyros Matsoukas and
Antti-Veikko I. Rosti and
Bing Zhang},
title = {Discriminative Corpus Weight Estimation for Machine Translation},
pages = {708--717},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2009}
}
@inproceedings{DBLP:conf/emnlp/FosterGK10,
author = {George F. Foster and
Cyril Goutte and
Roland Kuhn},
title = {Discriminative Instance Weighting for Domain Adaptation in Statistical
Machine Translation},
pages = {451--459},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2010}
}
@article{shah2012general,
title={A general framework to weight heterogeneous parallel data for model adaptation in statistical machine translation},
author={Shah, Kashif and Barrault, Lo{\"\i}c and Schwenk, Holger},
journal={MT Summit, Octobre},
year={2012}
}
@inproceedings{DBLP:conf/iwslt/MansourN12,
author = {Saab Mansour and
Hermann Ney},
title = {A simple and effective weighted phrase extraction for machine translation
adaptation},
pages = {193--200},
publisher = {International Workshop on Spoken Language Translation},
year = {2012}
}
@inproceedings{DBLP:conf/cncl/ZhouCZ15,
author = {Xinpeng Zhou and
Hailong Cao and
Tiejun Zhao},
title = {Domain Adaptation for {SMT} Using Sentence Weight},
volume = {9427},
pages = {153--163},
publisher = {Springer},
year = {2015}
}
@inproceedings{DBLP:conf/lrec/EckVW04,
author = {Matthias Eck and
Stephan Vogel and
Alex Waibel},
title = {Language Model Adaptation for Statistical Machine Translation Based
on Information Retrieval},
publisher = {European Language Resources Association},
year = {2004}
}
@inproceedings{DBLP:conf/coling/ZhaoEV04,
author = {Bing Zhao and
Matthias Eck and
Stephan Vogel},
title = {Language Model Adaptation for Statistical Machine Translation via
Structured Query Models},
publisher = {International Conference on Computational Linguistics},
year = {2004}
}
@inproceedings{DBLP:conf/acl/MooreL10,
author = {Robert C. Moore and
William D. Lewis},
title = {Intelligent Selection of Language Model Training Data},
pages = {220--224},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2010}
}
@inproceedings{DBLP:conf/coling/HoangS14,
author = {Cuong Hoang and
Khalil Sima'an},
title = {Latent Domain Translation Models in Mix-of-Domains Haystack},
pages = {1928--1939},
publisher = {International Conference on Computational Linguistics},
year = {2014}
}
@inproceedings{chen2016bilingual,
title={Bilingual methods for adaptive training data selection for machine translation},
author={Chen, Boxing and Kuhn, Roland and Foster, George and Cherry, Colin and Huang, Fei},
booktitle={Association for Machine Translation in the Americas},
pages={93--103},
year={2016}
}
@inproceedings{DBLP:conf/iwslt/Ueffing06,
author = {Nicola Ueffing},
title = {Using monolingual source-language data to improve {MT} performance},
pages = {174--181},
publisher = {International Workshop on Spoken Language Translation},
year = {2006}
}
@inproceedings{DBLP:conf/coling/WuWZ08,
author = {Hua Wu and
Haifeng Wang and
Chengqing Zong},
title = {Domain Adaptation for Statistical Machine Translation with Domain
Dictionary and Monolingual Corpora},
publisher = {International Conference on Computational Linguistics},
pages = {993--1000},
year = {2008}
}
@inproceedings{DBLP:conf/iwslt/Schwenk08,
author = {Holger Schwenk},
title = {Investigations on large-scale lightly-supervised training for statistical
machine translation},
pages = {182--189},
publisher = {International Workshop on Spoken Language Translation},
year = {2008}
}
@inproceedings{DBLP:conf/wmt/BertoldiF09,
author = {Nicola Bertoldi and
Marcello Federico},
title = {Domain Adaptation for Statistical Machine Translation with Monolingual
Resources},
pages = {182--189},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2009}
}
@inproceedings{DBLP:conf/wmt/LambertSSA11,
author = {Patrik Lambert and
Holger Schwenk and
Christophe Servan and
Sadaf Abdul-Rauf},
title = {Investigations on Translation Model Adaptation Using Monolingual Data},
pages = {284--293},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2011}
}
@inproceedings{DBLP:conf/eacl/Sennrich12,
author = {Rico Sennrich},
title = {Perplexity Minimization for Translation Model Domain Adaptation in
Statistical Machine Translation},
pages = {539--549},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2012}
}
@inproceedings{DBLP:conf/wmt/ShahBS10,
author = {Kashif Shah and
Lo{\"{\i}}c Barrault and
Holger Schwenk},
title = {Translation Model Adaptation by Resampling},
pages = {392--399},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2010}
}
@inproceedings{rousseau2011lium,
title={LIUM's systems for the IWSLT 2011 Speech Translation Tasks},
author={Rousseau, Anthony and Bougares, Fethi and Del{\'e}glise, Paul and Schwenk, Holger and Est{\`e}ve, Yannick},
publisher={International Workshop on Spoken Language Translation},
year={2011}
}
@article{moore2010intelligent,
title = {Intelligent selection of language model training data},
author = {Moore, Robert C and Lewis, Will},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2010}
}
@inproceedings{DBLP:conf/acl/UtiyamaI03,
author = {Masao Utiyama and
Hitoshi Isahara},
title = {Reliable Measures for Aligning Japanese-English News Articles and
Sentences},
pages = {72--79},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2003}
}
@inproceedings{DBLP:conf/acl/MarieF17,
author = {Benjamin Marie and
Atsushi Fujita},
title = {Efficient Extraction of Pseudo-Parallel Sentences from Raw Monolingual
Data Using Word Embeddings},
pages = {392--398},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:conf/emnlp/WangZLUS14,
author = {Rui Wang and
Hai Zhao and
Bao-Liang Lu and
Masao Utiyama and
Eiichiro Sumita},
title = {Neural Network Based Bilingual Language Model Growing for Statistical
Machine Translation},
pages = {189--195},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2014}
}
@inproceedings{DBLP:conf/coling/WangZLUS16,
author = {Rui Wang and
Hai Zhao and
Bao-Liang Lu and
Masao Utiyama and
Eiichiro Sumita},
title = {Connecting Phrase based Statistical Machine Translation Adaptation},
pages = {3135--3145},
publisher = {International Conference on Computational Linguistics},
year = {2016}
}
@article{chu2015integrated,
title={Integrated parallel data extraction from comparable corpora for statistical machine translation},
author={Chu, Chenhui},
year={2015},
publisher={Kyoto University}
}
@article{DBLP:journals/tit/Scudder65a,
author = {H. J. Scudder III},
title = {Probability of error of some adaptive pattern-recognition machines},
journal = {{IEEE} Transactions on Information Theory},
volume = {11},
number = {3},
pages = {363--371},
year = {1965}
}
@inproceedings{DBLP:conf/coling/ChuW18,
author = {Chenhui Chu and
Rui Wang},
title = {A Survey of Domain Adaptation for Neural Machine Translation},
pages = {1304--1319},
publisher = {International Conference on Computational Linguistics},
year = {2018}
}
@inproceedings{chu2017empirical,
title={An empirical comparison of domain adaptation methods for neural machine translation},
author={Chu, Chenhui and Dabre, Raj and Kurohashi, Sadao},
publisher = {Annual Meeting of the Association for Computational Linguistics},
pages={385--391},
year={2017}
}
@article{DBLP:journals/corr/abs-1708-08712,
author = {Hassan Sajjad and
Nadir Durrani and
Fahim Dalvi and
Yonatan Belinkov and
Stephan Vogel},
title = {Neural Machine Translation Training in a Multi-Domain Scenario},
journal = {CoRR},
volume = {abs/1708.08712},
year = {2017}
}
@inproceedings{DBLP:conf/acl/WangTNYCP20,
author = {Wei Wang and
Ye Tian and
Jiquan Ngiam and
Yinfei Yang and
Isaac Caswell and
Zarana Parekh},
title = {Learning a Multi-Domain Curriculum for Neural Machine Translation},
pages = {7711--7723},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
@inproceedings{DBLP:conf/acl/JiangLWZ20,
author = {Haoming Jiang and
Chen Liang and
Chong Wang and
Tuo Zhao},
title = {Multi-Domain Neural Machine Translation with Word-Level Adaptive Layer-wise
Domain Mixing},
pages = {1823--1834},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
@inproceedings{DBLP:conf/acl/ChuDK17,
author = {Chenhui Chu and
Raj Dabre and
Sadao Kurohashi},
title = {An Empirical Comparison of Domain Adaptation Methods for Neural Machine
Translation},
pages = {385--391},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:conf/emnlp/AxelrodHG11,
author = {Amittai Axelrod and
Xiaodong He and
Jianfeng Gao},
title = {Domain Adaptation via Pseudo In-Domain Data Selection},
pages = {355--362},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2011}
}
@inproceedings{DBLP:conf/icdm/Remus12,
author = {Robert Remus},
title = {Domain Adaptation Using Domain Similarity- and Domain Complexity-Based
Instance Selection for Cross-Domain Sentiment Analysis},
pages = {717--723},
publisher = {{IEEE} Computer Society},
year = {2012}
}
@inproceedings{DBLP:conf/acl/WangFUS17,
author = {Rui Wang and
Andrew M. Finch and
Masao Utiyama and
Eiichiro Sumita},
title = {Sentence Embedding for Neural Machine Translation Domain Adaptation},
pages = {560--566},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:conf/acl/HuXNC19,
author = {Junjie Hu and
Mengzhou Xia and
Graham Neubig and
Jaime G. Carbonell},
title = {Domain Adaptation of Neural Machine Translation by Lexicon Induction},
pages = {2989--3001},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{2019Non,
  title={Non-Parametric Adaptation for Neural Machine Translation},
  author={Bapna, Ankur and Firat, Orhan},
  booktitle={Annual Conference of the North American Chapter of the Association for Computational Linguistics},
  year={2019}
}
@inproceedings{britz2017effective,
title={Effective domain mixing for neural machine translation},
author={Britz, Denny and Le, Quoc and Pryzant, Reid},
booktitle={Proceedings of the Second Conference on Machine Translation},
pages={118--126},
year={2017}
}
@inproceedings{DBLP:conf/ranlp/KobusCS17,
author = {Catherine Kobus and
Josep Maria Crego and
Jean Senellart},
title = {Domain Control for Neural Machine Translation},
pages = {372--378},
publisher = {International Conference Recent Advances in Natural
Language Processing},
year = {2017}
}
@inproceedings{DBLP:conf/emnlp/WangULCS17,
author = {Rui Wang and
Masao Utiyama and
Lemao Liu and
Kehai Chen and
Eiichiro Sumita},
title = {Instance Weighting for Neural Machine Translation Domain Adaptation},
pages = {1482--1488},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2017}
}
@inproceedings{DBLP:conf/aclnmt/ChenCFL17,
author = {Boxing Chen and
Colin Cherry and
George F. Foster and
Samuel Larkin},
title = {Cost Weighting for Neural Machine Translation Domain Adaptation},
pages = {40--46},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@article{DBLP:journals/corr/abs-1906-03129,
author = {Shen Yan and
Leonard Dahlmann and
Pavel Petrushkov and
Sanjika Hewavitharana and
Shahram Khadivi},
title = {Word-based Domain Adaptation for Neural Machine Translation},
journal = {CoRR},
volume = {abs/1906.03129},
year = {2019}
}
@article{dakwale2017finetuning,
title={Finetuning for neural machine translation with limited degradation across in-and out-of-domain data},
author={Dakwale, Praveen and Monz, Christof},
journal={Proceedings of the XVI Machine Translation Summit},
volume={117},
year={2017}
}
@inproceedings{DBLP:conf/emnlp/ZengLSGLYL19,
author = {Jiali Zeng and
Yang Liu and
Jinsong Su and
Yubin Ge and
Yaojie Lu and
Yongjing Yin and
Jiebo Luo},
title = {Iterative Dual Domain Adaptation for Neural Machine Translation},
pages = {845--855},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2019}
}
@article{barone2017regularization,
title={Regularization techniques for fine-tuning in neural machine translation},
author={Barone, Antonio Valerio Miceli and Haddow, Barry and Germann, Ulrich and Sennrich, Rico},
journal={arXiv preprint arXiv:1707.09920},
year={2017}
}
@inproceedings{DBLP:conf/acl/SaundersB20,
author = {Danielle Saunders and
Bill Byrne},
title = {Reducing Gender Bias in Neural Machine Translation as a Domain Adaptation
Problem},
pages = {7724--7736},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
@inproceedings{khayrallah2017neural,
title={Neural lattice search for domain adaptation in machine translation},
author={Khayrallah, Huda and Kumar, Gaurav and Duh, Kevin and Post, Matt and Koehn, Philipp},
booktitle={Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
pages={20--25},
year={2017}
}
@inproceedings{DBLP:conf/emnlp/DouWHN19,
author = {Zi-Yi Dou and
Xinyi Wang and
Junjie Hu and
Graham Neubig},
title = {Domain Differential Adaptation for Neural Machine Translation},
pages = {59--69},
publisher = {Association for Computational Linguistics},
year = {2019}
}
@article{DBLP:journals/corr/FreitagA16,
author = {Markus Freitag and
Yaser Al-Onaizan},
title = {Fast Domain Adaptation for Neural Machine Translation},
journal = {CoRR},
volume = {abs/1612.06897},
year = {2016}
}
@inproceedings{DBLP:conf/acl/SaundersSGB19,
author = {Danielle Saunders and
Felix Stahlberg and
Adri{\`{a}} de Gispert and
Bill Byrne},
title = {Domain Adaptive Inference for Neural Machine Translation},
pages = {222--228},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/wmt/BritzLP17,
author = {Denny Britz and
Quoc V. Le and
Reid Pryzant},
title = {Effective Domain Mixing for Neural Machine Translation},
pages = {118--126},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@article{DBLP:journals/ibmrd/Luhn58,
author = {Hans Peter Luhn},
title = {The Automatic Creation of Literature Abstracts},
journal = {{IBM} J. Res. Dev.},
volume = {2},
number = {2},
pages = {159--165},
year = {1958}
}
@inproceedings{DBLP:conf/emnlp/DomhanH17,
author = {Tobias Domhan and
Felix Hieber},
title = {Using Target-side Monolingual Data for Neural Machine Translation
through Multi-task Learning},
pages = {1500--1505},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2017}
}
@inproceedings{DBLP:conf/naacl/SimianerWD19,
author = {Patrick Simianer and
Joern Wuebker and
John DeNero},
title = {Measuring Immediate Adaptation Performance for Neural Machine Translation},
pages = {2038--2046},
publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/emnlp/WeesBM17,
author = {Marlies van der Wees and
Arianna Bisazza and
Christof Monz},
title = {Dynamic Data Selection for Neural Machine Translation},
pages = {1400--1410},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2017}
}
@inproceedings{DBLP:conf/naacl/ZhangSKMCD19,
author = {Xuan Zhang and
Pamela Shapiro and
Gaurav Kumar and
Paul McNamee and
Marine Carpuat and
Kevin Duh},
title = {Curriculum Learning for Domain Adaptation in Neural Machine Translation},
pages = {1903--1915},
publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/naacl/BapnaF19,
author = {Ankur Bapna and
Orhan Firat},
title = {Non-Parametric Adaptation for Neural Machine Translation},
pages = {1921--1931},
publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2019}
}
%%%%% chapter 16------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
......