Commit 62069deb by 单韦乔

13.7部分文字修正以及参考文献

parent ceb9ef94
@@ -600,7 +600,7 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
\sectionnewpage
\section{学习策略}
\parinterval 在学习一个知识时,人们通常会遵循循序渐进、由易到难的原则,这是一种很自然的学习策略。然而当训练一个模型时,通常是将全部样本以随机的方式输入到模型中,换句话说,就是让模型平等地对待所有的训练样本。这和直觉是不符的,因为不同的样本应该存在价值高低之分,而价值与任务以及数据本身密切相关。围绕训练样本的价值差异,有诸如数据选择、主动学习、课程学习等一系列关于学习策略的讨论,这些学习策略本质上是研究如何在不同任务、不同背景、不同假设下高效地利用样本来进行学习,本节即对这些相关技术进行介绍。
%----------------------------------------------------------------------------------------
% NEW SUB-SECTION
@@ -608,7 +608,11 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
\subsection{数据选择}
\parinterval 训练模型的过程本质上就是模型在学习训练数据的分布,我们期望模型学到的分布和真实数据的分布能够越接近越好。然而训练数据是我们从真实世界中采样得来的,可能与真实世界的数据分布不一致,这导致训练数据存在偏差。这种分布的不匹配有许多不同的表现形式,比如类别不平衡、存在领域差异、存在标签噪声等,这导致模型在实践中表现不佳。
\parinterval 类别不平衡在分类任务中特别常见,可以采用重采样、代价敏感等方式来解决。数据选择则是缓解后两个问题的一种有效手段,它的策略是不让模型学习所有的样本,而是静态或动态地选择有价值的样本让模型学习。此外,在一些稀缺资源场景下,还存在标注数据稀少的问题。此时,可以利用主动学习选择那些最有价值的样本,并对这一部分样本进行人工标注,从而降低数据标注的成本。
\parinterval 在这里,定义价值本质上是在定义评分函数,这是数据选择的核心问题。价值在不同任务背景下有不同的含义,这与任务的特性和它的基本假设有关。比如,在领域数据选择中,价值表示样本与领域的相关性;在数据降噪中,价值表示样本的可信度;在主动学习中,价值表示样本的困难程度。下面对它们进行介绍。
%----------------------------------------------------------------------------------------
% NEW SUBSUB-SECTION
@@ -616,39 +620,41 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
\subsubsection{1. 领域适应中的数据选择}
\parinterval 由于机器翻译模型是使用平行语料训练的,因此,语料的质量、数量以及领域对翻译效果都有很大的影响。有研究工作表明,无论是使用统计机器翻译还是神经机器翻译技术,翻译模型对于训练语料的领域都很敏感\upcite{DBLP:journals/mt/EetemadiLTR15,britz2017effective}。这是因为每个领域都有自己独特的属性,比如语言风格、句子结构、专业术语等。例如“bank”这个英文单词,在金融领域通常被翻译为“银行”,而在计算机领域,一般被解释为“库”、“存储体”等。使用通用领域数据训练出来的模型,在特定领域上的翻译效果往往不理想,这本质上是数据分布不同导致的。面对这种问题,一种解决办法是用特定领域的数据来训练模型。然而,特定领域数据往往比较稀缺,直接使用这种数据训练容易造成模型欠拟合。
\parinterval 那么,一种很自然的想法是:能不能利用通用领域数据来帮助数据稀少的领域呢?这个研究方向被称为机器翻译的领域适应{\red (16.5领域适应)},其中资源丰富的领域被称为{\small\bfnew{源领域}}\index{源领域}(Source Domain)\index{Source Domain},资源稀缺的领域被称为{\small\bfnew{目标领域}}\index{目标领域}(Target Domain)\index{Target Domain}。领域适应主要有基于模型和基于数据两类方法,基于数据的方法主要关注如何充分有效地利用训练样本,数据选择就是其中一种简单有效的方法。它的学习策略是:在训练过程中,动态或静态地从源领域语料中选取与目标领域数据相关的部分数据,并把这些数据用于模型的训练。这样做的好处在于:
\begin{itemize}
\vspace{0.5em}
\item 在机器翻译系统中,需要根据数据规模设置合理的模型大小,模型的大小往往与数据规模呈正相关。选择一部分数据而不是使用全部数据,可以使得模型更小,并且训练和运行成本更低,这在一些受限的环境中是一大优势。
\vspace{0.5em}
\item 在任何大型语料库中,都可能包含许多与领域无关的数据,如果直接混合多个领域的数据进行训练,可能会损害模型的性能。因此,选择与特定领域相关的数据可以让模型的表现更好。
\vspace{0.5em}
\end{itemize}
\parinterval 领域数据选择所要解决的核心问题是:当给定一个目标领域数据集时,对于源领域中的任意一个句子对,如何衡量该句子对和目标领域的相关性?目前,相关工作可以分为以下几类:
\begin{itemize}
\vspace{0.5em}
\item 基于语言模型{\small\bfnew{交叉熵差}}\index{交叉熵差}(Cross-entropy difference\index{Cross-entropy difference},CED)\upcite{DBLP:conf/emnlp/AxelrodHG11,DBLP:conf/wmt/AxelrodRHO15,DBLP:conf/emnlp/WangULCS17,DBLP:conf/iwslt/MansourWN11}。该方法是在目标领域数据和通用数据上分别训练语言模型,然后用两个语言模型分别给句子打分,并求出打分的差。分数越低,说明句子与目标领域越相关。
\vspace{0.5em}
\item 基于文本分类\upcite{DBLP:conf/conll/ChenH16,chen2016bilingual,DBLP:conf/aclnmt/ChenCFL17,DBLP:conf/wmt/DumaM17}。该方法将原始问题转化为文本分类问题:首先用领域数据训练一个分类器,之后利用该分类器对给定的句子进行领域分类,最后使用输出的概率进行打分。
\vspace{0.5em}
\item 基于{\small\bfnew{特征衰减算法}}\index{特征衰减算法}(Feature Decay Algorithms\index{Feature Decay Algorithms},FDA)\upcite{DBLP:conf/wmt/BiciciY11,poncelas2018feature,DBLP:conf/acl/SotoSPW20,DBLP:journals/corr/abs-1811-03039}。该算法基于特征匹配,试图从源领域中选取一个句子集合,使得目标领域的语言特征(如$n$-gram)被这些句子尽可能多地覆盖;一个特征每被覆盖一次,其权重就会衰减,从而避免重复选择包含相同特征的句子。
\vspace{0.5em}
\end{itemize}
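上述交叉熵差的计算过程可以用如下示意代码来说明。其中的unigram语言模型、玩具语料以及待选句子均为假设的示例,实际系统中应使用在大规模语料上训练的n-gram或神经语言模型:

```python
import math
from collections import Counter

def train_unigram(corpus):
    """在语料上训练一个加一平滑的unigram语言模型(仅作示意)。"""
    counts = Counter(w for sent in corpus for w in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1为未登录词预留
    return lambda w: (counts[w] + 1) / (total + vocab)

def ced_score(sentence, p_in, p_gen):
    """交叉熵差:H_in(s) - H_gen(s),分数越低说明句子与目标领域越相关。"""
    words = sentence.split()
    h_in = -sum(math.log(p_in(w)) for w in words) / len(words)
    h_gen = -sum(math.log(p_gen(w)) for w in words) / len(words)
    return h_in - h_gen

# 假设的目标领域(金融)语料与通用语料
in_domain = ["the bank approved the loan", "interest rates rose"]
general = ["the cat sat on the mat", "dogs run in the park"]
p_in, p_gen = train_unigram(in_domain), train_unigram(general)

# 对源领域句子按CED升序排序,排在前面的句子与目标领域更相关
pool = ["the bank raised interest rates", "the cat chased the dogs"]
ranked = sorted(pool, key=lambda s: ced_score(s, p_in, p_gen))
print(ranked[0])
```

静态的数据选择即按该分数排序后取前若干句并入训练集;若把打分模型换成领域分类器的输出概率,则对应上面第二类方法。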
\parinterval 尽管这些方法有所不同,但是它们的目的都是为了衡量样本和领域的相关性,这些评价指标最终服务于训练过程中的样本学习策略。样本学习策略主要分为静态和动态两种。早期的研究工作大多关注于设计评分函数,在学习策略上普遍采用静态的方法,即首先利用评分函数对源领域的数据进行打分和排序,然后选取一定数量的数据合并到目标领域数据集中,与目标领域数据一起训练模型\upcite{DBLP:conf/emnlp/AxelrodHG11,DBLP:conf/wmt/AxelrodRHO15,chen2016bilingual,DBLP:conf/wmt/BiciciY11,DBLP:conf/conll/ChenH16}。这个过程其实是扩大了目标领域的数据规模,模型的收益主要来自于数据量的增加。但是随着实践的深入,人们发现静态的方法存在两方面的缺陷:
\begin{itemize}
\vspace{0.5em}
\item 与在完整的源领域数据池上训练相比,在选定的子集上进行训练会导致词表覆盖率降低,并加剧单词长尾分布问题。这些问题会对翻译系统的性能产生显著影响\upcite{DBLP:conf/wmt/AxelrodRHO15,DBLP:conf/emnlp/WeesBM17}。
\vspace{0.5em}
\item 静态的方法可以看作一种数据过滤技术,它对数据的判定方式是“非黑即白”的,即接收或拒绝。一方面,这种“硬”判定会直接受到评分函数好坏的影响;另一方面,被拒绝的数据可能仍然有助于训练模型,而且在训练过程中,这些数据的有用性可能会改变\upcite{DBLP:conf/wmt/WangWHNC18}。
\vspace{0.5em}
\end{itemize}
\parinterval 为了解决这些问题,研究人员提出了动态的学习策略。这里的动态主要体现在:在模型的训练过程中,使用某种策略动态地组织源领域和目标领域数据。这种方法不直接抛弃相关性低的句子,而是让模型更关注相关性高的句子。在实现上主要有两种方法:一种是将句子的领域相似性表达成概率分布,然后在训练过程中根据该分布对数据进行动态采样\upcite{DBLP:conf/emnlp/WeesBM17,DBLP:conf/acl/WangUS18};另一种是在计算损失函数时,根据句子的领域相似性对每个样本的损失进行加权\upcite{DBLP:conf/emnlp/WangULCS17,DBLP:conf/aclnmt/ChenCFL17}。相比于静态方法“非黑即白”的二元选择,动态方法是一种更“软”的选择方式,它使得模型有机会使用到那些在静态方法中会被直接拒绝的数据,提高了训练数据的多样性,因此性能也更理想。
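上面两种动态组织数据的方式可以用如下极简示意来表达,其中的相似度分数、温度系数和批次大小均为假设的示例值:

```python
import math
import random

def sampling_distribution(scores, temperature=1.0):
    """将领域相似度分数通过softmax转化为采样概率:分数越高,被采样的概率越大。"""
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def weighted_loss(losses, weights):
    """按领域相似度对每个句子的损失加权平均(示意)。"""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)

random.seed(0)
scores = [2.0, 0.5, -1.0]              # 三个源领域句子与目标领域的相似度(假设值)
probs = sampling_distribution(scores)  # 动态采样使用的概率分布
batch = random.choices([0, 1, 2], weights=probs, k=4)  # 按分布采样一个批次的句子下标
print(probs, batch)
```

两种方式都不会把低相关的句子彻底拒之门外,只是降低其被采样的概率或在损失中的权重。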
%----------------------------------------------------------------------------------------
% NEW SUBSUB-SECTION
@@ -656,11 +662,11 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
\subsubsection{2. 数据降噪}
\parinterval 除了领域差异,训练数据偏差的另外一种常见表现形式是标签噪声。由于机器翻译的训练数据大多是从网页上爬取的,这不可避免地会引入噪声,比如句子未对齐、多种语言单词混合、单词丢失等。相关研究表明,神经机器翻译对于噪声数据很敏感,当噪声过多时就会使得模型的性能显著下降\upcite{DBLP:conf/aclnmt/KhayrallahK18}。因此,无论是从模型鲁棒性还是从训练效率出发,数据降噪都是很有意义的。事实上,数据降噪从统计机器翻译时代就已经有许多相关工作\upcite{DBLP:conf/coling/FormigaF12,DBLP:conf/acl/CuiZLLZ13,DBLP:phd/dnb/Mediani17},2018年WMT也开放了关于平行语料过滤的任务,这说明数据降噪工作正在逐步引起人们的注意。
\parinterval 由于含有噪声的翻译数据通常都具有较为明显的特征,因此可以用一些启发式的特征来进行综合评分\upcite{rarrick2011mt,taghipour2011parallel,Xu2017ZipporahAF},例如:句子长度比、词对齐率、最长连续未对齐序列长度等;也可以将该问题转化为文本分类或跨语言文本蕴含任务来筛选可信样本\upcite{DBLP:conf/aclnmt/CarpuatVN17,DBLP:conf/naacl/VyasNC18};此外,从某种意义上来说,数据降噪其实也可以看作一种数据选择,因为它的目标是选择可信度高的样本。因此,可以人工构建一个可信度高的小型数据集,然后利用该数据集和通用数据集之间的差异性对样本进行选择\upcite{DBLP:conf/wmt/WangWHNC18}。
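以句子长度比为例,这类启发式打分的思路可以写成如下示意代码。其中的阈值和打分方式均为假设,实际系统会把词对齐率、最长连续未对齐序列长度等多种特征综合起来评分:

```python
def noise_score(src, tgt, max_ratio=2.0):
    """基于句长比的启发式可信度打分(示意):双语句对的长度比越接近1越可信,
    超过阈值max_ratio则视为疑似未对齐的噪声,直接给0分。"""
    ls, lt = len(src.split()), len(tgt.split())
    ratio = max(ls, lt) / min(ls, lt)
    return 1.0 / ratio if ratio <= max_ratio else 0.0

print(noise_score("a b c d", "w x y z"))      # 长度完全匹配,可信度最高
print(noise_score("a b c d e f g h i", "w"))  # 长度比过大,疑似未对齐,得0分
```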
\parinterval 早期的工作大多关注于数据过滤的方法,对于模型在噪声数据上的鲁棒性训练以及噪声样本的利用则探讨较少。事实上,噪声是有强度的,有些噪声数据对于模型可能是有价值的,而且它们的价值可能会随着模型的状态而改变\upcite{DBLP:conf/wmt/WangWHNC18}。一个例子如图\ref{fig:13-51}所示{\red (画图的时候zh-gloss那行不要了,zh翻译为中文)}:
%----------------------------------------------
\begin{figure}[htp]
@@ -671,7 +677,7 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
\end{figure}
%-------------------------------------------
\parinterval 图中的中文句子缺少了一部分翻译,但这两个句子都很流利。简单的基于长度或双语词典的方法可以很容易地将其过滤掉,但直观地看,这条训练数据对于训练NMT模型仍然有用,特别是在数据稀缺的情况下,因为中文句子和英文句子的前半部分仍然是互为翻译的。这表明了噪声数据的微妙之处:它不是一个简单的二元分类问题,一些训练样本可能部分有用,而它们的有用性也可能随着训练的进展而改变。因此,简单的过滤并不是一种很好的办法,一种合理的学习策略应该既可以合理地利用这些数据,又不让其对模型产生负面影响。直觉上,这是一个动态的过程:当模型能力较弱时(比如在训练初期),这些数据就能对模型起到正面作用,反之亦然。受课程学习(Curriculum Learning,更详细内容见下节)、微调(Fine-tuning)等启发,研究学者们也提出了类似的学习策略,它的主要思想是:在训练过程中对批量数据的噪声水平进行退火(Anneal),使得模型在越来越干净的批量数据上进行训练\upcite{DBLP:conf/wmt/WangWHNC18,DBLP:conf/acl/WangCC19}。从宏观上看,整个训练过程其实是一个持续微调的过程,这和微调的思想基本一致。这种学习策略一方面充分利用了训练数据,一方面又避免了噪声数据对模型的负面影响,因此取得了不错的效果。
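这种对批量数据噪声水平进行退火的调度,可以用如下示意代码表达。其中样本的可信度分数以及保留比例的线性退火方式均为假设:

```python
def noise_anneal_schedule(scored_pairs, epoch, total_epochs, min_keep=0.5):
    """噪声退火的极简示意:样本按可信度降序排列,随训练推进,
    每个epoch保留的比例从1.0线性退火到min_keep,使批次越来越“干净”。"""
    ranked = sorted(scored_pairs, key=lambda x: -x[1])
    keep = 1.0 - (1.0 - min_keep) * epoch / max(total_epochs - 1, 1)
    n = max(1, int(len(ranked) * keep))
    return [pair for pair, _ in ranked[:n]]

data = [("s1", 0.9), ("s2", 0.7), ("s3", 0.4), ("s4", 0.1)]  # (句对, 可信度)
print(noise_anneal_schedule(data, epoch=0, total_epochs=4))  # 初期:使用全部数据
print(noise_anneal_schedule(data, epoch=3, total_epochs=4))  # 末期:只保留最可信的一半
```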
%----------------------------------------------------------------------------------------
% NEW SUBSUB-SECTION
@@ -696,17 +702,17 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
\vspace{0.5em}
\item 随机采样策略(Random Sampling,RS)。顾名思义,随机采样就是不需要跟模型的预测结果做任何交互,直接从未标注样本池中随机筛选出一批样本交给专家标注,常作为主动学习算法中最基础的对比实验。
\vspace{0.5em}
\item 不确定性采样的查询(Uncertainty Sampling)\upcite{DBLP:conf/emnlp/SettlesC08,campbell2000query,DBLP:conf/icml/SchohnC00}。这类方法选择那些当前基准分类器最不能确定其分类的样本。不确定性通常可以用信度最低(Least Confident)、边缘采样(Margin Sampling)、熵(Entropy)等方法来描述。从几何角度看,这种方法优先选择靠近分类边界的样例;
\vspace{0.5em}
\item 基于委员会的查询(Query-By-Committee)\upcite{DBLP:conf/colt/SeungOS92,mitchell1996m,DBLP:conf/icml/AbeM98,mccallumzy1998employing}。这类方法选择那些训练后能够最大程度缩减版本空间的样本,可以采用Bagging、AdaBoost等分类器集成算法从版本空间中产生委员会,然后选择委员会中的假设预测分歧最大的样本;
\vspace{0.5em}
\item 其它经典策略:梯度长度期望(Expected Gradient Length,EGL)策略,根据未标注样本对当前模型的影响程度,优先筛选出对模型影响最大的样本\upcite{DBLP:conf/cvpr/DalalT05,726791};方差最小(Variance Reduction,VR)策略,选择那些使方差减少最多的样本\upcite{atkinson2007optimum,DBLP:journals/jmlr/JiH12};此外还有结合生成对抗网络的方法\upcite{DBLP:journals/corr/ZhuB17,DBLP:conf/iccv/HuijserG17,DBLP:conf/wacv/0007T20}等。
\vspace{0.5em}
\end{itemize}
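以不确定性采样为例,三种常见的不确定性度量可以写成如下示意代码,其中未标注池中样本的预测概率为假设值:

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_by_uncertainty(pool_probs, k=1, method="entropy"):
    """不确定性采样的极简示意:从未标注池中选出模型最不确定的k个样本。
    least_confident:最高类别概率最低;margin:前两名概率之差最小;entropy:熵最大。"""
    def score(p):
        s = sorted(p, reverse=True)
        if method == "least_confident":
            return -s[0]
        if method == "margin":
            return -(s[0] - s[1])
        return entropy(p)
    ranked = sorted(range(len(pool_probs)), key=lambda i: -score(pool_probs[i]))
    return ranked[:k]

pool = [[0.9, 0.05, 0.05],   # 模型很确定
        [0.4, 0.35, 0.25],   # 模型很不确定
        [0.7, 0.2, 0.1]]
print(select_by_uncertainty(pool, k=1))  # 选出最不确定的样本,交给专家标注
```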
\parinterval 具体方法细节可以查阅相关论文。查询策略是主动学习框架中的核心,大量研究都围绕采样策略和学习策略展开,在实际应用中需要根据任务情况来决定使用哪种策略。
\parinterval 主动学习非常适合于专业领域的任务,因为专业领域的标注成本往往比较昂贵,比如医学、金融、法律等。事实上,主动学习在神经机器翻译中的应用并不是很多,这主要是因为主动学习仅是把那些有价值的单语数据选出来,然后交给人工标注,这个过程需要人工的参与,然而神经机器翻译本身就有许多利用单语数据的方法,比如:在目标端结合语言模型\upcite{DBLP:conf/acl/JeanCMB15,2015OnGulcehre};利用反向翻译(Back Translation)\upcite{Sennrich2016ImprovingNM,DBLP:conf/aaai/Zhang0LZC18,hoang2018iterative};利用多语言或迁移学习\upcite{DBLP:conf/mtsummit/ImankulovaDFI19,DBLP:conf/emnlp/CurreyH19,DBLP:conf/emnlp/KimPPKN19};无监督机器翻译\upcite{DBLP:conf/iclr/LampleCDR18,DBLP:conf/iclr/ArtetxeLAC18}等。但是在一些特定的场景下,主动学习仍然会发挥重要作用,比如:在低资源或专业领域的神经机器翻译中,主动学习可以大大减少人工标注成本\upcite{DBLP:conf/conll/LiuBH18,DBLP:conf/emnlp/ZhaoZZZ20};在交互式或增量式机器翻译中,主动学习可以让模型持续从外界反馈中受益\upcite{Peris2018ActiveLF,DBLP:journals/pbml/TurchiNFF17,DBLP:journals/csl/PerisC19}。
%----------------------------------------------------------------------------------------
% NEW SUB-SECTION
@@ -724,7 +730,7 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
\vspace{0.5em}
\end{itemize}
\parinterval 这是符合直觉的。可以想象,对于一个数学零基础的人来说,如果一开始就同时学习加减乘除和高等数学,效率自然是比较低下的;而如果按照正常的学习顺序,比如先学习加减乘除,然后学习各种函数,最后再学习高等数学,有了前面的基础,再学习后面的知识,效率就可以更高。事实上,课程学习自提出以来就受到了研究人员的极大关注,除了想法本身有趣之外,还因为它作为一种和模型无关的训练策略,具有即插即用(Plug-and-Play)的特点,可以被广泛应用于各种计算密集型的领域中,以提高效率,比如计算机视觉(Computer Vision,CV)\upcite{DBLP:conf/eccv/GuoHZZDSH18,DBLP:conf/mm/JiangMMH14}、自然语言处理(Natural Language Processing,NLP)\upcite{DBLP:conf/naacl/PlataniosSNPM19,DBLP:conf/acl/TayWLFPYRHZ19}和神经网络结构搜索(Neural Architecture Search,NAS)\upcite{DBLP:conf/icml/GuoCZZ0HT20}等。神经机器翻译就是自然语言处理中一个很契合课程学习的任务,这是因为神经机器翻译往往需要大规模的平行语料来训练模型,训练成本很高,所以使用课程学习来加快收敛是一个很自然的想法。
\parinterval 那么如何针对一个具体任务设计一个课程学习呢?相比于正常的以随机方式呈现训练数据的方法,课程学习的目标就是按照样本难易程度以某种策略调度给模型学习,因此课程学习主要解决两个核心问题:
@@ -749,7 +755,7 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
\parinterval 首先,难度评估器对训练样本按照由易到难的顺序进行排序,最开始调度器从相对容易的数据块中采样批量的训练数据,发送给模型进行训练,随着训练时间的推移,训练调度器将逐渐从更加困难的数据块中进行采样(至于何时,以及何种采样方式则取决于设定的策略),持续这个过程,直到从整个训练集进行均匀采样。
\parinterval 评估样本的难度和具体的任务相关,在神经机器翻译中,有很多种评估方法,可以利用语言学上的困难准则,比如句子长度、句子平均词频、句子语法解析树深度等\upcite{DBLP:conf/naacl/PlataniosSNPM19,DBLP:conf/ranlp/KocmiB17}。这些准则本质上属于人类的先验知识,符合人类的直觉,但不一定和模型相匹配,对人类来说简单的句子对模型来说并不总是容易的,所以研究学者们也提出了模型自动评估的方法,比如:利用语言模型\upcite{DBLP:conf/acl/WangCC19,DBLP:conf/naacl/ZhangSKMCD19},利用神经机器翻译模型\upcite{zhang2018empirical,DBLP:conf/coling/XuHJFWHJXZ20}等。值得注意的是,利用神经机器翻译模型来打分的方法分为静态和动态两种:静态的方法是利用在小数据集上训练的、更小的NMT模型来打分\upcite{zhang2018empirical};动态的方法则是利用当前模型的状态来打分,这在广义上也叫作自步学习(Self-Paced Learning),具体可以利用比如模型的训练误差或变化率等\upcite{DBLP:conf/coling/XuHJFWHJXZ20}。
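以句子长度和词频两个语言学准则为例,难度评估器可以写成如下示意代码,其中语料和插值系数均为假设的示例:

```python
from collections import Counter

def difficulty_score(sentence, freq, alpha=0.5):
    """语言学难度准则的极简示意:综合句子长度与词的稀有程度。
    alpha为两项的插值系数;实际准则还可包括句法解析树深度等。"""
    words = sentence.split()
    rarity = sum(1.0 / freq[w] for w in words) / len(words)  # 词频越低越“难”
    return alpha * len(words) + (1 - alpha) * rarity

corpus = ["the cat sat", "the dog sat on the mat", "the cat saw the dog"]
freq = Counter(w for s in corpus for w in s.split())
# 按由易到难排序,供后续的训练调度器分块使用
print(sorted(corpus, key=lambda s: difficulty_score(s, freq)))
```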
\parinterval 虽然样本的难度度量在不同的数据类型和任务中有所不同,但第二个问题,即课程规划,通常与数据和任务无关。换句话说,在各种场景中,大多数课程学习都利用了类似的调度策略。具体而言,调度策略可以分为预定义的和自动的两种。预定义的方法通常是将按照难易程度排序好的样本划分为块,每个块中包含一定数量的难度相似的样本,如图\ref{fig:13-54}所示:
@@ -775,9 +781,9 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
\parinterval 图中每行是一个训练阶段,类似于正常训练的epoch,只不过当前的可用数据是整个数据集的子集。类似的还有一些其他变体,比如训练到模型可见整个数据集之后,将最难的样本块复制添加到训练集中,或者是将最容易的数据块逐渐删除,然后再添加回来等,这些方法的基本想法都是想让模型在具备一定的能力之后更多关注于困难样本。
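图中这种逐块扩大可见数据的预定义调度,可以用如下示意代码表达,其中数据块的划分为假设的示例:

```python
def baby_steps(blocks, epochs_per_stage=1):
    """预定义课程调度的极简示意:样本已按难度分块,
    每个阶段将下一个更难的块并入可见训练集,直到模型可见全部数据。"""
    schedule, visible = [], []
    for block in blocks:
        visible = visible + block  # 逐块扩大可见数据
        schedule.extend([list(visible)] * epochs_per_stage)
    return schedule

blocks = [["easy1", "easy2"], ["medium1"], ["hard1"]]  # 由易到难的数据块
for stage, data in enumerate(baby_steps(blocks)):
    print(stage, data)  # 阶段0只含容易样本,最后一个阶段可见全部数据
```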
\parinterval 尽管预定义的方法简单有效,但它存在的一个最大限制是:预定义的难度评估器和训练规划在训练过程中都是固定的,不够灵活,这可能会导致数据块的划分不合理,而且在一定程度上也忽略了当前模型的反馈。因此,研究人员也提出了自动的方法,这种方法会根据模型的反馈来动态调整样本的难度或调度策略,模型的反馈可以是模型的不确定性\upcite{DBLP:conf/acl/ZhouYWWC20}、模型的能力\upcite{DBLP:conf/naacl/PlataniosSNPM19,DBLP:conf/coling/XuHJFWHJXZ20}等,然后将模型的反馈和训练的轮次或者是数据的采样相挂钩,从而达到控制的目的。根据这种思想,还有直接利用强化学习的方法\upcite{DBLP:conf/aaai/ZhaoWNW20}。这些方法在一定程度上使得整个训练过程和模型的状态相匹配,使得样本的选择过渡得更加平滑,因此在实践中取得了不错的效果。
\parinterval 从广义上说,大多数课程学习方法都遵循由易到难的原则,然而在实践过程中人们逐渐赋予了课程学习更多的内涵,课程学习的含义早已超越了最原始的定义。一方面,课程学习可以与许多任务相结合,此时,评估准则并不一定总是样本的困难度,这取决于具体的任务。比如在多任务学习(Multi-task Learning)中\upcite{DBLP:conf/cvpr/PentinaSL15,DBLP:conf/iccvw/SarafianosGNK17},评估准则指的是任务的难易程度或相关性;在领域适应任务中\upcite{DBLP:conf/naacl/ZhangSKMCD19},指的是数据与领域的相似性;在噪声数据场景中,指的是样本的可信度\upcite{DBLP:conf/acl/WangCC19}。另一方面,在一些任务或数据中,由易到难并不总是有效,有时困难优先反而会取得更好的效果\upcite{zhang2018empirical}{\red Curriculum learning with deep convolutional neural networks}。实际上这和直觉不太相符,一种合理的解释是:课程学习更适合标签噪声、离群值较多或者目标任务困难的场景,能提高模型的鲁棒性和收敛速度;而困难优先则更适合数据集干净的场景,能使随机梯度下降(Stochastic Gradient Descent,SGD)更快更稳定\upcite{DBLP:conf/nips/ChangLM17}。课程学习不断丰富的内涵使得它有了越来越广泛的应用。
%----------------------------------------------------------------------------------------
% NEW SUB-SECTION
@@ -791,15 +797,15 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
\begin{itemize}
\vspace{0.5em}
\item 基于正则化的方法,通过对神经网络权重的更新施加约束来减轻灾难性遗忘,通常是在损失函数中引入一个额外的正则化项,使得模型在学习新数据时巩固先前的知识\upcite{DBLP:journals/pami/LiH18a}{\red Elastic Weight Consolidation}。
\vspace{0.5em}
\item 基于示例的方法,以原始格式存储样本,或使用生成模型生成伪样本,在学习新任务的同时重放先前的任务样本以减轻遗忘。(iCaRL: Incremental Classifier and Representation Learning;End-to-End Incremental Learning)
\vspace{0.5em}
\item 基于动态模型架构的方法,通过动态调整网络结构来响应新信息,例如增加神经元或网络层进行重新训练,或者是在新任务训练时只更新部分参数\upcite{rusu2016progressive,DBLP:journals/corr/FernandoBBZHRPW17}。
\vspace{0.5em}
\end{itemize}
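以第一类基于正则化的方法(如EWC)为例,其损失函数的形式可以用如下示意代码表达。其中参数、旧任务参数和参数重要性(Fisher信息)均为假设的示例值,真实实现中Fisher信息需要在旧任务数据上由梯度统计估计得到:

```python
def ewc_loss(task_loss, params, old_params, fisher, lam=0.1):
    """基于正则化的持续学习的极简示意:在新任务损失上加入二次惩罚项,
    约束对旧任务重要(fisher值大)的参数不要偏离旧任务学到的取值。"""
    penalty = sum(f * (p - p0) ** 2
                  for p, p0, f in zip(params, old_params, fisher))
    return task_loss + lam / 2 * penalty

old = [1.0, -0.5]     # 旧任务收敛后的参数
fisher = [10.0, 0.1]  # 第一个参数对旧任务更重要
print(ewc_loss(0.3, [1.2, 0.5], old, fisher))  # 偏离重要参数会受到更大的惩罚
```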
\parinterval 从某种程度上看,机器翻译中的多领域、多语言等都属于持续学习的场景,灾难性遗忘也是这些任务面临的主要问题之一。在多领域神经机器翻译中,我们期望模型既具有通用领域的性能,同时在特定领域也表现良好,然而事实上,由于灾难性遗忘问题的存在,适应特定领域往往是以牺牲通用领域的性能为代价的\upcite{DBLP:conf/naacl/ThompsonGKDK19,DBLP:conf/coling/GuF20},现有的解决方法大多都可以归到以上三类,具体内容可以参考{\red 16.5 领域适应}。在多语言神经机器翻译中,最理想的情况是一个模型就能够实现多个语言之间的互译,然而由于数据分布的极大不同,实际情况往往是:多语言模型能够提高低资源语言对互译的性能,但同时也会降低高资源语言对的性能。因此,如何让模型从多语言训练数据中持续受益就是一个关键的问题,16.3 多语言翻译模型作了详细介绍。此外,在增量式模型优化场景中也会存在灾难性遗忘问题,相关内容可参考{\red 18.2 增量式模型优化}。
%----------------------------------------------------------------------------------------
% NEW SECTION
......
@@ -6238,6 +6238,794 @@ author = {Yoshua Bengio and
year={2020}
}
@article{DBLP:journals/mt/EetemadiLTR15,
author = {Sauleh Eetemadi and
William Lewis and
Kristina Toutanova and
Hayder Radha},
title = {Survey of data-selection methods in statistical machine translation},
  journal = {Machine Translation},
volume = {29},
number = {3-4},
pages = {189--223},
year = {2015}
}
@inproceedings{britz2017effective,
title={Effective domain mixing for neural machine translation},
author={Britz, Denny and Le, Quoc and Pryzant, Reid},
publisher={Proceedings of the Second Conference on Machine Translation},
pages={118--126},
year={2017}
}
@inproceedings{DBLP:conf/emnlp/AxelrodHG11,
author = {Amittai Axelrod and
Xiaodong He and
Jianfeng Gao},
title = {Domain Adaptation via Pseudo In-Domain Data Selection},
pages = {355--362},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2011}
}
@inproceedings{DBLP:conf/wmt/AxelrodRHO15,
author = {Amittai Axelrod and
Philip Resnik and
Xiaodong He and
Mari Ostendorf},
title = {Data Selection With Fewer Words},
pages = {58--65},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2015}
}
@inproceedings{DBLP:conf/emnlp/WangULCS17,
author = {Rui Wang and
Masao Utiyama and
Lemao Liu and
Kehai Chen and
Eiichiro Sumita},
title = {Instance Weighting for Neural Machine Translation Domain Adaptation},
pages = {1482--1488},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2017}
}
@inproceedings{DBLP:conf/iwslt/MansourWN11,
author = {Saab Mansour and
Joern Wuebker and
Hermann Ney},
title = {Combining translation and language model scoring for domain-specific
data filtering},
pages = {222--229},
publisher = {International Workshop on Spoken Language Translation},
year = {2011}
}
@inproceedings{DBLP:conf/conll/ChenH16,
author = {Boxing Chen and
Fei Huang},
title = {Semi-supervised Convolutional Networks for Translation Adaptation
with Tiny Amount of In-domain Data},
pages = {314--323},
publisher = {The SIGNLL Conference on Computational Natural Language Learning},
year = {2016}
}
@inproceedings{chen2016bilingual,
title={Bilingual methods for adaptive training data selection for machine translation},
author={Chen, Boxing and Kuhn, Roland and Foster, George and Cherry, Colin and Huang, Fei},
publisher={Association for Machine Translation in the Americas},
pages={93--103},
year={2016}
}
@inproceedings{DBLP:conf/aclnmt/ChenCFL17,
author = {Boxing Chen and
Colin Cherry and
George F. Foster and
Samuel Larkin},
title = {Cost Weighting for Neural Machine Translation Domain Adaptation},
pages = {40--46},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:conf/wmt/DumaM17,
author = {Mirela{-}Stefania Duma and
Wolfgang Menzel},
title = {Automatic Threshold Detection for Data Selection in Machine Translation},
pages = {483--488},
publisher = {Proceedings of the Second Conference on Machine Translation},
year = {2017}
}
@inproceedings{DBLP:conf/wmt/BiciciY11,
author = {Ergun Bi{\c{c}}ici and
Deniz Yuret},
title = {Instance Selection for Machine Translation using Feature Decay Algorithms},
pages = {272--283},
publisher = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
year = {2011}
}
@inproceedings{poncelas2018feature,
title={Feature decay algorithms for neural machine translation},
author={Poncelas, Alberto and Maillette de Buy Wenniger, Gideon and Way, Andy},
year={2018},
publisher={European Association for Machine Translation}
}
@inproceedings{DBLP:conf/acl/SotoSPW20,
author = {Xabier Soto and
Dimitar Sht. Shterionov and
Alberto Poncelas and
Andy Way},
title = {Selecting Backtranslated Data from Multiple Sources for Improved Neural
Machine Translation},
pages = {3898--3908},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
@inproceedings{DBLP:journals/corr/abs-1811-03039,
author = {Alberto Poncelas and
Gideon Maillette de Buy Wenniger and
Andy Way},
title = {Data Selection with Feature Decay Algorithms Using an Approximated
Target Side},
publisher = {CoRR},
volume = {abs/1811.03039},
year = {2018}
}
@inproceedings{DBLP:conf/emnlp/WeesBM17,
author = {Marlies van der Wees and
Arianna Bisazza and
Christof Monz},
title = {Dynamic Data Selection for Neural Machine Translation},
pages = {1400--1410},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2017}
}
@inproceedings{DBLP:conf/wmt/WangWHNC18,
author = {Wei Wang and
Taro Watanabe and
Macduff Hughes and
Tetsuji Nakagawa and
Ciprian Chelba},
title = {Denoising Neural Machine Translation Training with Trusted Data and
Online Data Selection},
pages = {133--143},
publisher = {Proceedings of the Third Conference on Machine Translation},
year = {2018}
}
@inproceedings{DBLP:conf/acl/WangUS18,
author = {Rui Wang and
Masao Utiyama and
Eiichiro Sumita},
title = {Dynamic Sentence Sampling for Efficient Training of Neural Machine
Translation},
pages = {298--304},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:conf/aclnmt/KhayrallahK18,
author = {Huda Khayrallah and
Philipp Koehn},
title = {On the Impact of Various Types of Noise on Neural Machine Translation},
pages = {74--83},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:conf/coling/FormigaF12,
author = {Llu{\'{\i}}s Formiga and
Jos{\'{e}} A. R. Fonollosa},
title = {Dealing with Input Noise in Statistical Machine Translation},
pages = {319--328},
publisher = {International Conference on Computational Linguistics},
year = {2012}
}
@inproceedings{DBLP:conf/acl/CuiZLLZ13,
author = {Lei Cui and
Dongdong Zhang and
Shujie Liu and
Mu Li and
Ming Zhou},
title = {Bilingual Data Cleaning for {SMT} using Graph-based Random Walk},
pages = {340--345},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2013}
}
@phdthesis{DBLP:phd/dnb/Mediani17,
author = {Mohammed Mediani},
title = {Learning from Noisy Data in Statistical Machine Translation},
school = {Karlsruhe Institute of Technology, Germany},
year = {2017}
}
@inproceedings{rarrick2011mt,
title={MT detection in web-scraped parallel corpora},
author={Rarrick, Spencer and Quirk, Chris and Lewis, Will},
publisher={Machine Translation},
pages={422--430},
year={2011}
}
@inproceedings{taghipour2011parallel,
title={Parallel corpus refinement as an outlier detection algorithm},
author={Taghipour, Kaveh and Khadivi, Shahram and Xu, Jia},
publisher={Machine Translation},
pages={414--421},
year={2011}
}
@inproceedings{Xu2017ZipporahAF,
title={Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora},
author={Hainan Xu and Philipp Koehn},
publisher={Conference on Empirical Methods in Natural Language Processing},
year={2017}
}
@inproceedings{DBLP:conf/aclnmt/CarpuatVN17,
author = {Marine Carpuat and
Yogarshi Vyas and
Xing Niu},
title = {Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation},
pages = {69--79},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:conf/naacl/VyasNC18,
author = {Yogarshi Vyas and
Xing Niu and
Marine Carpuat},
title = {Identifying Semantic Divergences in Parallel Text without Annotations},
pages = {1503--1515},
publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:conf/acl/WangCC19,
author = {Wei Wang and
Isaac Caswell and
Ciprian Chelba},
title = {Dynamically Composing Domain-Data Selection with Clean-Data Selection
by ``Co-Curricular Learning'' for Neural Machine Translation},
pages = {1282--1292},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/emnlp/SettlesC08,
author = {Burr Settles and
Mark Craven},
title = {An Analysis of Active Learning Strategies for Sequence Labeling Tasks},
pages = {1070--1079},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2008}
}
@inproceedings{campbell2000query,
title={Query learning with large margin classifiers},
author={Campbell, Colin and Cristianini, Nello and Smola, Alex and others},
publisher={International Conference on Machine Learning},
year={2000}
}
@inproceedings{DBLP:conf/icml/SchohnC00,
author = {Greg Schohn and
David Cohn},
title = {Less is More: Active Learning with Support Vector Machines},
pages = {839--846},
publisher = {International Conference on Machine Learning},
year = {2000}
}
@inproceedings{DBLP:conf/colt/SeungOS92,
author = {H. Sebastian Seung and
Manfred Opper and
Haim Sompolinsky},
title = {Query by Committee},
pages = {287--294},
publisher = {Conference on Computational Learning Theory},
year = {1992}
}
@book{mitchell1996m,
title={Machine Learning},
author={Mitchell, Tom},
publisher={McGraw-Hill},
year={1996}
}
@inproceedings{DBLP:conf/icml/AbeM98,
author = {Naoki Abe and
Hiroshi Mamitsuka},
title = {Query Learning Strategies Using Boosting and Bagging},
pages = {1--9},
publisher = {International Conference on Machine Learning},
year = {1998}
}
@inproceedings{mccallumzy1998employing,
title={Employing EM and pool-based active learning for text classification},
author={McCallum, Andrew Kachites and Nigam, Kamal},
publisher={International Conference on Machine Learning},
pages={359--367},
year={1998}
}
@inproceedings{DBLP:conf/cvpr/DalalT05,
author = {Navneet Dalal and
Bill Triggs},
title = {Histograms of Oriented Gradients for Human Detection},
pages = {886--893},
publisher = {{IEEE} Conference on Computer Vision and Pattern Recognition},
year = {2005}
}
@inproceedings{726791,
author={Yann {LeCun} and L{\'e}on {Bottou} and Yoshua {Bengio} and Patrick {Haffner}},
publisher={Proceedings of the IEEE},
title={Gradient-based learning applied to document recognition},
year={1998},
volume={86},
number={11},
pages={2278--2324}
}
@book{atkinson2007optimum,
title={Optimum experimental designs, with SAS},
author={Atkinson, Anthony and Donev, Alexander and Tobias, Randall and others},
volume={34},
year={2007},
publisher={Oxford University Press}
}
@inproceedings{DBLP:journals/jmlr/JiH12,
author = {Ming Ji and
Jiawei Han},
title = {A Variance Minimization Criterion to Active Learning on Graphs},
series = {{JMLR} Proceedings},
volume = {22},
pages = {556--564},
publisher = {International Conference on Artificial Intelligence and Statistics},
year = {2012}
}
@article{DBLP:journals/corr/ZhuB17,
author = {Jia{-}Jie Zhu and
Jos{\'{e}} Bento},
title = {Generative Adversarial Active Learning},
journal = {CoRR},
volume = {abs/1702.07956},
year = {2017}
}
@inproceedings{DBLP:conf/iccv/HuijserG17,
author = {Miriam W. Huijser and
Jan C. van Gemert},
title = {Active Decision Boundary Annotation with Deep Generative Models},
pages = {5296--5305},
publisher = {{IEEE} International Conference on Computer Vision},
year = {2017}
}
@inproceedings{DBLP:conf/wacv/0007T20,
author = {Christoph Mayer and
Radu Timofte},
title = {Adversarial Sampling for Active Learning},
pages = {3060--3068},
publisher = {{IEEE} Winter Conference on Applications of Computer Vision},
year = {2020}
}
@inproceedings{DBLP:conf/acl/JeanCMB15,
author = {S{\'{e}}bastien Jean and
KyungHyun Cho and
Roland Memisevic and
Yoshua Bengio},
title = {On Using Very Large Target Vocabulary for Neural Machine Translation},
pages = {1--10},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2015}
}
@inproceedings{2015OnGulcehre,
title = {On Using Monolingual Corpora in Neural Machine Translation},
author = {Caglar Gulcehre and
Orhan Firat and
Kelvin Xu and
Kyunghyun Cho and
Loic Barrault and
Huei-Chi Lin and
Fethi Bougares and
Holger Schwenk and
Yoshua Bengio},
publisher = {arXiv preprint arXiv:1503.03535},
year = {2015},
}
@inproceedings{Sennrich2016ImprovingNM,
author = {Rico Sennrich and
Barry Haddow and
Alexandra Birch},
title = {Improving Neural Machine Translation Models with Monolingual Data},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{DBLP:conf/aaai/Zhang0LZC18,
author = {Zhirui Zhang and
Shujie Liu and
Mu Li and
Ming Zhou and
Enhong Chen},
title = {Joint Training for Neural Machine Translation Models with Monolingual
Data},
pages = {555--562},
publisher = {AAAI Conference on Artificial Intelligence},
year = {2018}
}
@inproceedings{hoang2018iterative,
title={Iterative back-translation for neural machine translation},
author={Hoang, Vu Cong Duy and Koehn, Philipp and Haffari, Gholamreza and Cohn, Trevor},
publisher={Proceedings of the 2nd Workshop on Neural Machine Translation and Generation},
pages={18--24},
year={2018}
}
@inproceedings{DBLP:conf/mtsummit/ImankulovaDFI19,
author = {Aizhan Imankulova and
Raj Dabre and
Atsushi Fujita and
Kenji Imamura},
title = {Exploiting Out-of-Domain Parallel Data through Multilingual Transfer
Learning for Low-Resource Neural Machine Translation},
pages = {128--139},
publisher = {Machine Translation},
year = {2019}
}
@inproceedings{DBLP:conf/emnlp/CurreyH19,
author = {Anna Currey and
Kenneth Heafield},
title = {Zero-Resource Neural Machine Translation with Monolingual Pivot Data},
pages = {99--107},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2019}
}
@inproceedings{DBLP:conf/emnlp/KimPPKN19,
author = {Yunsu Kim and
Petre Petrov and
Pavel Petrushkov and
Shahram Khadivi and
Hermann Ney},
title = {Pivot-based Transfer Learning for Neural Machine Translation between
Non-English Languages},
pages = {866--876},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/iclr/LampleCDR18,
author = {Guillaume Lample and
Alexis Conneau and
Ludovic Denoyer and
Marc'Aurelio Ranzato},
title = {Unsupervised Machine Translation Using Monolingual Corpora Only},
publisher = {International Conference on Learning Representations},
year = {2018}
}
@inproceedings{DBLP:conf/iclr/ArtetxeLAC18,
author = {Mikel Artetxe and
Gorka Labaka and
Eneko Agirre and
Kyunghyun Cho},
title = {Unsupervised Neural Machine Translation},
publisher = {International Conference on Learning Representations},
year = {2018}
}
@inproceedings{DBLP:conf/conll/LiuBH18,
author = {Ming Liu and
Wray L. Buntine and
Gholamreza Haffari},
title = {Learning to Actively Learn Neural Machine Translation},
pages = {334--344},
publisher = {The SIGNLL Conference on Computational Natural Language Learning},
year = {2018}
}
@inproceedings{DBLP:conf/emnlp/ZhaoZZZ20,
author = {Yuekai Zhao and
Haoran Zhang and
Shuchang Zhou and
Zhihua Zhang},
title = {Active Learning Approaches to Enhancing Neural Machine Translation:
An Empirical Study},
pages = {1796--1806},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2020}
}
@inproceedings{Peris2018ActiveLF,
title={Active Learning for Interactive Neural Machine Translation of Data Streams},
author={{\'A}lvaro Peris and Francisco Casacuberta},
publisher={The SIGNLL Conference on Computational Natural Language Learning},
pages={151--160},
year={2018}
}
@inproceedings{DBLP:journals/pbml/TurchiNFF17,
author = {Marco Turchi and
Matteo Negri and
M. Amin Farajian and
Marcello Federico},
title = {Continuous Learning from Human Post-Edits for Neural Machine Translation},
publisher = {The Prague Bulletin of Mathematical Linguistics},
volume = {108},
pages = {233--244},
year = {2017}
}
@inproceedings{DBLP:journals/csl/PerisC19,
author = {{\'{A}}lvaro Peris and
Francisco Casacuberta},
title = {Online learning for effort reduction in interactive neural machine
translation},
publisher = {Computer Speech {\&} Language},
volume = {58},
pages = {98--126},
year = {2019}
}
@inproceedings{DBLP:conf/eccv/GuoHZZDSH18,
author = {Sheng Guo and
Weilin Huang and
Haozhi Zhang and
Chenfan Zhuang and
Dengke Dong and
Matthew R. Scott and
Dinglong Huang},
title = {CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images},
series = {Lecture Notes in Computer Science},
volume = {11214},
pages = {139--154},
publisher = {European Conference on Computer Vision},
year = {2018}
}
@inproceedings{DBLP:conf/mm/JiangMMH14,
author = {Lu Jiang and
Deyu Meng and
Teruko Mitamura and
Alexander G. Hauptmann},
title = {Easy Samples First: Self-paced Reranking for Zero-Example Multimedia
Search},
pages = {547--556},
publisher = {ACM International Conference on Multimedia},
year = {2014}
}
@inproceedings{DBLP:conf/naacl/PlataniosSNPM19,
author = {Emmanouil Antonios Platanios and
Otilia Stretcu and
Graham Neubig and
Barnab{\'{a}}s P{\'{o}}czos and
Tom M. Mitchell},
title = {Competence-based Curriculum Learning for Neural Machine Translation},
pages = {1162--1172},
publisher = {Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
year = {2019}
}
@inproceedings{DBLP:conf/acl/TayWLFPYRHZ19,
author = {Yi Tay and
Shuohang Wang and
Anh Tuan Luu and
Jie Fu and
Minh C. Phan and
Xingdi Yuan and
Jinfeng Rao and
Siu Cheung Hui and
Aston Zhang},
title = {Simple and Effective Curriculum Pointer-Generator Networks for Reading
Comprehension over Long Narratives},
pages = {4922--4931},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/icml/GuoCZZ0HT20,
author = {Yong Guo and
Yaofo Chen and
Yin Zheng and
Peilin Zhao and
Jian Chen and
Junzhou Huang and
Mingkui Tan},
title = {Breaking the Curse of Space Explosion: Towards Efficient {NAS} with
Curriculum Search},
series = {Proceedings of Machine Learning Research},
volume = {119},
pages = {3822--3831},
publisher = {International Conference on Machine Learning},
year = {2020}
}
@inproceedings{DBLP:conf/ranlp/KocmiB17,
author = {Tom Kocmi and
Ond{\v{r}}ej Bojar},
title = {Curriculum Learning and Minibatch Bucketing in Neural Machine Translation},
pages = {379--386},
publisher = {International Conference Recent Advances in Natural Language Processing},
year = {2017}
}
@inproceedings{DBLP:conf/naacl/ZhangSKMCD19,
author = {Xuan Zhang and
Pamela Shapiro and
Gaurav Kumar and
Paul McNamee and
Marine Carpuat and
Kevin Duh},
title = {Curriculum Learning for Domain Adaptation in Neural Machine Translation},
pages = {1903--1915},
publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{zhang2018empirical,
title={An empirical exploration of curriculum learning for neural machine translation},
author={Zhang, Xuan and Kumar, Gaurav and Khayrallah, Huda and Murray, Kenton and Gwinnup, Jeremy and Martindale, Marianna J and McNamee, Paul and Duh, Kevin and Carpuat, Marine},
publisher={arXiv preprint arXiv:1811.00739},
year={2018}
}
@inproceedings{DBLP:conf/coling/XuHJFWHJXZ20,
author = {Chen Xu and
Bojie Hu and
Yufan Jiang and
Kai Feng and
Zeyang Wang and
Shen Huang and
Qi Ju and
Tong Xiao and
Jingbo Zhu},
title = {Dynamic Curriculum Learning for Low-Resource Neural Machine Translation},
pages = {3977--3989},
publisher = {International Committee on Computational Linguistics},
year = {2020}
}
@inproceedings{DBLP:conf/acl/ZhouYWWC20,
author = {Yikai Zhou and
Baosong Yang and
Derek F. Wong and
Yu Wan and
Lidia S. Chao},
title = {Uncertainty-Aware Curriculum Learning for Neural Machine Translation},
pages = {6934--6944},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}
@inproceedings{DBLP:conf/aaai/ZhaoWNW20,
author = {Mingjun Zhao and
Haijiang Wu and
Di Niu and
Xiaoli Wang},
title = {Reinforced Curriculum Learning on Pre-Trained Neural Machine Translation
Models},
pages = {9652--9659},
publisher = {AAAI Conference on Artificial Intelligence},
year = {2020}
}
@inproceedings{DBLP:conf/cvpr/PentinaSL15,
author = {Anastasia Pentina and
Viktoriia Sharmanska and
Christoph H. Lampert},
title = {Curriculum learning of multiple tasks},
pages = {5492--5500},
publisher = {{IEEE} Conference on Computer Vision and Pattern Recognition},
year = {2015}
}
@inproceedings{DBLP:conf/iccvw/SarafianosGNK17,
author = {Nikolaos Sarafianos and
Theodore Giannakopoulos and
Christophoros Nikou and
Ioannis A. Kakadiaris},
title = {Curriculum Learning for Multi-task Classification of Visual Attributes},
pages = {2608--2615},
publisher = {{IEEE} International Conference on Computer Vision},
year = {2017}
}
@inproceedings{DBLP:conf/nips/ChangLM17,
author = {Haw{-}Shiuan Chang and
Erik G. Learned{-}Miller and
Andrew McCallum},
title = {Active Bias: Training More Accurate Neural Networks by Emphasizing
High Variance Samples},
publisher = {Conference and Workshop on Neural Information Processing Systems},
pages = {1002--1012},
year = {2017}
}
@inproceedings{DBLP:journals/pami/LiH18a,
author = {Zhizhong Li and
Derek Hoiem},
title = {Learning without Forgetting},
publisher = {{IEEE} Transactions on Pattern Analysis and Machine Intelligence},
volume = {40},
number = {12},
pages = {2935--2947},
year = {2018}
}
@inproceedings{rusu2016progressive,
title={Progressive neural networks},
author={Rusu, Andrei A and Rabinowitz, Neil C and Desjardins, Guillaume and Soyer, Hubert and Kirkpatrick, James and Kavukcuoglu, Koray and Pascanu, Razvan and Hadsell, Raia},
publisher={arXiv preprint arXiv:1606.04671},
year={2016}
}
@inproceedings{DBLP:journals/corr/FernandoBBZHRPW17,
author = {Chrisantha Fernando and
Dylan Banarse and
Charles Blundell and
Yori Zwols and
David Ha and
Andrei A. Rusu and
Alexander Pritzel and
Daan Wierstra},
title = {PathNet: Evolution Channels Gradient Descent in Super Neural Networks},
publisher = {CoRR},
volume = {abs/1701.08734},
year = {2017}
}
@inproceedings{DBLP:conf/naacl/ThompsonGKDK19,
author = {Brian Thompson and
Jeremy Gwinnup and
Huda Khayrallah and
Kevin Duh and
Philipp Koehn},
title = {Overcoming Catastrophic Forgetting During Domain Adaptation of Neural
Machine Translation},
pages = {2062--2068},
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:conf/coling/GuF20,
author = {Shuhao Gu and
Yang Feng},
title = {Investigating Catastrophic Forgetting During Continual Training for
Neural Machine Translation},
pages = {4315--4326},
publisher = {International Committee on Computational Linguistics},
year = {2020}
}
%%%%% chapter 13------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%