Commit e4b0260a by 曹润柘

Merge branch 'caorunzhe' into 'master'

Caorunzhe

See merge request !665
parents 8d1e72b9 c51bfc7e
...@@ -660,14 +660,14 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
\parinterval 由于含有噪声的翻译数据通常都具有较为明显的特征,因此可以利用句子长度比、词对齐率、最长连续未对齐序列长度等一些启发式特征来进行综合评分(MT Detection in Web-Scraped Parallel Corpora;Parallel Corpus Refinement as an Outlier Detection Algorithm;Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora);也可以将该问题转化为文本分类或跨语言文本蕴含任务来进行筛选(Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation;Identifying Semantic Divergences in Parallel Text without Annotations);此外,从某种意义上来说,数据降噪其实也可以算是一种领域数据选择,因为它的目标是选择可信度高的样本,因此可以人工构建一个可信度高的小型数据集,然后利用该数据集和通用数据集之间的差异性进行选择(Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection)。
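\parinterval 为了更直观地说明上述启发式评分的思路,下面给出一段示意性的Python代码(仅为示意:其中的分词方式、特征权重以及词对齐输入均为假设,并非上述文献中系统的真实实现):
\begin{verbatim}
# 示意:利用启发式特征对双语句对进行噪声评分
def noise_score(src_tokens, tgt_tokens, aligned_tgt_idx, w=(1.0, 1.0, 1.0)):
    aligned = set(aligned_tgt_idx)   # 目标语言端被对齐到的单词下标(假设由对齐工具给出)
    tgt_len = max(len(tgt_tokens), 1)
    # 特征1:句子长度比(与1偏差越大越可疑)
    f_len = abs(len(tgt_tokens) / max(len(src_tokens), 1) - 1.0)
    # 特征2:词对齐率(未被对齐的目标语言单词比例)
    f_align = 1.0 - len(aligned) / tgt_len
    # 特征3:最长连续未对齐序列长度(做长度归一化)
    longest, cur = 0, 0
    for j in range(len(tgt_tokens)):
        cur = cur + 1 if j not in aligned else 0
        longest = max(longest, cur)
    f_gap = longest / tgt_len
    # 加权综合:得分越高,句对越可能含有噪声
    return w[0] * f_len + w[1] * f_align + w[2] * f_gap
\end{verbatim}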
\parinterval 早期的工作大多关注过滤的方法,对于模型在噪声数据上的鲁棒性训练和噪声样本的利用探讨较少。事实上,噪声是有强度的,有些噪声数据对于模型可能是有价值的,而且它们的价值可能会随着模型的状态而改变(Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection)。一个例子如图\ref{fig:13-51}所示,
%----------------------------------------------
\begin{figure}[htp]
\centering
\includegraphics[scale=0.5]{./Chapter13/Figures/figure-a-pair-of-noise-data-examples.png}
\caption{一对噪声数据实例}
\label{fig:13-51}
\end{figure}
%-------------------------------------------
...@@ -679,14 +679,14 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
\subsubsection{3. 主动学习}
\parinterval 和数据选择密切相关的另外一个应用是主动学习(Active Learning)。领域适应和数据降噪针对的是拥有标注数据的情况,然而在一些实际的业务场景中,获得标注样本的代价往往比较高,大部分都是未标注数据,那么如何通过机器学习算法来降低人工标注的成本就是一个很有实际意义的问题,这个研究方向也被称为主动学习。既然人工标注的成本很大,那么就应该尽可能选择那些最有价值的样本交给人工来标注,之后再将标注的数据用于训练,从而逐步提升模型的效果,这也是主动学习的整体思路。主动学习主要由五个部分组成,包括:未标注样本池(unlabeled pool)、筛选策略(select queries)、标注者(human annotator)、标注数据集(labeled training set)、目标模型(machine learning model),如图\ref{fig:13-52}所示,整个过程以不断迭代的训练方式更新模型性能、未标注样本池和标注数据集,直到目标模型达到预设的性能或者不再提供标注数据为止。
%----------------------------------------------
\begin{figure}[htp]
\centering
\includegraphics[scale=0.5]{./Chapter13/Figures/figure-active-learning-framework.png}
\caption{主动学习框架}
\label{fig:13-52}
\end{figure}
%-------------------------------------------
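\parinterval 上述迭代过程可以用一段示意性的Python代码来概括(仅为示意:其中的不确定性打分、模型训练接口等均为假设,具体的筛选策略需要根据任务设计):
\begin{verbatim}
# 示意:主动学习的迭代流程
def active_learning(model, unlabeled_pool, labeled_set, budget, batch_size, annotate):
    # annotate(x) 表示标注者对样本 x 进行人工标注
    while budget > 0 and len(unlabeled_pool) > 0:
        # 1. 筛选策略:这里假设按模型预测的不确定性从高到低选取样本
        ranked = sorted(unlabeled_pool, key=lambda x: model.uncertainty(x), reverse=True)
        queries = ranked[:min(batch_size, budget)]
        # 2. 交给标注者,扩充标注数据集
        labeled_set = labeled_set + [(x, annotate(x)) for x in queries]
        unlabeled_pool = [x for x in unlabeled_pool if x not in queries]
        budget -= len(queries)
        # 3. 用新的标注数据集更新目标模型
        model.train(labeled_set)
    return model, labeled_set
\end{verbatim}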
...@@ -736,14 +736,14 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
\vspace{0.5em}
\end{itemize}
\parinterval 可以把这两个问题抽象成两个模块:难度评估器和训练调度器,课程学习的大致流程如图\ref{fig:13-53}所示:
%----------------------------------------------
\begin{figure}[htp]
\centering
\includegraphics[scale=0.5]{./Chapter13/Figures/figure-curriculum-learning-framework.png}
\caption{课程学习框架}
\label{fig:13-53}
\end{figure}
%-------------------------------------------
...@@ -751,25 +751,25 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
\parinterval 评估样本的难度和具体的任务相关。在神经机器翻译中,有很多种评估方法,比如可以利用语言学上的难度准则,如句子长度、句子平均词频、句子语法解析树深度等(Competence-based curriculum learning for neural machine translation;Curriculum Learning and Minibatch Bucketing in Neural Machine Translation)。这些准则本质上属于人类的先验知识,符合人类的直觉,但不一定和模型相匹配,对人类来说简单的句子对模型来说并不总是容易的,所以研究人员也提出了模型自动评估的方法,比如:利用语言模型(Dynamically Composing Domain-Data Selection with Clean-Data Selection by “Co-Curricular Learning” for Neural Machine Translation;Curriculum Learning for Domain Adaptation in Neural Machine Translation),利用神经机器翻译模型(An empirical exploration of curriculum learning for neural machine translation;Dynamic Curriculum Learning for Low-Resource Neural Machine Translation)等。值得注意的是,利用神经机器翻译模型来打分的方法分为静态和动态两种:静态的方法是利用在小数据集上训练的、更小的NMT模型来打分(An empirical exploration of curriculum learning for neural machine translation);动态的方法则是利用当前模型的状态来打分,这在广义上也叫作自步学习(Self-Paced Learning),具体可以利用模型的训练误差或其变化率等(Dynamic Curriculum Learning for Low-Resource Neural Machine Translation)。
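\parinterval 以语言学准则为例,样本难度的计算可以用如下示意性的Python代码表示(仅为示意:其中的分词、词频表和权重均为假设):
\begin{verbatim}
import math

# 示意:用句子长度和平均词频(稀有度)估计一个源语言句子的难度
def difficulty(src_tokens, word_freq, alpha=0.5):
    # 句子越长越难;单词在训练集中的相对词频越低(越稀有)越难
    f_len = len(src_tokens)
    f_rarity = sum(-math.log(word_freq.get(w, 1e-7)) for w in src_tokens) \
               / max(len(src_tokens), 1)
    return alpha * f_len + (1 - alpha) * f_rarity

# 之后即可按难度对训练样本从易到难排序:
# sorted_data = sorted(train_data, key=lambda pair: difficulty(pair[0], word_freq))
\end{verbatim}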
\parinterval 虽然样本的难度度量在不同的数据类型和任务中有所不同,但第二个问题,即课程规划,通常与数据和任务无关。换句话说,在各种场景中,大多数课程学习都利用了类似的调度策略。具体而言,调度策略可以分为预定义的和自动的两种。预定义的方法通常是将按照难易程度排序好的样本划分为块,每个块中包含一定数量的难度相似的样本,如图\ref{fig:13-54}所示:
%----------------------------------------------
\begin{figure}[htp]
\centering
\includegraphics[scale=0.5]{./Chapter13/Figures/figure-sample-block-partition.jpg}
\caption{样本块划分}
\label{fig:13-54}
\end{figure}
%-------------------------------------------
\parinterval 然后按照“先易后难”的原则人工定义一个调度策略,比如早期一种较为流行的方法是:在训练早期模型只在简单块中进行采样,随着训练过程的进行,比如在固定数量的训练轮次之后,将下一个块的样本合并到当前训练子集中,继续训练,直到合并了整个数据块,即整个训练集可见为止,之后再继续进行几个额外轮次的训练直到收敛。示意图如图\ref{fig:13-55}所示:
%----------------------------------------------
\begin{figure}[htp]
\centering
\includegraphics[scale=0.5]{./Chapter13/Figures/figure-a-predefined-course-planning.jpg}
\caption{一种预定义的课程规划}
\label{fig:13-55}
\end{figure}
%-------------------------------------------
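\parinterval 这种预定义的调度策略可以用如下示意性的Python代码描述(仅为示意:块的数量、每个阶段的训练轮次以及模型训练接口均为假设):
\begin{verbatim}
# 示意:预定义的“先易后难”课程规划
def predefined_curriculum(model, sorted_data, num_blocks, epochs_per_stage, extra_epochs):
    # 将按难度排好序的数据划分为若干个难度相近的块
    size = len(sorted_data) // num_blocks
    blocks = [sorted_data[i * size:(i + 1) * size] for i in range(num_blocks)]
    blocks[-1] += sorted_data[num_blocks * size:]     # 余下的样本并入最后一块
    visible = []
    for block in blocks:
        visible = visible + block                     # 将下一个块合并到当前训练子集
        for _ in range(epochs_per_stage):
            model.train_one_epoch(visible)
    for _ in range(extra_epochs):                     # 整个训练集可见后再训练若干轮直到收敛
        model.train_one_epoch(visible)
    return model
\end{verbatim}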
......
...@@ -3,6 +3,7 @@
\tikzstyle{snode} = [draw,inner sep=1pt,minimum width=3em,minimum height=0.5em,rounded corners=1pt,fill=green!20!white]
\tikzstyle{pnode} = [draw,inner sep=1pt,minimum width=1em,minimum height=0.5em,rounded corners=1pt]
\node [anchor=west] (des) at (1.5,3) {\normalsize\bfnew{$\bm{m}$:显存\quad$\bm{t}$:时间\quad$\bm{m_1>m_2}$\quad$\bm{t_1>t_2}$}};
\node [anchor=west,snode] (s1) at (0,0) {\tiny{}};
\node [anchor=north west,snode,minimum width=6.3em] (s2) at ([yshift=-0.3em]s1.south west) {\tiny{}};
\node [anchor=north west,snode,minimum width=2em] (s3) at ([yshift=-0.3em]s2.south west) {\tiny{}};
...@@ -76,13 +77,5 @@
\draw [very thick,decorate,decoration={brace}] ([xshift=3pt]box1.north east) to node [midway,name=final] {} ([xshift=3pt]box1.south east);
\draw [very thick,decorate,decoration={brace}] ([xshift=3pt]box3.north east) to node [midway,name=final] {} ([xshift=3pt]box3.south east);
\end{tikzpicture}
\ No newline at end of file
%%%------------------------------------------------------------------------------------------------------------
\begin{tikzpicture}
\scriptsize{
\begin{axis}[
width=8cm,
height=5cm,
yticklabel style={/pgf/number format/.cd,fixed,precision=2},
xticklabel style={/pgf/number format/.cd,fixed,precision=2},
xlabel={\footnotesize{$\log$\;(束大小)}},ylabel={\footnotesize{BLEU\ (\%)}},
ymin=28.8,ymax=30.4,
xmin=0,xmax=7,
xtick={0,1,2,3,4,5,6,7},
ytick={28.8,29.0,29.2,29.4,29.6,29.8,30.0,30.2,30.4},
xticklabels={0,1,2,3,4,5,6,7},
yticklabels={28.8,29.0,29.2,29.4,29.6,29.8,30.0,30.2,30.4},
legend style={yshift=-5em,xshift=0em,legend cell align=left,legend plot pos=right}
]
\addplot[purple,mark=square,mark=star,very thick] coordinates {(0,29.3) (1,29.7) (1.58,30.05) (2.32,30.1) (2.73,30.2) (3.32,30.3) (3.84,30.2) (4.23,30.08) (4.91,29.98) (5.81,29.6)(6.64,28.8) };
\end{axis}
}
\end{tikzpicture}
......
...@@ -119,9 +119,9 @@
\draw[->,thick] ([yshift=-0.15em]dot3\i.north) -- ([yshift=-0.3em]attn2\i.south);
}
\node[anchor=north,align=left,inner sep=1pt,font=\footnotesize] () at (dot31.south) {\small{(a) 标准的多层自注意力}};
\node[anchor=north,align=left,inner sep=1pt,font=\footnotesize] () at (dot32.south) {\small{(b) 共享自注意力}};
\node[anchor=north,align=left,inner sep=1pt,font=\footnotesize] () at (dot33.south) {\small{(c) 共享编码-解码注意力}};
\end{scope}
\end{tikzpicture}
\ No newline at end of file
...@@ -3,7 +3,7 @@
\centering
\hspace*{\fill}
\subfigure[\small{假设选择}]
{
\begin{tikzpicture}[scale=0.5]
\tikzstyle{system} = [rectangle,very thick,minimum width=1cm,font=\tiny];
...@@ -33,7 +33,7 @@
\end{tikzpicture}
}
\hfill
\subfigure[\small{预测融合}]
{
\begin{tikzpicture}[scale=0.5]
\tikzstyle{system} = [rectangle,very thick,minimum width=1cm,font=\tiny];
...@@ -60,7 +60,7 @@
}
\hspace*{\fill}
\\
\subfigure[\small{译文重组}]
{
\begin{tikzpicture}[scale=0.5]
\tikzstyle{system} = [rectangle,very thick,minimum width=1cm,font=\tiny];
......
...@@ -65,7 +65,7 @@
\draw [->,line width=1pt] (out3) |- (plabel9.east);
\end{scope}
\node [anchor=north,font=\scriptsize] () at ([yshift=-0.2em]STANDARD.south) {\small(a) 标准方法};
\node [anchor=north,font=\scriptsize] () at ([xshift=1.2em]SELECTION.south) {\small(b) 词汇选择};
\end{tikzpicture}
\ No newline at end of file
...@@ -42,7 +42,7 @@
\node [] () at ([yshift=0.4cm]output3.north) {$\vdots$};
\end{scope}
\node [align=center,anchor=north,font=\small] () at ([yshift=-0.3cm]MULTIPLE.south) {\small{(a) 多系统输出结果融合}};
\node [align=center,anchor=north,font=\small] () at ([yshift=-0.3cm]SINGLE.south) {\small{(b) 单系统多输出结果融合}};
\end{tikzpicture}
\ No newline at end of file
...@@ -6,10 +6,10 @@
\begin{tikzpicture}[node distance = 0,scale = 0.75]
\tikzstyle{every node}=[scale=0.75]
\node (encoder)[er,very thick,draw=taupegray,fill=ugreen!20]{\Large{编码器}};
\node (decoder_1)[er,very thick,draw=taupegray,right of=encoder,xshift=4cm,fill=red!20]{\Large{解码器}};
\node (decoder_2)[er,very thick,draw=taupegray,right of=decoder_1,xshift=4cm,fill=red!20]{\Large{解码器}};
\node (point)[right of=decoder_2,xshift=2.5cm,]{\LARGE{...}};
\node (decoder_3)[er,very thick,draw=taupegray,right of=point,xshift=2.5cm,fill=red!20]{\Large{解码器}};
\draw [->,very thick,draw=black!70]([xshift=0.2cm]encoder.east) -- ([xshift=-0.2cm]decoder_1.west);
\draw [->,very thick,draw=black!70]([xshift=0.2cm]decoder_1.east) -- ([xshift=-0.2cm]decoder_2.west);
\draw [->,very thick,draw=black!70]([xshift=0.2cm]decoder_2.east) -- ([xshift=-0.1cm]point.west);
......
...@@ -5,7 +5,7 @@
\node [anchor=south] (text) at ([xshift=0.5em,yshift=-3.5em]part1.south) {\scriptsize{源语言句子(编码器输出)}};
\node [anchor=east,draw=black!70,rounded corners,drop shadow,very thick,minimum width=6em,minimum height=3.5em,fill=blue!15,align=center,text=black] (part2) at ([xshift=10em]part1.east) {\scriptsize{搜索模块}};
\node [anchor=south] (text1) at ([xshift=0.5em,yshift=2.2em]part1.north) {\scriptsize{译文中已经生成的单词}};
\node [anchor=south] (text2) at ([xshift=0.5em,yshift=2.2em]part2.north) {\scriptsize{预测当前位置的单词分布}};
\draw [->,draw=black, thick] ([yshift=2em]part1.north) -- ([yshift=0.1em]part1.north);
......
...@@ -4,15 +4,16 @@
%%% outline
%-------------------------------------------------------------------------
\begin{tikzpicture}
\tikzstyle{word} = [font=\scriptsize]
\tikzstyle{tgt} = [minimum height=1.6em,minimum width=5.2em,fill=black!10!yellow!30,font=\footnotesize,drop shadow={shadow xshift=0.15em,shadow yshift=-0.15em,}]
\tikzstyle{p} = [fill=ugreen!15,minimum width=0.4em,inner sep=0pt]
\node[ rounded corners=3pt, fill=red!20, drop shadow, minimum width=10em,minimum height=4em,draw] (encoder) at (0,0) {Transformer 编码器 };
\node[anchor=west, rounded corners=3pt, fill=blue!20, drop shadow, minimum width=14em,minimum height=4em,draw] (decoder) at ([xshift=0.8cm]encoder.east) {Transformer 解码器};
\node[anchor=north,word] (en1) at ([yshift=-1.3em,xshift=-3em]encoder.south) {}; \node[anchor=north,word] (en1) at ([yshift=-1.3em,xshift=-3em]encoder.south) {};
\node[anchor=north,word] (en2) at ([yshift=-1.3em]encoder.south) {}; \node[anchor=north,word] (en2) at ([yshift=-1.3em,xshift=-1em]encoder.south) {};
\node[anchor=north,word] (en3) at ([yshift=-1.3em,xshift=3em]encoder.south) {}; \node[anchor=north,word] (en3) at ([yshift=-1.3em,xshift=1em]encoder.south) {};
\node[anchor=north,word] (en4) at ([yshift=-1.3em,xshift=3em]encoder.south) {};
\node[anchor=north,word] (de1) at ([yshift=-1.3em,xshift=-4em]decoder.south) {1}; \node[anchor=north,word] (de1) at ([yshift=-1.3em,xshift=-4em]decoder.south) {1};
...@@ -24,7 +25,7 @@
\node[p,anchor=south,minimum height=0.7em] (w1_3) at ([xshift=0.3em]w1_2.south east){};
\node[p,anchor=south,minimum height=0.6em] (w1_4) at ([xshift=0.3em]w1_3.south east){};
\node[p,anchor=south,minimum height=0.2em] (w1_5) at ([xshift=0.3em]w1_4.south east){};
\node[p,anchor=south,minimum height=1.9em] (w1_6) at ([xshift=0.3em]w1_5.south east){};
\node[p,anchor=south,minimum height=0.6em] (w1_7) at ([xshift=0.3em]w1_6.south east){};
\node[p,anchor=south,minimum height=0.8em] (w1_8) at ([xshift=0.3em]w1_7.south east){};
...@@ -37,30 +38,32 @@
\node[p,anchor=south,minimum height=0.6em] (w2_7) at ([xshift=0.3em]w2_6.south east){};
\node[p,anchor=south,minimum height=0.8em] (w2_8) at ([xshift=0.3em]w2_7.south east){};
\node[p,anchor=south, minimum height=0.4em] (w3_1) at ([xshift=3.2em,yshift=1.5em]decoder.north){};
\node[p,anchor=south,minimum height=0.5em] (w3_2) at ([xshift=0.3em]w3_1.south east){};
\node[p,anchor=south,minimum height=0.7em] (w3_3) at ([xshift=0.3em]w3_2.south east){};
\node[p,anchor=south,minimum height=2em] (w3_4) at ([xshift=0.3em]w3_3.south east){};
\node[p,anchor=south,minimum height=0.8em] (w3_5) at ([xshift=0.3em]w3_4.south east){};
\node[p,anchor=south,minimum height=0.3em] (w3_6) at ([xshift=0.3em]w3_5.south east){};
\node[p,anchor=south,minimum height=0.4em] (w3_7) at ([xshift=0.3em]w3_6.south east){};
\node[p,anchor=south,minimum height=0.6em] (w3_8) at ([xshift=0.3em]w3_7.south east){};
\node[inner sep=0pt,font=\scriptsize] at ([yshift=0.3em]w1_2.north){Good};
\node[inner sep=0pt,font=\scriptsize] at ([yshift=0.3em]w1_6.north){Well};
\node[inner sep=0pt,font=\scriptsize] at ([yshift=0.3em]w2_2.north){job};
\node[inner sep=0pt,font=\scriptsize] at ([yshift=0.3em]w2_5.north){done};
\node[inner sep=0pt,font=\scriptsize] at ([yshift=0.3em]w3_4.north){!};
\draw[->, thick] ([yshift=0.1em]en1.north) -- ([xshift=-3em,yshift=-0.1em]encoder.south);
\draw[->, thick] ([yshift=0.1em]en2.north) -- ([xshift=-1em,yshift=-0.1em]encoder.south);
\draw[->, thick] ([yshift=0.1em]en3.north) -- ([xshift=1em,yshift=-0.1em]encoder.south);
\draw[->, thick] ([yshift=0.1em]en4.north) -- ([xshift=3em,yshift=-0.1em]encoder.south);
\draw[->, thick] ([yshift=0.1em]de1.north) -- ([xshift=-4em,yshift=-0.1em]decoder.south);
\draw[->, thick] ([yshift=0.1em]de2.north) -- ([xshift=0em,yshift=-0.1em]decoder.south);
\draw[->, thick] ([yshift=0.1em]de3.north) -- ([xshift=4em,yshift=-0.1em]decoder.south);
\draw[->, line width=1.5pt] (encoder.east) -- (decoder.west);
\begin{pgfonlayer}{background}
{
...@@ -70,14 +73,14 @@
}
\end{pgfonlayer}
\draw[->,thick] ([yshift=-1.2em]box1.south) -- (box1.south);
\draw[->, thick] ([yshift=-1.2em]box2.south) -- (box2.south);
\draw[->, thick] ([yshift=-1.2em]box3.south) -- (box3.south);
\node[tgt,anchor=west,align=left] (tgt1) at ([xshift=2em]box3.east) {Good job!};
\node[tgt,,anchor=north,align=left](tgt2) at ([yshift=-1em]tgt1.south) {Well done!};
\node[tgt,,anchor=north,align=left] (tgt3) at ([yshift=-1em]tgt2.south) {Good done!};
\node[tgt,,anchor=north,align=left] (tgt4) at ([yshift=-1em]tgt3.south) {Well job!};
\node[text=ugreen] at ([xshift=1em]tgt1.east){\ding{51}};
\node[text=ugreen] at ([xshift=1em]tgt2.east){\ding{51}};
\node[text=red] at ([xshift=1em]tgt3.east){\ding{55}};
......
...@@ -4,53 +4,48 @@
%%% outline
%-------------------------------------------------------------------------
\begin{tikzpicture}
\tikzstyle{word} = [font=\scriptsize]
\node[rounded corners=3pt, fill=red!20, drop shadow, minimum width=10em,minimum height=4em,draw] (encoder) at (0,0) {Transformer 编码器 };
\node[draw,anchor=west, rounded corners=2pt, fill=orange!20,minimum width=2.5cm,minimum height=2em] (attention) at ([xshift=0.8cm]encoder.east) {注意力模块};
\node[anchor=west, rounded corners=3pt, fill=blue!20, drop shadow, minimum width=10em,minimum height=4em,draw] (decoder) at ([xshift=0.8cm]attention.east) {Transformer 解码器};
\node[anchor=north,word] (en1) at ([yshift=-1.3em,xshift=-3em]encoder.south) {hello};
\node[anchor=north,word] (en2) at ([yshift=-1.6em,xshift=-1em]encoder.south) {,};
\node[anchor=north,word] (en3) at ([yshift=-1.3em,xshift=1em]encoder.south) {world};
\node[anchor=north,word] (en4) at ([yshift=-1.3em,xshift=3em]encoder.south) {!};
\node[anchor=north,word] (de1) at ([yshift=-1.3em,xshift=-3em]decoder.south) {1};
\node[anchor=north,word] (de2) at ([yshift=-1.3em,xshift=-1em]decoder.south) {2};
\node[anchor=north,word] (de3) at ([yshift=-1.3em,xshift=1em]decoder.south) {3};
\node[anchor=north,word] (de4) at ([yshift=-1.3em,xshift=3em]decoder.south) {4};
\node[anchor=south,word] (out1) at ([yshift=1.3em,xshift=-3em]decoder.north) {你好};
\node[anchor=south,word] (out2) at ([yshift=1.3em,xshift=-1em]decoder.north) {,};
\node[anchor=south,word] (out3) at ([yshift=1.3em,xshift=1em]decoder.north) {世界};
\node[anchor=south,word] (out4) at ([yshift=1.3em,xshift=3em]decoder.north) {!};
\draw[->, thick] ([yshift=0.1em]en1.north) -- ([xshift=-3em,yshift=-0.1em]encoder.south);
\draw[->, thick] ([yshift=0.3em]en2.north) -- ([xshift=-1em,yshift=-0.1em]encoder.south);
\draw[->, thick] ([yshift=0.1em]en3.north) -- ([xshift=1em,yshift=-0.1em]encoder.south);
\draw[->, thick] ([yshift=0.1em]en4.north) -- ([xshift=3em,yshift=-0.1em]encoder.south);
\draw[->,thick] ([yshift=0.1em]de1.north) -- ([xshift=-3em]decoder.south);
\draw[->, thick] ([yshift=0.1em]de2.north) -- ([xshift=-1em]decoder.south);
\draw[->, thick] ([yshift=0.1em]de3.north) -- ([xshift=1em]decoder.south);
\draw[->, thick] ([yshift=0.1em]de4.north) -- ([xshift=3em]decoder.south);
\draw[->, thick] ([xshift=-3em,yshift=0.1em]decoder.north) -- ([yshift=-0.1em]out1.south);
\draw[->, thick] ([xshift=-1em,yshift=0.1em]decoder.north) -- ([yshift=-0.1em]out2.south);
\draw[->, thick] ([xshift=1em,yshift=0.1em]decoder.north) -- ([yshift=-0.1em]out3.south);
\draw[->, thick] ([xshift=3em,yshift=0.1em]decoder.north) -- ([yshift=-0.1em]out4.south);
\draw[->,line width=1.5pt] (encoder.east) -- (attention.west);
\draw[->,line width=1.5pt] (attention.east) -- (decoder.west);
\draw[decorate,decoration={brace, mirror},ublue, very thick] ([xshift=0.5em,yshift=-0.4em]de1.-135) -- node[font=\scriptsize,text=black,yshift=-1em]{预测译文长度 \& 计算位置编码}([xshift=-0.5em,yshift=-0.4em]de4.-45);
%\begin{pgfonlayer}{background}
%{
......
...@@ -73,8 +73,13 @@
\node[draw=taupegray,thick,fill=blue!7,inner sep=0pt,minimum height=13.3em,minimum width=9.5em,rounded corners=4pt,drop shadow] (box3) at (12em,10.1em){};
}
\end{pgfonlayer}
\node[] at ([yshift=1.8em]box2.north){\normalsize{译文长度:5}};
\node[] at ([xshift=-2em,yshift=0.5em]box2.west){\normalsize{繁衍率}};
\node[] at ([xshift=-2em,yshift=-0.5em]box2.west){\normalsize{预测器}};
\node[] at ([xshift=-2em]box1.west){\normalsize{编码器}};
\node[] at ([xshift=-1em,yshift=-2.5em]box1.west){{$M \times$}};
\node[] at ([xshift=2em]box3.east){\normalsize{解码器}};
\node[] at ([xshift=1em,yshift=-6em]box3.east){{$\times N$}};
\draw[line,dotted,rounded corners=4pt,violet] (box2.north) -- ([yshift=1em]box2.north) -- ([yshift=1em,xshift=6.2em]box2.north) -- ([xshift=-2.2em]tgt_emb.west) -- (tgt_emb.west);
\draw[line,-,dotted,rounded corners=4pt,violet,] (src_emb.east) -- ([xshift=-2em]tgt_emb.west);
......
...@@ -29,7 +29,7 @@
\end{pgfonlayer}
\node[anchor=north,font=\scriptsize,align=center] (w1) at ([yshift=-2em]encoder.south){\scriptsize\bfnew{There exist different} \\ \scriptsize\bfnew{opinions on this question}};
\node[anchor=north,font=\scriptsize,align=center] (w2) at ([yshift=-2em]decoder.south){\scriptsize\bfnew{There exist different} \\ \scriptsize\bfnew{opinions on this question}};
\node[anchor=north,font=\scriptsize,text=gray] (w3) at ([yshift=0.6em]w2.south){\scriptsize\bfnew{(复制源语言句子)}};
\node[anchor=south,font=\scriptsize,align=center] (w4) at ([yshift=1.6em]box2.north){\scriptsize\bfnew{on this question} \\ \scriptsize\bfnew{There exist different opinions}};
\node[anchor=south,font=\scriptsize,align=center] (w5) at ([yshift=1.6em]box3.north){\tiny\bfnew{对\ 这个 \ 问题 \ 存在 \ 不同的 \ 看法}};
\node[font=\tiny] at ([xshift=-0.8em,yshift=-0.6em]encoder.east) {$N\times$};
......
...@@ -16,7 +16,7 @@
\setlength{\fboxsep}{2.2mm} % box size
\begin{tabular}{C{.20\textwidth}C{.20\textwidth}C{.20\textwidth}C{.20\textwidth}}
\setlength{\tabcolsep}{0pt}
\subfigure [\small{自注意力}] {
\begin{tabular}{ccC{1em}}
\setlength{\tabcolsep}{0pt}
~
...@@ -62,7 +62,7 @@
}
&
\subfigure [\small{编码-解码注意力}] {
\setlength{\tabcolsep}{0pt}
\begin{tabular}{ccC{1em}}
\setlength{\tabcolsep}{0pt}
......
...@@ -79,15 +79,15 @@
\begin{itemize}
\vspace{0.5em}
\item 搜索的基本问题在神经机器翻译中有着特殊的现象。比如,在统计机器翻译中,降低搜索错误是提升翻译效果的一种手段。但是在神经机器翻译中,简单地降低搜索错误可能无法带来性能的提升,甚至造成翻译品质的下降\upcite{li-etal-2018-simple,Stahlberg2019OnNS};
\vspace{0.5em}
\item 搜索的时延很高,系统实际部署的成本很高。与统计机器翻译系统不同的是,神经机器翻译依赖大量的浮点运算。这也导致神经机器翻译系统的推断会比统计机器翻译系统慢很多。虽然可以使用GPU来加速神经机器翻译的推断速度,但是也大大增加了成本;
\vspace{0.5em}
\item 神经机器翻译在优化过程中容易陷入局部最优,单模型的表现并不稳定。由于神经机器翻译优化的目标函数非常不光滑,每次训练得到的模型往往只是一个局部最优解。在新数据上使用这个局部最优模型进行推断时,模型的表现可能不稳定。
\vspace{0.5em}
\end{itemize}
\parinterval 研究人员也针对以上问题开展了大量的研究工作。在\ref{sec:14-2}节中,本章会对神经机器翻译推断中所涉及的一些基本问题进行讨论。虽然这些问题在统计机器翻译中均有涉及,但是在神经机器翻译中却有着不同的现象和解决思路。在\ref{sec:14-3}-\ref{sec:14-5}节中,会针对如何改进神经机器翻译推断效率和怎样进行多模型融合这两个问题展开讨论。
%----------------------------------------------------------------------------------------
% NEW SECTION
...@@ -105,32 +105,32 @@
\parinterval 机器翻译有两种常用的推断方式\ \dash \ 自左向右推断和自右向左推断。自左向右推断符合现实世界中人类的语言使用规律,因为人在翻译一个句子时,总是习惯从句子开始的部分往后生成\footnote{有些语言中,文字是自右向左书写,这时自右向左推断更符合人类使用这种语言的习惯。}。不过,有时候人也会使用当前单词后面的译文信息。也就是说,翻译也需要“未来” 的文字信息。于是很容易想到使用自右向左的方法对译文进行生成。
\parinterval 以上两种推断方式在神经机器翻译中都有应用,对于源语言句子$\seq{x}=\{x_1,x_2,\dots,x_m\}$和目标语言句子$\seq{y}=\{y_1,y_2,\dots,y_n\}$,自左向右的翻译可以被描述为公式\eqref{eq:14-1}:
\begin{eqnarray}
\funp{P}(\seq{y}\vert\seq{x}) &=& \prod_{j=1}^n \funp{P}(y_j\vert\seq{y}_{<j},\seq{x})
\label{eq:14-1}
\end{eqnarray}
\parinterval 自右向左的翻译可以被描述为公式\eqref{eq:14-2}:
\begin{eqnarray}
\funp{P}(\seq{y}\vert\seq{x}) &=&\prod_{j=1}^n \funp{P}(y_{n+1-j}\vert\seq{y}_{>j},\seq{x})
\label{eq:14-2}
\end{eqnarray}
\noindent 其中,$\seq{y}_{<j}=\{y_1,y_2,\dots,y_{j-1}\}$,$\seq{y}_{>j}=\{y_{j+1},y_{j+2},\dots,y_n\}$。可以看到,自左向右推断和自右向左推断本质上是一样的。{\chapterten}和{\chaptertwelve}均使用了自左向右的推断方法。自右向左推断比较简单的实现方式是:在训练过程中直接将双语数据中的目标语言句子进行反向,之后仍然使用原始的模型进行训练即可。在推断的时候,生成的目标语言词串也需要进行反向得到最终的译文(这一做法在下面的列表之后给出了一段示意代码)。有时候,使用自右向左的推断方式会取得更好的效果\upcite{DBLP:conf/wmt/SennrichHB16}。不过更多情况下需要同时使用词串左端(历史)和右端(未来)的信息。有多种思路可以融合左右两端信息:
\begin{itemize}
\vspace{0.5em}
\item {\small\sffamily\bfseries{重排序}}\index{重排序}(Reranking)\index{Reranking}。可以用一个基础模型(比如自左向右的模型)得到每个源语言句子的$n$-best翻译结果,之后同时用基础模型的得分和自右向左模型对$n$-best翻译结果进行重排序\upcite{Liu2016AgreementOT,DBLP:conf/wmt/SennrichHB16,DBLP:conf/wmt/LiLXLLLWZXWFCLL19}。也有研究人员利用最小贝叶斯风险的方法进行重排序\upcite{Stahlberg2018TheUO}。由于这类方法不会改变基础模型的翻译过程,因此相对“安全”,不会对系统性能造成副作用。
\vspace{0.5em}
\item {\small\sffamily\bfseries{双向推断}}\index{双向推断}(Bidirectional Inference)\index{Bidirectional Inference}。除了自左向右推断和自右向左推断,另一种方法是让自左向右和自右向左模型同步进行,也就是同时考虑译文左侧和右侧的文字信息\upcite{DBLP:conf/aaai/ZhangSQLJW18,Zhou2019SynchronousBN,DBLP:conf/aaai/ZhangSQLJW18}。例如,可以同时对左边和右边生成的译文进行注意力计算,得到当前位置的单词预测结果。这种方法能够更加充分地融合双向翻译的优势。
\vspace{0.5em}
\item {\small\sffamily\bfseries{多阶段推断}}\index{多阶段推断}(Multi-stage Inference)\index{Multi-stage Inference}。在第一阶段,通过一个基础模型生成一个初步的翻译结果。在第二阶段,同时使用第一阶段生成的翻译结果和源语言句子,进一步生成更好的译文\upcite{Li2017EnhancedNM,ElMaghraby2018EnhancingTF,Geng2018AdaptiveMD}。由于第一阶段的结果已经包含了完整的译文信息,因此在第二阶段中,系统实际上已经同时使用了整个译文串的两端信息。上述过程可以扩展为迭代式的译文生成方法,配合掩码等技术,可以在生成每个译文单词时,同时考虑左右两端的上下文信息\upcite{Lee2018DeterministicNN,Gu2019LevenshteinT,Guo2020JointlyMS}。
\vspace{0.5em}
\end{itemize}
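\parinterval 对于上面提到的“反向目标语言句子”的实现方式,可以用如下示意性的Python代码说明(仅为示意,假设句子已经完成分词):
\begin{verbatim}
# 示意:通过反转目标语言端来训练“自右向左”的翻译模型
def reverse_target(parallel_data):
    # 训练时:将每个目标语言句子逐词反向,模型结构与训练过程保持不变
    return [(src, list(reversed(tgt))) for src, tgt in parallel_data]

def restore_output(reversed_hypothesis):
    # 推断时:将生成的目标语言词串再次反向,得到最终译文
    return list(reversed(reversed_hypothesis))
\end{verbatim}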
\parinterval 不论是自左向右还是自右向左推断,本质上都是在对上下文信息进行建模。除了自左向右和自右向左的推断策略,研究人员也提出了许多新的译文生成策略,比如,从中部向外生成\upcite{DBLP:conf/nips/MehriS18}、按源语言顺序生成\upcite{Stahlberg2018AnOS}、基于插入的方式生成\upcite{Stern2019InsertionTF,stling2017NeuralMT}等。或者将翻译问题松弛化为一个连续空间模型的优化问题,进而在推断的过程中同时使用译文串左右两端的信息\upcite{Geng2018AdaptiveMD}。
\parinterval 最近,以BERT 为代表的预训练语言模型已经证明,一个单词的“历史” 和“未来” 信息对于生成当前单词都是有帮助的\upcite{devlin2019bert}。类似的观点也在神经机器翻译编码器设计中得到验证。比如,在基于循环神经网络的模型中,经常同时使用自左向右和自右向左的方式对源语言句子进行编码。还有,Transformer 编码器会使用整个句子的信息对每一个源语言位置进行表示。因此,在神经机器翻译的解码端采用类似的策略是有其合理性的。
...@@ -176,15 +176,15 @@ a &=& \omega_{\textrm{low}}\cdot |\seq{x}| \label{eq:14-3}\\
b &=& \omega_{\textrm{high}}\cdot |\seq{x}| \label{eq:14-4}
\end{eqnarray}
\vspace{0.5em}
\noindent 其中,$\omega_{\textrm{low}}$和$\omega_{\textrm{high}}$分别表示译文长度的下限和上限,比如,很多系统中设置为$\omega_{\textrm{low}}=1/2$,$\omega_{\textrm{high}}=2$,表示译文至少有源语言句子一半长,最多有源语言句子两倍长。$\omega_{\textrm{low}}$和$\omega_{\textrm{high}}$的设置对推断效率影响很大,$\omega_{\textrm{high}}$可以被看作是一个推断的终止条件,最理想的情况是$\omega_{\textrm{high}} \cdot |\seq{x}|$恰巧就等于最佳译文的长度,这时没有浪费任何计算资源。反过来的一种情况,$\omega_{\textrm{high}} \cdot |\seq{x}|$远大于最佳译文的长度,这时很多计算都是无用的。为了找到长度预测的准确率和召回率之间的平衡,一般需要大量的实验最终确定$\omega_{\textrm{low}}$和$\omega_{\textrm{high}}$。当然,利用统计模型预测$\omega_{\textrm{low}}$和$\omega_{\textrm{high}}$也是非常值得探索的方向,比如基于繁衍率的模型\upcite{Gu2017NonAutoregressiveNM,Feng2016ImprovingAM}。
\vspace{0.5em}
\item 覆盖度模型。译文长度过长或过短的问题,本质上对应着 {\small\sffamily\bfseries{过翻译}}\index{过翻译}(Over Translation)\index{Over Translation}和{\small\sffamily\bfseries{欠翻译}}\index{欠翻译}(Under Translation)\index{Under Translation}的问题\upcite{Yang2018OtemUtemOA}。这两种问题出现的原因主要在于:神经机器翻译没有对过翻译和欠翻译建模,即机器翻译覆盖度问题\upcite{TuModeling}。针对此问题,最常用的方法是在推断的过程中引入一个度量覆盖度的模型。比如,使用GNMT 覆盖度模型定义模型得分\upcite{Wu2016GooglesNM},如公式\eqref{eq:14-5}和公式\eqref{eq:14-6}所示:
\begin{eqnarray}
\textrm{score}(\seq{x},\seq{y}) &=& \frac{\log \funp{P}(\seq{y} | \seq{x})}{\textrm{lp}(\seq{y})} + \textrm{cp}(\seq{x},\seq{y}) \label {eq:14-5}\\
\textrm{cp}(\seq{x},\seq{y}) &=& \beta \cdot \sum_{i=1}^{|\seq{x}|} \log(\textrm{min} (\sum_{j}^{|\seq{y}|} a_{ij} , 1))
\label{eq:14-6}
\end{eqnarray}
\noindent 其中,$\textrm{cp}(\seq{x},\seq{y}) $表示覆盖度模型,它度量了译文对源语言每个单词的覆盖程度。在$\textrm{cp}(\seq{x},\seq{y}) $的定义中,$\beta$是一个需要自行设置的超参数,$a_{ij}$表示源语言第$i$个位置与目标语言第$j$个位置的注意力权重,这样$\sum \limits_{j}^{|\seq{y}|} a_{ij}$就可以用来衡量源语言第$i$个单词被翻译了“多少”,如果它大于1,表明翻译多了;如果小于1,表明翻译少了。公式\eqref{eq:14-6}会惩罚那些欠翻译的翻译假设。覆盖度模型的一种改进形式是\upcite{li-etal-2018-simple}:
\begin{eqnarray}
\textrm{cp}(\seq{x},\seq{y}) &=& \sum_{i=1}^{|\seq{x}|} \log( \textrm{max} ( \sum_{j}^{|\seq{y}|} a_{ij},\beta))
...@@ -200,11 +200,11 @@ b &=& \omega_{\textrm{high}}\cdot |\seq{x}| \label{eq:14-4}
\subsection{搜索终止条件}
\parinterval 在机器翻译推断中,何时终止搜索是一个非常基础的问题。如{\chaptertwo}所述,系统研发者一方面希望尽可能遍历更大的搜索空间,找到更好的结果,另一方面也希望在尽可能短的时间内得到结果。这时搜索的终止条件就是一个非常关键的指标。在束搜索中有很多终止条件可以使用,比如,在生成一定数量的译文之后就终止搜索,或者当最佳译文与排名第二的译文之间的分数差距超过一个阈值时就终止搜索等。
\parinterval 在统计机器翻译中,搜索的终止条件相对容易设计。因为所有的翻译结果都可以用相同步骤的搜索过程生成,比如,在CYK解码中搜索的步骤仅与构建的分析表大小有关。在神经机器翻译中,这个问题要更加复杂。当系统找到一个完整的译文之后,可能还有很多译文没有被生成完,这时就面临着一个问题\ \dash \ 如何决定是否继续搜索。
\parinterval 针对这些问题,研究人员设计了很多新的方法。比如,可以在束搜索中使用启发性信息让搜索尽可能早停止,同时保证搜索结果是“最优的”\upcite{DBLP:conf/emnlp/HuangZM17}。也可以将束搜索建模为优化问题\upcite{Wiseman2016SequencetoSequenceLA,DBLP:conf/emnlp/Yang0M18},进而设计出新的终止条件\upcite{Ma2019LearningTS}。很多开源机器翻译系统也都使用了简单有效的终止条件,比如,在OpenNMT 系统中当搜索束中当前最好的假设生成了完整的译文搜索就会停止\upcite{KleinOpenNMT},在RNNSearch系统中当找到预设数量的译文时搜索就会停止,同时在这个过程中会不断减小搜索束的大小\upcite{bahdanau2014neural}。
\parinterval 实际上,设计搜索终止条件反映了搜索时延和搜索精度之间的一种折中\upcite{Eisner2011LearningST,Jiang2012LearnedPF}。在很多应用中,这个问题会非常关键。比如,在同声传译中,对于输入的长文本,何时开始翻译、何时结束翻译都是十分重要的\upcite{Zheng2020OpportunisticDW,Ma2019STACLST}。在很多线上翻译应用中,翻译结果的响应不能超过一定的时间,这时就需要一种{\small\sffamily\bfseries{时间受限的搜索}}\index{时间受限的搜索}(Time-constrained Search)\index{Time-constrained Search}策略\upcite{DBLP:conf/emnlp/StahlbergHSB17}。
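\parinterval 以上面提到的两个简单终止条件为例,束搜索中的终止判断可以写成如下示意性的Python代码(仅为示意,其中的译文数量和分数阈值均为需要调节的超参数):
\begin{verbatim}
# 示意:束搜索的两个简单终止条件
def should_stop(finished_scores, num_target, margin):
    # 条件1:已经生成了一定数量的完整译文
    if len(finished_scores) >= num_target:
        return True
    # 条件2:最佳译文与排名第二的译文之间的分数差距超过阈值
    if len(finished_scores) >= 2:
        best, second = sorted(finished_scores, reverse=True)[:2]
        if best - second > margin:
            return True
    return False
\end{verbatim}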
...@@ -214,28 +214,29 @@ b &=& \omega_{\textrm{high}}\cdot |\seq{x}| \label{eq:14-4}
\subsection{译文多样性}
\parinterval 机器翻译系统的输出并不仅限于单个译文。很多情况下,需要多个译文。比如,译文重排序中通常就需要系统的$n$-best输出,而在交互式机器翻译中也往往需要提供多个译文供用户选择\upcite{Peris2017InteractiveNM,Peris2018ActiveLF}。但是,无论是统计机器翻译还是神经机器翻译,都面临一个同样的问题:$n$-best输出中的译文十分相似。实例\ref{eg:14-1}就展示了一个神经机器翻译输出的多个翻译结果,可以看到这些译文的区别很小。这个问题也被看做是机器翻译缺乏{\small\sffamily\bfseries{译文多样性}}\index{译文多样性}(Translation Diversity)\index{Translation Diversity}的问题\upcite{Gimpel2013ASE,Li2016MutualIA,DBLP:conf/emnlp/DuanLXZ09,DBLP:conf/acl/XiaoZZW10,xiao2013bagging}。
\begin{example}
源语言句子:我们期待安理会尽早就此作出决定。
\qquad\ 机器译文\ \,1\ :We look forward to the Security Council making a decision on this
\hspace{8.3em}as soon as possible.
\qquad\ 机器译文\ \,2\ :We look forward to the Security Council making a decision on this
\hspace{8.3em}issue as soon as possible.
\qquad\ 机器译文\ \,3\ :We hope that the Security Council will make a decision on this
\hspace{8.4em}issue as soon as possible.
\label{eg:14-1}
\end{example}
\parinterval 机器翻译输出缺乏多样性会带来很多问题。一个直接的问题是在重排序时很难选择到更好的译文,因为所有候选都没有太大的差别。此外,当需要利用$n$-best输出来表示翻译假设空间时,缺乏多样性的译文也会使得翻译后验概率的估计不够准确,造成建模的偏差。在一些模型训练方法中,这种后验概率估计的偏差也会造成较大的影响\upcite{DBLP:conf/acl/ShenCHHWSL16}。从人工翻译的角度,同一个源语言句子的译文应该是多样的,因此过于相似的译文也无法反映足够多的翻译现象。
\parinterval 因此增加译文多样性成为了机器翻译中一个有价值的研究方向。在统计机器翻译中就有很多尝试\upcite{DBLP:conf/emnlp/DuanLXZ09,DBLP:conf/acl/XiaoZZW10,xiao2013bagging}。主要思路是通过加入一些“扰动”让翻译模型的行为发生变化,进而得到区别更大的译文。类似的方法也同样适用于神经机器翻译。例如,可以在推断过程中加入额外的模型,用于惩罚出现相似译文的情况\upcite{Li2016ADO,Li2016MutualIA}。也可以在翻译模型中引入新的隐含变量或者加入新的干扰,进而控制多样性译文的输出\upcite{He2018SequenceTS,Shen2019MixtureMF,Wu2020GeneratingDT}。类似地,也可以利用模型中局部结构的多样性来生成多样的译文\upcite{Sun2020GeneratingDT}。除了考虑每个译文之间的多样性,也可以对译文进行分组,之后增加不同组之间的多样性\upcite{Vijayakumar2016DiverseBS}。
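\parinterval 作为一个简单的例子,对$n$-best译文进行“多样性重排序”的思路可以用如下示意性的Python代码表示(仅为示意:这里用词重叠率近似译文相似度,惩罚权重 lam 为假设的超参数,并不对应上述文献中的具体方法):
\begin{verbatim}
# 示意:对 n-best 译文重打分,惩罚与已选译文过于相似的候选
def diverse_rerank(nbest, lam=0.5):
    # nbest 为 (译文词串, 模型得分) 的列表
    def similarity(a, b):
        return len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)
    selected, candidates = [], list(nbest)
    while candidates:
        # 综合得分 = 模型得分 - lam * 与已选译文的最大相似度
        best = max(candidates,
                   key=lambda c: c[1] - lam * max((similarity(c[0], s[0])
                                                   for s in selected), default=0.0))
        selected.append(best)
        candidates.remove(best)
    return selected
\end{verbatim}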
%----------------------------------------------------------------------------------------
% NEW SUBSUB-SECTION
...@@ -245,7 +246,7 @@ b &=& \omega_{\textrm{high}}\cdot |\seq{x}| \label{eq:14-4}
\parinterval 机器翻译的错误分为两类:搜索错误和模型错误。搜索错误是指由于搜索算法的限制,即使潜在的搜索空间中有更好的解,模型也无法找到。比较典型的例子是,在对搜索进行剪枝的时候,如果剪枝过多,找到的结果很有可能不是最优的。这时就出现了搜索错误。
\parinterval 在统计机器翻译中,搜索错误可以通过减少剪枝进行缓解。比较简单的方式是增加搜索束宽度,这往往会带来一定的性能提升\upcite{Xiao2016ALA}。也可以对搜索问题进行单独建模,以保证学习到的模型出现更少的搜索错误\upcite{Liu2014SearchAwareTF,Yu2013MaxViolationPA}。但是,在神经机器翻译中,这个问题却表现出不同的现象:在很多神经机器翻译系统中,随着搜索束的增大,系统的BLEU不升反降。图\ref{fig:14-3}展示了BLEU随束大小的变化曲线\footnote{为了使该图更加规整直观,横坐标处将束大小进行了取对数操作。}。这个现象与传统的常识是相违背的,因此也有一些研究尝试解释这个现象\upcite{Stahlberg2019OnNS,Niehues2017AnalyzingNM}。在实验中,研究人员也发现增加搜索束的大小会导致翻译生成的结果变得更短。他们将这个现象归因于:增加搜索束的大小,会导致更多的模型错误,因为神经机器翻译的建模是基于局部归一的最大似然估计\upcite{Sountsov2016LengthBI,Murray2018CorrectingLB,StahlbergNeural}。此外,也有研究人员把这种翻译过短的现象归因于搜索错误\upcite{Stahlberg2019OnNS}。由于搜索时所面临的搜索空间是十分巨大的,因此搜索时可能无法找到模型定义的“最好”的译文。在某种意义上,这也体现了一种训练和推断不一致的问题。
%---------------------------------------------------------------------- %----------------------------------------------------------------------
\begin{figure}[htp] \begin{figure}[htp]
...@@ -256,7 +257,7 @@ b &=& \omega_{\textrm{high}}\cdot |\seq{x}| \label{eq:14-4} ...@@ -256,7 +257,7 @@ b &=& \omega_{\textrm{high}}\cdot |\seq{x}| \label{eq:14-4}
\end{figure} \end{figure}
%---------------------------------------------------------------------- %----------------------------------------------------------------------
\parinterval 一种解决问题的思路是从训练和推断的行为和目标不一致的角度切入。比如,为了解决{\small\sffamily\bfseries{曝光偏置}}\index{曝光偏置}(Exposure Bias)\index{Exposure Bias}问题\upcite{Ranzato2016SequenceLT},可以让系统使用前面步骤的预测结果作为预测下一个词所需要的历史信息,而不是依赖于标准答案\upcite{Bengio2015ScheduledSF,Zhang2019BridgingTG}另一方面,为了解决训练和推断目标不一致的问题,可以在训练的时候模拟推断的行为,同时让模型训练的目标与评价系统的标准尽可能一致\upcite{DBLP:conf/acl/ShenCHHWSL16} \parinterval 一种解决问题的思路是从训练和推断的行为和目标不一致的角度切入。比如,为了解决{\small\sffamily\bfseries{曝光偏置}}\index{曝光偏置}(Exposure Bias)\index{Exposure Bias}问题\upcite{Ranzato2016SequenceLT},可以让系统使用前面步骤的预测结果作为预测下一个词所需要的历史信息,而不是依赖于标准答案\upcite{Bengio2015ScheduledSF,Zhang2019BridgingTG}此外,为了解决训练和推断目标不一致的问题,可以在训练的时候模拟推断的行为,同时让模型训练的目标与评价系统的标准尽可能一致\upcite{DBLP:conf/acl/ShenCHHWSL16}
\parinterval 需要注意的是,前面提到的搜索束变大造成的翻译品质下降的问题还有其它解决方法。比如,可以通过对结果重排序来缓解这个问题\upcite{DBLP:conf/emnlp/Yang0M18},也可以通过设计更好的覆盖度模型来生成长度更加合理的译文\upcite{li-etal-2018-simple}。从这个角度说,上述问题的成因也较为复杂,因此需要同时考虑模型错误和搜索错误。

\parinterval 神经机器翻译需要对输入和输出的单词进行分布式表示,比如,每一个单词都用一个512 维向量进行表示。但是,由于真实的词表通常很大,因此计算并保存这些单词的向量表示会消耗较多的计算和存储资源。特别是对于基于Softmax 的输出层,使用大词表往往会占用较多的系统运算时间。虽然可以通过BPE 和限制词汇表规模的方法降低输出层计算的负担\upcite{DBLP:conf/acl/SennrichHB16a},但是为了获得可接受的翻译品质,词汇表也不能过小,因此输出层的计算仍然十分耗时。

\parinterval 通过改变输出层的网络结构,可以一定程度上缓解这个问题\upcite{DBLP:conf/acl/JeanCMB15}。一种比较简单的方法是对可能输出的单词进行筛选,简称词汇选择。这里,可以利用类似于统计机器翻译的翻译表,获得每个源语言单词最可能的译文。在翻译过程中,利用注意力机制找到每个目标语言位置对应的源语言位置,之后获得这些源语言单词最可能的翻译候选。之后,Softmax 只需要在这个有限的翻译候选单词集合上进行计算,大大降低了输出层的计算量。尤其对于CPU 上的系统,这个方法往往会带来明显的速度提升,同时保证翻译品质。图\ref{fig:14-4}给出了词汇选择方法的示意图。
%----------------------------------------------
\begin{figure}[htp]

\subsection{消除冗余计算}

\parinterval 消除不必要的计算是加速机器翻译系统的另一种方法。比如,在统计机器翻译时代,假设重组就是一种典型的避免冗余计算的手段(见\chapterseven)。对于神经机器翻译中的Transformer 模型,消除冗余计算的一种简单有效的方法是对解码端的注意力结果进行缓存。在生成每个目标语译文时,Transformer 模型会对当前位置之前的所有位置进行自注意力操作,但是这些计算里只有和当前位置相关的计算是“新” 的,前面位置之间的注意力结果已经在之前的解码步骤里计算过,因此可以对其进行缓存。
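\parinterval 下面给出解码端自注意力缓存的一个示意代码。这只是一个基于NumPy的简化草图,类名与接口均为示例性假设,并非某个真实系统的实现:对已生成位置的键和值进行缓存后,每个解码步骤只需为当前位置新增一次键/值计算。

{\small
\begin{verbatim}
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CachedSelfAttention:
    # 缓存已生成位置的键/值,每步只为当前位置计算一次K和V
    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.wq = rng.normal(size=(d_model, d_model))
        self.wk = rng.normal(size=(d_model, d_model))
        self.wv = rng.normal(size=(d_model, d_model))
        self.k_cache, self.v_cache = [], []

    def step(self, h):
        # h:当前位置的隐层表示,形状为(d_model,)
        q = h @ self.wq
        self.k_cache.append(h @ self.wk)
        self.v_cache.append(h @ self.wv)
        K, V = np.stack(self.k_cache), np.stack(self.v_cache)
        score = softmax(K @ q / np.sqrt(len(q)))
        return score @ V      # 当前位置的注意力输出

attn = CachedSelfAttention(d_model=8)
for t in range(5):            # 模拟自回归解码的5个步骤
    out = attn.step(np.random.rand(8))
\end{verbatim}
}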
\parinterval 此外,由于Transformer 模型较为复杂,还存在很多冗余。比如,Transformer 的每一层会包含自注意力机制、层正则化、残差连接、前馈神经网络等多种不同的结构。同时,不同结构之间还会包含一些线性变换。多层Transformer(通常为6 层)模型会更加复杂。但是,这些层可能在做相似的事情,甚至有些计算根本就是重复的。图\ref{fig:14-5}中展示了解码端自注意力和编码-解码注意力中不同层的注意力权重的相似性,这里的相似性利用Jensen-Shannon散度进行度量\upcite{61115}。可以看到,自注意力中,2-5层之间的注意力权重的分布非常相似。编码-解码注意力也有类似的现象,临近的层之间有非常相似的注意力权重。这个现象说明:在多层神经网络中有些计算是冗余的,因此很自然的想法是消除这些冗余使得机器翻译变得更“轻”。

\label{fig:14-5}
\end{figure}
%----------------------------------------------

\parinterval 一种方法是将不同层的注意力权重进行共享,这样上层的注意力权重可以复用下层的注意力权重\upcite{Xiao2019SharingAW}。在编码-解码注意力中,由于注意力机制中输入的Value 都是一样的\footnote{在Transformer解码端,编码解码注意力输入的Value是编码端的输出,因此是相同的(见\chaptertwelve)},甚至可以直接复用前一层注意力计算的结果。图\ref{fig:14-6}给出了不同方法的对比,其中$S$表示注意力权重,$A$表示注意模型的输出。可以看到,使用共享的思想,可以大大减少冗余的计算。
%----------------------------------------------
\begin{figure}[htp]

\subsection{轻量解码端及小模型}

\parinterval 在推断时,神经机器翻译的解码端是最耗时的,因为每个目标语位置需要单独输出单词的分布,同时在搜索过程中每一个翻译假设都要被扩展成多个翻译假设,进一步增加了计算量。因此,另一种加速系统的思路是使用更加轻量的解码器\upcite{DBLP:journals/corr/HintonVD15,Munim2019SequencelevelKD}。

\parinterval 比较简单的做法是把解码端的网络变得更“浅”、更“窄”。所谓浅网络是指使用更少的层构建神经网络,比如,使用3 层,甚至1 层网络的Transformer 解码器。所谓窄网络是指将网络中某些层中神经元的数量减少。不过,直接训练这样的小模型会带来翻译品质的下降。这时会考虑使用知识蒸馏(也称作知识精炼)等技术来提升小模型的品质。

\parinterval 另一种思路是化简Transformer 解码端的神经网络。比如,可以使用平均注意力机制代替原始Transformer 中的自注意力机制\upcite{DBLP:journals/corr/abs-1805-00631},也可以使用运算更轻的卷积操作代替注意力模块\upcite{Wu2019PayLA}。前面提到的基于共享注意力机制的模型也是一种典型的轻量模型\upcite{Xiao2019SharingAW}。这些方法本质上也是对注意力模型的结构的优化,这类思想在近几年也受到了很多关注 \upcite{Kitaev2020ReformerTE,Katharopoulos2020TransformersAR,DBLP:journals/corr/abs-2006-04768}。

\parinterval 此外,使用异构神经网络也是一种平衡精度和速度的有效方法。在很多研究中发现,基于Transformer 的编码器对翻译品质的影响更大,而解码端的作用会小一些。因此,一种想法是使用更快速的解码端结构,比如,用基于循环神经网络的解码端代替基于Transformer 的解码端\upcite{Chen2018TheBO}。这样,既能发挥Transformer 在编码上的优势,同时也能利用循环神经网络在解码端速度上的优势。使用类似的思想,也可以用卷积神经网络等结构进行解码端的设计。

\parinterval 针对轻量级Transformer模型的设计也包括层级的结构剪枝,这类方法试图通过跳过某些操作或者某些层来降低计算量。典型的相关工作是样本自适应网络结构,如 FastBERT\upcite{Liu2020FastBERTAS}、Depth Adaptive Transformer\upcite{Elbayad2020DepthAdaptiveT} 和LayerDrop\upcite{DBLP:conf/iclr/FanGJ20}等,与传统的Transformer的解码过程不同,这类网络结构在推断时不需要计算全部的解码层,而是根据输入自动选择模型的部分层进行计算,达到加速和减少参数量的目的。
%----------------------------------------------------------------------------------------
% NEW SUBSUB-SECTION

\vspace{0.5em}
\item 半精度浮点运算。半精度浮点运算是随着近几年GPU 技术发展而逐渐流行的一种运算方式。简单来说,半精度的表示要比单精度需要更少的存储单元,所表示的浮点数范围也相应的变小。不过,实践中已经证明神经机器翻译中的许多运算用半精度计算就可以满足对精度的要求。因此,直接使用半精度运算可以大大加速系统的训练和推断进程,同时对翻译品质的影响很小。不过,需要注意的是,在分布式训练的时候,由于参数服务器需要对多个计算节点上的梯度进行累加,因此保存参数的部分仍然会使用单精度浮点以保证多次累加之后不会造成精度过大的损失。
\vspace{0.5em}
\item 整型运算。整型运算是一种比浮点运算“轻” 很多的运算。无论是芯片占用面积、能耗还是处理单次运算的时钟周期数,整型运算相比浮点运算都有着明显的优势。因此,使用整型运算也是很有潜力的加速手段。不过,整数的表示和浮点数有着很大的不同。一个基本的问题是,整数是不连续的,因此无法准确的刻画浮点数中很小的小数。对于这个问题,一种解决方法是利用“量化+ 反量化+ 缩放” 的策略让整型运算达到近似浮点运算的效果\upcite{DBLP:journals/corr/abs-1906-00532,DBLP:conf/cvpr/JacobKCZTHAK18,DBLP:journals/corr/abs-1910-10485}。所谓“量化” 就是把一个浮点数离散化为一个整数,“反量化” 是这个过程的逆过程。由于浮点数可能超出整数的范围,因此会引入一个缩放因子。在量化前将浮点数缩放到整数可以表示的范围,反量化前再缩放回原始浮点数的表示范围(本列表之后给出了一个简单的示意代码)。这种方法在理论上可以带来很好的加速效果。不过由于量化和反量化的操作本身也有时间消耗,而且在不同处理器上的表现差异较大。因此不同的实现方式带来的加速效果并不相同,需要通过实验测算。
\vspace{0.5em}
\item 低精度整型运算。使用更低精度的整型运算是进一步加速的手段之一。比如使用16 位整数、8 位整数,甚至4 位整数在理论上都会带来速度的提升,如表\ref{tab:14-3}所示。不过,并不是所有处理器都支持低精度整型的运算。开发这样的系统,一般需要硬件和特殊低精度整型计算库的支持。而且相关计算大多是在CPU 上实现,应用会受到一定的限制。
\vspace{0.5em}
\end{itemize}
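\parinterval 针对上述“量化+ 反量化+ 缩放”的策略,下面给出一个示意性的代码草图。这里仅用NumPy演示其基本思想,其中的函数名与位宽选择均为假设,并不对应任何真实的低精度计算库。

{\small
\begin{verbatim}
import numpy as np

def quantize(x, num_bits=8):
    # 缩放因子:把浮点数的最大绝对值映射到整型可表示的最大值
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # 反量化:乘以缩放因子,近似恢复原始的浮点数
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)   # 某个权重矩阵
x = np.random.randn(4).astype(np.float32)      # 某个输入向量
qw, sw = quantize(w)
qx, sx = quantize(x)
# 整型矩阵乘法(实际系统中由低精度计算库完成),结果再统一缩放回浮点数
y_int = qw.astype(np.int32) @ qx.astype(np.int32)
y_approx = y_int * sw * sx
print(np.abs(dequantize(qw, sw) - w).max())    # 权重的量化误差
print(np.abs(y_approx - w @ x).max())          # 整型乘法近似浮点乘法的误差
\end{verbatim}
}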
\section{非自回归翻译}

\parinterval 目前大多数神经机器翻译模型都使用了编码器-解码器框架来实现,编码器的输出会被送入到解码器,解码器自左向右逐词生成目标语言句子,也就是,第$j$个目标语言单词的生成依赖于先前生成的$j-1$个词。这种翻译方式也被称作{\small\sffamily\bfseries{自回归解码}}\index{自回归解码}(Autoregressive Decoding)\index{Autoregressive Decoding}。虽然以Transformer为代表的模型使得训练过程高度并行化,加快了训练速度。但由于推断过程自回归的特性,模型无法同时生成译文中的所有单词,这导致模型的推断过程非常缓慢,这对于神经机器翻译的实际应用是个很大的挑战。因此,如何设计一个在训练和推断阶段都能够并行化的模型是目前研究的热点之一。

%----------------------------------------------------------------------------------------
% NEW SUBSUB-SECTION

\subsection{自回归VS非自回归}

\parinterval 目前主流的神经机器翻译的推断是一种{\small\sffamily\bfseries{自回归翻译}}\index{自回归翻译}(Autoregressive Translation)\index{Autoregressive Translation}过程。所谓自回归是一种描述时间序列生成的方式。对于目标序列$\seq{y}=\{y_1,\dots,y_n\}$,自回归模型假设$j$时刻状态$y_j$的生成依赖于之前的状态$\{y_1,\dots,y_{j-1}\}$,而且$y_j$与$\{y_1,\dots,y_{j-1}\}$构成线性关系,那么生成$y_j$就是自回归的序列生成过程。神经机器翻译借用了这个概念,但是并不要求使用线性模型。对于输入的源语言序列$\seq{x}=\{x_1,\dots,x_m\}$,用自回归翻译模型生成译文序列$\seq{y}=\{y_1,\dots,y_n\}$的概率可以被定义为公式\eqref{eq:14-8}:
\begin{eqnarray}
\funp{P}(\seq{y}|\seq{x}) &=& \prod_{j=1}^n {\funp{P}(y_j|y_{<j},\seq{x})}
\label{eq:14-8}
\end{eqnarray}

\noindent 即译文单词$y_{j}$的生成依赖前面已经生成的单词序列$y_{<j}=\{y_1,\dots,y_{j-1}\}$和源语言序列$\{x_1,\dots,x_m\}$。这种自回归的翻译方式符合人们阅读和生成句子时的习惯。它在机器翻译等任务上也取得了较好的性能,特别是配合束搜索也能够有效寻找近似最优译文。但是,由于解码器的每个步骤必须顺序地而不是并行地运行,自回归翻译模型会阻碍不同译文单词生成的并行化。特别是在GPU 上,翻译的自回归性会大大降低计算的并行度,导致推断过程的效率比较低下,设备利用率低。
\parinterval 对于这个问题,研究人员也考虑移除翻译的自回归性,进行{\small\sffamily\bfseries{非自回归翻译}}\index{非自回归翻译}(Non-Autoregressive Translation,NAT)\index{Non-Autoregressive Translation}\upcite{Gu2017NonAutoregressiveNM}。一个简单的非自回归翻译模型将问题建模为公式\eqref{eq:14-9}:
\begin{eqnarray}
\funp{P}(\seq{y}|\seq{x}) &=& \prod_{j=1}^n {\funp{P}(y_j|\seq{x})}
\label{eq:14-9}
\end{eqnarray}
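\parinterval 为了更直观地体会公式\eqref{eq:14-8}与公式\eqref{eq:14-9}在解码方式上的差异,下面给出一个示意性的代码草图。其中predict\_next、predict\_len、predict\_all均为假设的模型接口,仅用于说明两种生成方式的控制流程:

{\small
\begin{verbatim}
# 自回归解码:每一步都依赖之前生成的词,必须串行执行
def autoregressive_decode(x, predict_next, max_len):
    y = []
    for j in range(max_len):
        y_j = predict_next(x, y)     # P(y_j | y_<j, x)
        if y_j == "<eos>":
            break
        y.append(y_j)
    return y

# 非自回归解码:先预测译文长度,各位置相互独立,可并行生成
def non_autoregressive_decode(x, predict_len, predict_all):
    n = predict_len(x)
    return [predict_all(x, j) for j in range(n)]   # P(y_j | x)
\end{verbatim}
}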
\subsection{非自回归翻译模型的结构}

\parinterval 在介绍非自回归模型的具体结构之前,先来看看如何实现一个简单的非自回归翻译模型。这里用标准的Transformer来举例。首先为了一次性能够生成所有的词,需要丢弃解码端对未来信息屏蔽的矩阵,从而去掉模型的自回归性。此外,还要考虑生成译文的长度。自回归模型每步的输入是上一步解码出的结果,当预测到终止符<eos>时序列的生成就自动停止了,然而非自回归模型却没有这样的特性,因此还需要一个长度预测器来预测出其长度,之后再用这个长度得到每个位置的表示,进而完成整个序列的生成。

\parinterval 图\ref{fig:14-12}就是一个最简单的非自回归翻译模型,它的推断过程可以一次性解码出整个目标语言序列。但是这样得到的模型所翻译出的句子质量很低。比如,在IWSLT英德等数据上的BLEU值只有个位数,而现在最好的自回归模型已经能够达到30左右的BLEU值。这是因为每个位置词的分布$\funp{P}(y_j)$只依赖于源语言句子$\seq{x}$,使得$\funp{P}(y_j)$的预测不准确。

%----------------------------------------------------------------------
\begin{figure}[htp]

\end{figure}
%----------------------------------------------------------------------

\parinterval 完全独立地对每个词建模,会出现什么问题呢?来看一个例子,将汉语“干得好!”翻译成英文,可以翻译成“Good job!”或者“Well done!”。假设生成这两种翻译的概率是相等的,即一半的概率是“Good job!”,另一半的概率是“Well done!”。由于非自回归模型的条件独立性假设,解码时第一个词“Good”和“Well”的概率是差不多大的,第二个词“job”和“done”的概率也是差不多大的,会使得模型生成出“Good done!”或者“Well job!”这样错误的翻译,如图\ref{fig:14-13}所示。这便是影响句子质量的关键问题,称之为{\small\sffamily\bfseries{多峰问题}}\index{多峰问题}(Multi-modality Problem)\index{Multi-modality Problem}\upcite{Gu2017NonAutoregressiveNM}。针对非自回归模型难以处理多峰问题进行改进是提升非自回归模型质量的关键。
%----------------------------------------------------------------------
\begin{figure}[htp]

\end{figure}
%----------------------------------------------------------------------

\parinterval 与自回归翻译模型类似,Transformer模型的编码器和解码器都完全由前馈神经网络和多头注意力模块组成。在解码开始之前,非自回归模型需要知道译文的长度,以便并行生成所有单词。更重要的是,非自回归模型需要一次性生成出所有的译文单词,因此不能像自回归模型那样用已生成的词作为第一个解码器层的输入。那么非自回归模型解码器的输入是什么呢?如果完全省略第一个解码器层的输入,或者仅使用位置嵌入,将会导致性能非常差。这里使用繁衍率来解决这个问题,繁衍率指的是对于每个源语言单词预测所对应的目标语言单词的个数(见\chaptersix)。翻译过程取决于繁衍率序列(图\ref{fig:14-14}中的数字1\ 1\ 2\ 0\ 1),最终译文长度则由所有源语言单词对应的繁衍率之和决定。这个繁衍率序列可以通过外部词对齐工具得到,从而来训练这个繁衍率预测器。但由于外部词对齐系统会出现错误,因此在模型收敛之后,需要在繁衍率预测器上加一个强化学习的损失来进行微调。
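\parinterval 繁衍率序列既决定了译文长度,也可以用来构造非自回归解码器的输入。下面给出一个简化的示意代码,其中源语言句子和繁衍率取值仅为假设的例子:

{\small
\begin{verbatim}
def build_decoder_input(src_tokens, fertilities):
    # 按照繁衍率把每个源语言单词复制相应的次数,
    # 得到的序列长度即为译文长度,作为非自回归解码器的输入
    dec_input = []
    for tok, f in zip(src_tokens, fertilities):
        dec_input.extend([tok] * f)
    return dec_input

src = ["我们", "期待", "安理会", "尽早", "作出", "决定"]
fert = [1, 2, 2, 2, 1, 1]         # 假设由繁衍率预测器给出
dec_input = build_decoder_input(src, fert)
print(dec_input, len(dec_input))  # 译文长度为繁衍率之和,即9
\end{verbatim}
}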
\parinterval 另外,在每个解码器层中还包括额外的位置注意力模块,该模块与Transformer模型的其它部分中使用的多头注意力机制相同,如公式\eqref{eq:14-10}:
\begin{eqnarray}
\textrm{Attention}(\mathbi{Q},\mathbi{K},\mathbi{V}) &=& \textrm{Softmax}(\frac{\mathbi{Q}{\mathbi{K}}^{T}}{\sqrt{d_k}})\cdot \mathbi{V}
\label{eq:14-10}
\end{eqnarray}
\subsubsection{3. 自回归模型打分}

\parinterval 通过采样不同的繁衍率序列,可以得到多个不同的翻译候选。之后,把这些不同的译文再交给自回归模型来评分,选择一个最好的结果作为最终的翻译。在这里,可以使用强制解码同时对多个译文进行打分,因此这个过程可以充分并行。通常,这种方法能够很有效地提升非自回归翻译模型的译文质量,并且保持较高的推断速度\upcite{Gu2017NonAutoregressiveNM,Wei2019ImitationLF,Guo2019NonAutoregressiveNM,Wang2019NonAutoregressiveMT,Ma2019FlowSeqNC}。但是,缺点是需要同时部署自回归和非自回归两套系统。

%----------------------------------------------------------------------------------------
% NEW SUBSECTION

\begin{itemize}
\vspace{0.5em}
\item 基于层级知识蒸馏的方法\upcite{Li2019HintBasedTF}。由于自回归模型和非自回归模型的结构相差不大,因此可以将翻译质量更高的自回归模型作为“教师”,通过给非自回归模型提供监督信号,使其逐块学习前者的分布。研究人员发现了两点非常有意思的现象:1)非自回归模型输出的重复单词的位置的隐藏状态非常相似。2)非自回归模型的注意力分布比自回归模型的分布更加分散。这两点发现启发了研究人员使用自回归模型中的隐层状态来指导非自回归模型学习。通过计算两个模型隐层状态的距离以及注意力矩阵的KL散度\footnote{KL散度即相对熵}作为额外的损失来帮助非自回归模型的训练过程。
\vspace{0.5em}
\item 基于模仿学习的方法\upcite{Wei2019ImitationLF}。这种观点认为非自回归模型可以从性能优越的自回归模型中学得知识。{\small\bfnew{模仿学习}}\index{模仿学习}(Imitation Learning\index{Imitation Learning})是强化学习中的一个概念,即从专家那里学习正确的行为,与监督学习很相似\upcite{Ho2016ModelFreeIL,Ho2016GenerativeAI,Duan2017OneShotIL}。与其不同的是,模仿学习不是照搬专家的行为,而是学习专家为什么要那样做。换句话说,学习的不是专家的镜像,而是一个专家的行为分布。这里,可以将自回归模型作为专家,非自回归模型学习不同时间步和不同层的解码状态,最后将模仿学习的损失与交叉熵损失加权求和后作为最终的优化目标。
\vspace{0.5em}
\item 基于正则化因子的方法\upcite{Wang2019NonAutoregressiveMT}。非自回归模型的翻译结果中存在着两种非常严重的错误:重复翻译和不完整的翻译。第一种问题是因为解码器隐层状态中相邻的两个位置过于相似,因此翻译出来的单词也一样。对于第二个问题,通常将其归咎于非自回归模型在翻译的过程中丢失了一些源语言句子的信息,从而造成了翻译效果的下降。针对这两个问题,可以通过在相邻隐层状态间添加相似度约束来计算一个重构损失。具体来说,对于目前正在进行的翻译$\seq{x}\to\seq{y}$,通过利用一个反向的自回归模型再将$\seq{y}$翻译成$\seq{x'}$,最后计算$\seq{x}$与$\seq{x'}$的差异性作为损失。
\vspace{0.5em}
\end{itemize}
\parinterval 非自回归翻译消除了序列生成中不同位置间的依赖,在每个位置都进行独立的预测,但这反过来会导致翻译质量显著下降,因为缺乏不同单词间依赖关系的建模。因此,也有研究聚焦于在非自回归模型中添加一些自回归组件来提升网络结构的表达能力。

\parinterval 一种做法是将语法作为目标语言句子的框架\upcite{Akoury2019SyntacticallyST}。具体来说,先自回归地预测出一个目标语言的句法块序列,将句法树作为序列信息的抽象,然后根据句法块序列非自回归地生成所有目标语言单词。如图\ref{fig:14-21}所示,该模型由一个编码器和两个解码器组成。其中编码器和第一个解码器与标准的Transformer模型相同,用来自回归地预测句法树信息;第二个解码器将第一个解码器的句法信息作为输入,之后再非自回归地生成整个译文。在训练过程中,通过使用外部句法分析器获得对句法预测任务的监督信号。虽然可以简单地让模型预测整个句法树,但是这种方法会显著增加自回归步骤的数量,从而增大时间开销。因此,为了维持句法信息与解码时间的平衡,这里预测一些由句法类型和子树大小组成的块标识符(如VP3)而不是整个句法树。

%----------------------------------------------
\begin{figure}[htp]

\end{figure}
%----------------------------------------------

\parinterval 另一种做法是半自回归地生成译文\upcite{Wang2018SemiAutoregressiveNM}。如图\ref{fig:14-20}所示,自回归模型从左到右依次生成译文,具有“最强”的自回归性;而非自回归模型完全独立地生成每个译文单词,具有“最弱”的自回归性;半自回归模型则是将整个译文分成$k$个块,在组内执行非自回归解码,在组间则执行自回归的解码,能够在每个时间步并行产生多个连续的单词。通过调整块的大小,半自回归模型可以灵活地调整到自回归模型(当$k$等于1)和非自回归模型(当$k$大于最大的译文长度)上来。

%----------------------------------------------
\begin{figure}[htp]

\parinterval 如果一次并行生成整个序列,往往会导致单词之间的关系很难捕捉,因此也限制了这类方法的能力。即使生成了错误的译文单词,这类方法也无法修改。针对这些问题,也可以使用迭代式的生成方式\upcite{Lee2018DeterministicNN,Ghazvininejad2019MaskPredictPD,Kasai2020NonAutoregressiveMT}。这种方法放弃了一次生成最终的译文句子,而是将解码出的文本再重新送给解码器,在每次迭代中来改进之前生成的译文单词,可以理解为句子级的自回归模型。这样做的好处在于,每次迭代的过程中可以利用已经生成的部分翻译结果,来指导其它部分的生成。

\parinterval 图\ref{fig:14-18}展示了这种方法的简单示例。它拥有一个编码器和$N$个解码器。编码器首先预测出译文的长度,然后将输入$\seq{x}$按照长度复制出$\seq{x'}$作为第一个解码器的输入,之后生成$\seq{y'}$作为第一轮迭代的输出。接下来再把$\seq{y'}$输入给第二个解码器输出$\seq{y''}$,以此类推。那么迭代到什么时候结束呢?一种简单的做法是提前制定好迭代次数,这种方法能够自主地对生成句子的质量和效率进行平衡。另一种称之为“自适应”的方法,具体是通过计算当前生成的句子与上一次生成结果之间的变化量来自动停止,例如,使用杰卡德相似系数作为变化量函数\footnote{杰卡德相似系数是衡量有限样本集之间的相似性与差异性的一种指标,杰卡德相似系数值越大,样本相似度越高。}。另外,需要说明的是,图\ref{fig:14-18}中是使用多个解码器的一种逻辑示意。真实的系统仅需要一个解码器,并运行多次,就达到了迭代精化的目的。
%----------------------------------------------
\begin{figure}[htp]

\end{figure}
%----------------------------------------------

\parinterval 除了使用上一个步骤的输出,当前解码器的输入还使用了添加噪声的正确目标语言句子,两种使用情况之间使用一个超参数控制\upcite{Lee2018DeterministicNN}。另外,对于译文长度的预测,本章使用编码端的输出单独训练了一个独立的长度预测模块,这种方法也推广到了目前大多数模型上。

\parinterval 另一种方法借鉴了BERT的思想\upcite{devlin2019bert},提出了一种新的解码方法:Mask-Predict\upcite{Ghazvininejad2019MaskPredictPD}。类似于BERT中的[CLS],该方法在源语言句子的最前面加上了一个特殊符号[LEN]作为输入,用来预测目标句的长度$n$。之后,将特殊符[Mask](与BERT中的[Mask]有相似的含义)复制$n$次作为解码器的输入,然后用非自回归的方式生成目标端所有的词。这样生成的翻译可能是比较差的,因此可以将第一次生成的这些词中不确定(即生成概率比较低)的一些词再“擦”掉,依据目标端剩余的单词以及源语言句子重新进行预测,不断迭代,直到满足停止条件为止。图\ref{fig:14-19}给出了一个示例。
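\parinterval 下面用一段简化的示意代码来说明Mask-Predict的迭代过程。其中predict\_len和predict\_tokens为假设的模型接口,分别对应长度预测和对所有位置的并行预测;每轮重新掩码的单词数量采用简单的线性衰减,仅作示例:

{\small
\begin{verbatim}
import numpy as np

def mask_predict(x, predict_len, predict_tokens, T=4, mask="[Mask]"):
    n = predict_len(x)                     # 由[LEN]位置预测译文长度
    y = [mask] * n
    prob = np.zeros(n)
    for t in range(T):
        tokens, p = predict_tokens(x, y)   # 并行预测所有被掩码的位置
        for j in range(n):
            if y[j] == mask:
                y[j], prob[j] = tokens[j], p[j]
        if t == T - 1:
            break
        k = int(n * (T - 1 - t) / T)       # 下一轮重新掩码的单词数
        for j in np.argsort(prob)[:k]:     # “擦”掉概率最低的k个词
            y[j] = mask
    return y
\end{verbatim}
}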
\begin{itemize}
\vspace{0.5em}
\item 假设生成。构建翻译假设集合是假设选择的第一步,也是最重要的一步。理想的情况下,这个集合应该尽可能包含更多高质量的翻译假设,这样后面有更大的几率选出更好的结果。不过,由于单个模型的性能是有上限的,因此无法期望这些翻译假设的品质超越单个模型的上限。研究人员更加关心的是翻译假设的多样性,因为已经证明多样的翻译假设非常有助于提升系统融合的性能\upcite{DBLP:journals/corr/LiMJ16,xiao2013bagging}。为了生成多样的翻译假设,通常有两种思路:1)使用不同的模型生成翻译假设;2)使用同一个模型的不同参数和设置生成翻译假设。图\ref{fig:14-8}展示了二者的区别。比如,可以使用基于RNN 的模型和Transformer 模型生成不同的翻译假设,之后都放入集合中;也可以只用Transformer 模型,但是用不同的模型参数构建多个系统,之后分别生成翻译假设。在神经机器翻译中,经常采用的是第二种方式,因为系统开发的成本更低。比如,很多研究工作都是基于一个基础模型,用不同的初始参数、不同层数、不同解码方式生成多个模型进行翻译假设生成。
%----------------------------------------------
\begin{figure}[htp]

%----------------------------------------------
\vspace{0.5em}
\item 选择模型。所谓假设选择实际上就是要用一个更强的模型在候选中进行选择。这个“强” 模型一般是由更多、更复杂的子模型组合而成。常用的方法是直接使用翻译假设生成时的模型构建“强” 模型。比如,使用两个模型生成了翻译假设集合,之后对所有翻译假设都分别用这两个模型进行打分。最后,综合两个模型的打分(如线性插值)得到翻译假设的最终得分,并进行选择。当然,也可以使用更强大的统计模型对多个子模型进行组合(如使用多层神经网络)。
\vspace{0.5em}
\end{itemize}

\end{figure}
%----------------------------------------------

\parinterval 公式\eqref{eq:14-11}是一种典型的线性插值模型,这类模型在语言建模等任务中已经得到成功应用。从统计学习的角度,多个模型的插值可以有效地降低经验错误率。不过,多模型集成依赖一个假设:这些模型之间需要有一定的互补性。这种互补性有时也体现在多个模型预测的上限上,称为Oracle。比如,可以把这$K$个模型输出中BLEU最高的结果作为Oracle,也可以选择每个预测结果中使BLEU 达到最高的译文单词,这样构成的句子作为Oracle。当然,并不是说Oracle 提高,模型集成的结果一定会变好。因为Oracle 是最理想情况下的结果,而实际预测的结果与Oracle 往往有很大差异。如何使用Oracle 进行模型优化也是很多研究人员在探索的问题。
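\parinterval 下面给出公式\eqref{eq:14-11}所描述的线性插值的一个示意实现。这里仅用NumPy在一个长度为3的假想词表上演示:多个模型在同一位置的单词分布按照权重加权求和,得到集成后的分布。

{\small
\begin{verbatim}
import numpy as np

def ensemble_predict(distributions, weights=None):
    # distributions:K个模型在同一位置给出的单词分布,形状为(K, |V|)
    # weights:插值权重,默认取均匀权重
    distributions = np.asarray(distributions)
    k = distributions.shape[0]
    weights = np.full(k, 1.0 / k) if weights is None else np.asarray(weights)
    return weights @ distributions

p1 = np.array([0.7, 0.2, 0.1])     # 模型1的单词分布
p2 = np.array([0.5, 0.4, 0.1])     # 模型2的单词分布
print(ensemble_predict([p1, p2]))  # 插值后为[0.6, 0.3, 0.1]
\end{verbatim}
}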
\parinterval 此外,如何构建集成用的模型也是非常重要的,甚至说这部分工作会成为模型集成方法中最困难的部分\upcite{DBLP:conf/wmt/LiLXLLLWZXWFCLL19,Wang2018TencentNM,DBLP:conf/wmt/SennrichHB16}。为了增加模型的多样性,常用的方法有:

\vspace{0.5em}
\item 不同模型(局部)架构的调整,比如,使用不同的位置编码模型\upcite{Shaw2018SelfAttentionWR}、多层融合模型\upcite{WangLearning}等;
\vspace{0.5em}
\item 利用不同数量以及不同数据增强方式产生的伪数据训练模型\upcite{zhang-EtAl:2020:WMT};
\vspace{0.5em}
\item 利用多分支多通道的模型,不同分支可能有不同结构,使得模型能有更好的表示能力\upcite{zhang-EtAl:2020:WMT};
\vspace{0.5em}
\item 利用预训练进行参数共享之后微调的模型;
\vspace{0.5em}

\subsection{译文重组}

\parinterval 假设选择是直接从已经生成的译文中进行选择,因此无法产生“新” 的译文,也就是它的输出只能是某个单模型的输出。此外,预测融合需要同时使用多个模型进行推断,对计算和内存消耗较大。而且这两种方法有一个共性问题:搜索都是基于一个个字符串,相比指数级的译文空间,所看到的结果还是非常小的一部分。对于这个问题,一种方法是利用更加紧凑的数据结构对指数级的译文串进行表示。比如,可以使用{\small\sffamily\bfseries{格}}\index{格}(Lattice\index{Lattice})对多个译文串进行表示\upcite{DBLP:conf/emnlp/TrombleKOM08}。图\ref{fig:14-10}展示了基于$n$-best词串和基于Lattice 的表示方法的区别。可以看到,Lattice 中从起始状态到结束状态的每一条路径都表示一个译文,不同译文的不同部分可以通过Lattice 中的节点得到共享\footnote{本例中的Lattice 也是一个{\small\sffamily\bfseries{混淆网络}}\index{混淆网络}(Confusion Network\index{Confusion Network})。}。理论上,Lattice 可以把指数级数量的词串用线性复杂度的结构表示出来。
%----------------------------------------------------------------------
\begin{figure}[htp]

\vspace{0.5em}
\item 在对机器翻译推断系统进行实际部署时,对存储的消耗也是需要考虑的因素。因此如何让模型变得更小也是研发人员所关注的方向。当前的模型压缩方法主要可以分为几类:剪枝、量化、知识蒸馏和轻量方法,其中轻量方法主要是更轻量模型结构的设计,这类方法已经在上文进行了介绍。剪枝主要包括权重大小剪枝\upcite{Han2015LearningBW,Lee2019SNIPSN,Frankle2019TheLT,Brix2020SuccessfullyAT}、面向多头注意力的剪枝\upcite{Michel2019AreSH,DBLP:journals/corr/abs-1905-09418}、网络层以及其他部分的剪枝等\upcite{Liu2017LearningEC,Liu2019RethinkingTV},还有一些方法也通过在训练期间采用正则化的方式来提升剪枝能力\upcite{DBLP:conf/iclr/FanGJ20}。量化方法主要通过截断浮点数来减少模型的存储大小,使其仅使用几个比特位的数字表示方法便能存储整个模型,虽然会导致舍入误差,但压缩效果显著\upcite{DBLP:journals/corr/abs-1906-00532,Cheong2019transformersZ,Banner2018ScalableMF,Hubara2017QuantizedNN}。一些方法利用知识蒸馏手段还将Transformer模型蒸馏成如LSTMs 等其他各种推断速度更快的架构\upcite{DBLP:journals/corr/HintonVD15,Munim2019SequencelevelKD,Tang2019DistillingTK}。另外还有一些其他方法不仅在输出上,还在权重矩阵和隐藏的激活层上对“教师模型”知识进行更深入的挖掘\upcite{Jiao2020TinyBERTDB}。
\vspace{0.5em}
\item 目前的翻译模型使用交叉熵损失作为优化函数,这在自回归模型上取得了非常优秀的性能。交叉熵是一个严格的损失函数,预测时位置错误的单词都会受到惩罚,即使是编辑距离很小的输出序列。自回归模型会避免这种惩罚,因为单词是根据句子前一个词来生成的,而非自回归模型无法获知这个信息。为此,一些研究工作通过改进损失函数来提高非自回归模型的性能。一种做法是使用对齐交叉熵函数\upcite{Ghazvininejad2020AlignedCE},其基于标签序列和译文单词分布预测序列之间的对齐来计算交叉熵损失,采用动态规划的方法寻找单调对齐使交叉熵损失最小化。也可以使用基于$n$-gram的训练目标\upcite{Shao2020MinimizingTB},希望能最小化模型与参考译文间$n$-gram的差异。该训练目标在$n$-gram的层面上评估预测结果,因此能够建模序列依赖关系。
\vspace{0.5em}
\item 自回归模型预测目标句时,当前词的生成是以之前已生成的词作为条件的,已生成词提供了较强的目标端上下文信息。然而,非自回归模型并行地生成所有词,因此不存在这样的信息。与自回归模型相比,非自回归模型的解码器需要在信息更少的情况下执行翻译任务。因此很多做法通过给非自回归模型的解码器端引入更多的信息,来降低模型的搜索空间。一些研究工作通过将条件随机场引入非自回归模型中来对结构依赖进行建模\upcite{Ma2019FlowSeqNC}。也有工作引入了一个词嵌入转换矩阵来将源端的词嵌入转换为目标端的词嵌入来增强解码端的输入\upcite{Guo2019NonAutoregressiveNM}。此外,研究人员也提出了轻量级的重排序模块来显式地建模重排序信息,以指导非自回归模型的解码\upcite{Ran2019GuidingNN}。
\vspace{0.5em}
\end{itemize}
\node[anchor=north,circle,fill=red!20,minimum width=6.8em](node2) at ([xshift=-6.0em,yshift=-2.0em]remark1.south) {源语言句子$\seq{x}$};
\node[anchor=north,circle,fill=red!20,minimum width=6.8em](node2-2) at ([yshift=-0.2em]node2.south) {新生成句子$\seq{x'}$};
\draw [->,thick]([yshift=0.2em]node2.north).. controls (-1.93,-1.5) and (-2.0,-0.2)..([xshift=-0.2em]remark1.west);
\node[anchor=north,circle,fill=red!20](node3) at ([xshift=6.5em,yshift=-2.0em]remark1.south) {目标语言句子$\seq{y}$};
\draw [->,thick]([xshift=0.2em]remark1.east).. controls (2.9,-0.25) and (2.9,-0.7) ..([yshift=0.2em]node3.north);
\section{数据的有效使用}\label{effective-use-of-data}

\parinterval 数据稀缺是低资源机器翻译所面临的主要问题。充分使用既有数据是一种解决问题的思路。比如,在双语训练数据不充足的时候,可以简单地对双语数据的部分单词用近义词进行替换,达到丰富双语数据的目的\upcite{DBLP:conf/acl/FadaeeBM17a,DBLP:conf/emnlp/WangPDN18},也可以考虑用转述等方式生成更多的双语训练数据\upcite{DBLP:conf/emnlp/MartonCR09,DBLP:conf/eacl/LapataSM17}。

\parinterval 另一种思路是使用相比双语数据更容易获取的单语数据。实际上,在统计机器翻译时代,使用单语数据训练语言模型是构建机器翻译系统的关键步骤,好的语言模型往往会带来性能的增益。而这个现象在神经机器翻译中似乎并不明显,因为在大多数神经机器翻译的范式中,并不要求使用大规模单语数据来帮助机器翻译系统。甚至,连语言模型都不会作为一个独立的模块。这一方面是由于神经机器翻译系统的解码端本身就起着语言模型的作用,另一方面是由于双语数据的增多使得翻译模型可以更好地捕捉目标语言的规律。但是,双语数据总是有限的,很多场景下,单语数据的规模会远大于双语数据,如果能够让这些单语数据发挥作用,显然是一种非常好的选择。针对以上问题,下面将从数据增强、基于语言模型的单语数据使用等方面展开讨论。

%----------------------------------------------------------------------------------------
% NEW SUB-SECTION
%----------------------------------------------------------------------------------------

\subsection{数据增强}

\parinterval {\small\bfnew{数据增强}}\index{数据增强}(Data Augmentation)\index{Data Augmentation}是一种增加训练数据的方法,通常通过对既有数据进行修改或者生成新的伪数据等方式实现。有时候,数据增强也可以被看做是一种防止模型过拟合的手段\upcite{DBLP:journals/jbd/ShortenK19}。在机器翻译中,典型的数据增强方法包括回译、修改双语数据、双语句对挖掘等。

%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
%----------------------------------------------------------------------------------------

\subsubsection{1. 回译}

\parinterval {\small\bfnew{回译}}\index{回译}(Back Translation, BT\index{Back Translation})是目前机器翻译任务上最常用的一种数据增强方法\upcite{Sennrich2016ImprovingNM,DBLP:conf/emnlp/EdunovOAG18,DBLP:conf/aclnmt/HoangKHC18}。回译的主要思想是:利用目标语言-源语言翻译模型(反向翻译模型)来生成伪双语句对,用于训练源语言-目标语言翻译模型(正向翻译模型)。假设现在需要训练一个英汉翻译模型。首先,使用双语数据训练汉英翻译模型,即反向翻译模型。然后通过该模型将额外的汉语单语句子翻译为英语句子,从而得到大量的生成英语- 真实汉语伪双语句对。然后,将回译得到的伪双语句对和真实双语句对混合,训练得到最终的英汉翻译模型。

回译方法是模型无关的,只需要训练一个反向翻译模型,就可以简单有效地利用单语数据来增加训练数据的数量,因此得到了广泛使用\upcite{Hassan2018AchievingHP,DBLP:conf/iclr/LampleCDR18,DBLP:conf/emnlp/LampleOCDR18}。图\ref{fig:16-1} 给出了回译方法的一个简要流程。
%----------------------------------------------

\end{figure}
%----------------------------------------------

\parinterval 围绕如何利用回译方法生成伪双语数据这一问题,研究人员进行了详细地分析探讨。一般观点认为,反向翻译模型的性能越好,生成的伪数据质量也就越高,对正向翻译模型的性能提升也就越大\upcite{Sennrich2016ImprovingNM,DBLP:conf/aclnmt/HoangKHC18}。不过,在实践中发现,即使一些简单的策略也能带来性能的增长。比如,对于一些低资源翻译任务,通过将目标语言句子复制到源语言端构造伪数据便能带来增益\upcite{DBLP:conf/wmt/CurreyBH17}。原因在于,即使构造的双语伪数据是不准确的,其目标语言端仍然是真实数据,可以使解码器训练得更加充分,用来提升神经机器翻译模型生成结果的流畅度。但是,相比这些简单的伪数据生成策略,利用目标语言单语数据进行回译可以带来更高的提升\upcite{DBLP:conf/wmt/CurreyBH17}。一种可能的解释是,双语伪数据的源语言是模型生成的翻译结果,保留了两种语言之间的互译信息,相比真实数据又存在一定的噪声。神经机器翻译模型在伪双语句对上进行训练,可以学习到如何处理带有噪声的输入,提高了模型的健壮性。

\parinterval 在回译方法中,反向翻译模型的训练只依赖于有限的双语数据,因此生成的源语言端伪数据的质量难以保证。为此,可以采用{\small\sffamily\bfnew{迭代式回译}}\index{迭代式回译}(Iterative Back Translation)\index{Iterative Back Translation}的方法\upcite{DBLP:conf/aclnmt/HoangKHC18},同时利用源语言端和目标语言端的单语数据,不断通过回译的方式来提升正向和反向翻译模型的性能。图\ref{fig:16-2}展示了迭代式回译的框架。首先,使用双语数据训练一个正向翻译模型,然后利用额外的源语言单语数据通过回译的方式生成伪双语数据,来提升反向翻译模型的性能,再利用反向翻译模型和额外的目标语言单语数据生成伪双语数据,用于提升正向翻译模型的性能。可以看出,迭代式回译的过程是完全闭环的,因此可以一直重复进行,直到正向和反向翻译模型的性能均不再提升。

\end{figure}
%----------------------------------------------

\parinterval 更进一步,研究人员发现,在低资源场景中,由于缺乏双语数据,高质量的伪双语数据对于模型来说更有帮助。而在富资源场景中,在回译产生的源语言句子中添加一些噪声,提高翻译结果的多样性,反而可以达到更好的效果,比较常用的方法是使用采样解码、Top-$k$解码和加噪\upcite{DBLP:conf/emnlp/EdunovOAG18,DBLP:conf/aclnmt/ImamuraFS18,DBLP:conf/emnlp/WuWXQLL19}。回译中常用的解码方式为束搜索,在生成每个词的时候只考虑预测概率最高的前几个词,因此生成的翻译结果质量更高,但导致的问题是翻译结果主要集中在部分高频词上,生成的伪数据缺乏多样性,也就很难去准确地覆盖真实的数据分布\upcite{DBLP:conf/icml/OttAGR18}。采样解码是指在解码过程中,对词表中所有的词按照预测概率进行随机采样,因此整个词表中的词都有可能被选中,从而使生成结果多样性更强,但翻译质量和流畅度也会明显下降。Top-$k$解码是对束搜索和采样解码的一个折中方法。在解码过程中,Top-$k$解码对词表中预测概率最高的前$k$个词进行随机采样,这样在保证翻译结果准确的前提下,提高了结果的多样性。加噪方法在束搜索的解码结果加入一些噪声,如丢掉或掩码部分词、打乱句子顺序等。这些方法在生成的源语言句子中引入了噪声,不仅增加了对包含低频词或噪声句子的训练次数,同时也提高了模型的健壮性和泛化能力\upcite{DBLP:conf/icml/VincentLBM08}。
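\parinterval 下面给出Top-$k$采样解码的一个示意代码。这里只演示在单个位置的单词分布上如何完成一次Top-$k$采样,词表规模与概率取值均为假设的例子:

{\small
\begin{verbatim}
import numpy as np

def top_k_sample(prob, k=10, rng=None):
    # 在预测概率最高的前k个词中按归一化后的概率随机采样,
    # 以在保证译文质量的同时增加回译数据的多样性
    rng = np.random.default_rng() if rng is None else rng
    prob = np.asarray(prob, dtype=np.float64)
    top = np.argsort(prob)[::-1][:k]      # 概率最高的k个词的下标
    p = prob[top] / prob[top].sum()
    return int(rng.choice(top, p=p))

vocab_prob = [0.5, 0.2, 0.15, 0.1, 0.05]  # 某个位置上的单词分布
print(top_k_sample(vocab_prob, k=3))      # 只会在前3个词中采样
\end{verbatim}
}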
\parinterval 与回译方法类似,源语言单语数据也可以通过一个双语数据训练的正向翻译模型获得对应的目标语言数据,从而构造正向翻译的伪数据\upcite{DBLP:conf/emnlp/ZhangZ16}。与回译方法相反,这时的伪数据中源语言句子是真实的,而目标语言句子是自动生成的,构造的伪数据对译文的流畅性并没有太大帮助,其主要作用是提升编码器的特征提取能力。然而,由于伪数据中生成的译文质量很难保证,因此利用正向翻译模型生成伪数据的方法带来的性能提升效果要弱于回译,甚至可能是有害的\upcite{DBLP:conf/emnlp/WuWXQLL19}。

%----------------------------------------------------------------------------------------

\subsubsection{2. 修改双语数据}

\parinterval 回译方法是利用单语数据来生成伪数据,而另外一种数据增强技术是对原始双语数据进行修改来得到伪双语数据,常用的方法包括加噪和转述等。

\parinterval 加噪是自然语言处理任务中广泛使用的一种方法\upcite{DBLP:conf/icml/VincentLBM08,DBLP:journals/ipm/FarhanTAJATT20,DBLP:conf/iclr/LampleCDR18,devlin2019bert}。比如,在广泛使用的{\small\bfnew{降噪自编码器}}\index{降噪自编码器}(Denoising Autoencoder)\index{Denoising Autoencoder}中,向原始数据中加入噪声作为模型的输入,模型通过学习如何预测原始数据进行训练。而在神经机器翻译中,利用加噪方法进行数据增强的常用方法是,在保证句子整体语义不变的情况下,对原始的双语数据适当加入一些噪声,从而生成伪双语数据来增加训练数据的规模。常用的加噪方法主要有以下三种:

%----------------------------------------------

\end{figure}
%----------------------------------------------
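\parinterval 针对上面提到的加噪操作,下面给出一个示意性的实现草图。这里假设加噪包括随机丢词、随机掩码和局部打乱词序三种操作(具体操作以正文所列为准),代码仅用于说明思路,其中的概率取值均为假设:

{\small
\begin{verbatim}
import random

def add_noise(tokens, p_drop=0.1, p_mask=0.1, k_shuffle=3, seed=1):
    # 对源语言句子加噪:随机丢词、随机掩码、在k_shuffle范围内局部打乱词序
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p_drop:
            continue                  # 以p_drop的概率丢掉该词
        out.append("[Mask]" if r < p_drop + p_mask else tok)
    # 局部打乱:每个词的位置最多向后偏移k_shuffle
    keys = [i + rng.uniform(0, k_shuffle) for i in range(len(out))]
    return [tok for _, tok in sorted(zip(keys, out))]

print(add_noise(["我们", "期待", "安理会", "尽早", "作出", "决定"]))
\end{verbatim}
}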
\parinterval 和回译方法相似,加噪方法一般仅在源语言句子上进行操作,既保证了目标语言句子的流畅度,又可以提高训练数据量,增加数据的多样性,提高模型的健壮性和泛化能力\upcite{DBLP:conf/icml/VincentLBM08}。加噪作为一种简单有效的方法,实际的应用场景很多,比如:
%----------------------------------------------
\begin{itemize}
\vspace{0.5em}

%----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION
%----------------------------------------------------------------------------------------

\subsubsection{1. 语言模型在目标语言端的融合}

\parinterval 融合目标语言端的语言模型是一种最直接的使用单语数据的方法\upcite{2015OnGulcehre,DBLP:journals/csl/GulcehreFXCB17,DBLP:conf/wmt/StahlbergCS18}。实际上,神经机器翻译模型本身也具备了语言模型的作用,因为解码器本质上也是一个语言模型,用于描述生成译文词串的规律。类似于语言模型,神经机器翻译模型可以自回归地生成翻译结果。对于一个双语句对$(\seq{x}, \seq{y})$,神经机器翻译模型根据源语言句子$\seq{x}$和前面生成的词来预测当前位置词的概率分布:
\begin{eqnarray}
\log{P(\seq{y} | \seq{x}; \theta)} & = & \sum_{t}{\log{P(y_t | {\seq{y}}_{<t}, \seq{x}; \theta)}}
\label{eq:16-1}
\end{eqnarray}

\noindent 这里,$\theta$是神经机器翻译模型的参数,${\seq{y}}_{<t}$表示第$t$个位置前面已经生成的词序列。可以看出,模型的翻译过程与两部分信息有关,分别是源语言句子$\seq{x}$以及前面生成的翻译序列${\seq{y}}_{<t}$。语言模型可以与解码过程融合,根据${\seq{y}}_{<t}$生成流畅度更高的翻译结果。常用的融合方法主要分为浅融合和深融合\upcite{2015OnGulcehre}。

\parinterval 浅融合方法独立训练翻译模型和语言模型,在生成每个词的时候,对两个模型的预测概率进行加权求和得到最终的预测概率。浅融合的不足在于,解码过程对每个词均采用相同的语言模型权重,这实际上是不合理的。比如,在汉语-英语翻译系统中,英语句子中的冠词可能在汉语句子中没有显式的单词对应,这种情况下,英语语言模型可以提供更多帮助,保证翻译结果更加符合英语的语言结构;而在翻译某些名词的时候,语言模型由于没有源语言句子的信息,反而会对解码过程产生干扰,因此权重越小越好。针对这个问题,深融合联合翻译模型和语言模型进行训练,从而在解码过程中动态地计算语言模型的权重,更好地融合翻译模型和语言模型来计算预测概率。
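\parinterval 以浅融合为例,下面给出一个示意性的实现草图。一种常见的做法是在对数概率上对两个模型加权求和,这里的语言模型权重$\beta$和概率取值均为假设的例子:

{\small
\begin{verbatim}
import numpy as np

def shallow_fusion(log_p_tm, log_p_lm, beta=0.3):
    # 浅融合:对翻译模型与语言模型的对数预测概率做加权求和,
    # beta为语言模型的权重,解码时在融合后的分数上选择单词
    return np.asarray(log_p_tm) + beta * np.asarray(log_p_lm)

log_p_tm = np.log([0.6, 0.3, 0.1])   # 翻译模型在当前位置的单词分布
log_p_lm = np.log([0.2, 0.7, 0.1])   # 语言模型在当前位置的单词分布
score = shallow_fusion(log_p_tm, log_p_lm)
print(int(np.argmax(score)))         # 融合后得分最高的单词下标
\end{verbatim}
}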
\parinterval 神经机器翻译模型所使用的编码器-解码器框架天然就包含了对输入(源语言)和输出(目标语言)进行表示学习的过程。在编码端,需要学习一种分布式表示来表示源语言句子的信息,这种分布式表示可以包含序列中每个位置的表示结果(见{\chapternine})。从结构上看,神经机器翻译所使用的编码器与语言模型无异,或者说神经机器翻译的编码器其实就是一个源语言的语言模型。唯一的区别在于,神经机器翻译的编码器并不直接输出源语言句子的生成概率,而传统语言模型是建立在序列生成任务上的。既然神经机器翻译的编码器可以与解码器一起在双语数据上联合训练,那为什么不使用更大规模的数据单独对编码器进行训练呢?或者说,直接使用一个预先训练好的编码器,与机器翻译的解码器配合完成翻译过程。

\parinterval 实现上述想法的一种手段是{\small\sffamily\bfnew{预训练}}\index{预训练}(Pre-training)\index{Pre-training}\upcite{DBLP:conf/nips/DaiL15,DBLP:journals/corr/abs-1802-05365,radford2018improving,devlin2019bert}。预训练的做法相当于将表示模型的学习任务从目标任务中分离出来,这样可以利用额外的更大规模的数据进行学习。常用的一种方法是使用语言建模等方式在大规模单语数据上进行训练,来得到神经机器翻译模型中的一部分(比如词嵌入和编码器等)的模型参数初始值。然后,神经机器翻译模型在双语数据上进行{\small\sffamily\bfnew{微调}}\index{微调}(Fine-tuning)\index{Fine-tuning},以得到最终的翻译模型。

\parinterval 词嵌入可以被看作是对每个独立单词进行的表示学习,在自然语言处理的众多任务中都扮演着重要角色\upcite{DBLP:conf/icml/CollobertW08,2011Natural,DBLP:journals/corr/abs-1901-09069}。到目前为止已经有大量的词嵌入学习方法被提出(见{\chapternine}),因此可以直接应用这些方法在海量的单语数据上训练得到词嵌入,用来初始化神经机器翻译模型的词嵌入参数矩阵\upcite{DBLP:conf/aclwat/NeishiSTIYT17,2018When}。

\parinterval 因此,一种做法是将预训练模型和翻译模型进行融合,把预训练模型作为一个独立的模块来为编码器或者解码器提供句子级表示信息\upcite{DBLP:journals/corr/abs-2002-06823,DBLP:conf/aaai/YangW0Z00020}。另外一种做法是针对生成任务进行预训练。机器翻译是一种典型的语言生成任务,不仅包含源语言表示学习的问题,还有序列到序列的映射,以及目标语言端序列生成的问题,这些知识是无法单独通过(源语言)单语数据学习到的。因此,可以使用单语数据对编码器-解码器结构进行预训练\upcite{song2019mass,DBLP:conf/acl/LewisLGGMLSZ20,DBLP:conf/emnlp/QiYGLDCZ020}。

\parinterval 以{\small\bfnew{掩码端到端预训练}}(Masked Sequence to Sequence Pre-training,MASS)\index{掩码端到端预训练}\index{MASS}方法为例\upcite{song2019mass},其思想与BERT十分相似,也是在预训练过程中采用掩码的方式,随机选择编码器输入句子中的连续片段替换为特殊词[Mask],然后在解码器预测这个连续片段,如图\ref{fig:16-6} 所示。这种做法可以使得编码器捕捉上下文信息,同时迫使解码器依赖于编码器进行自回归的生成,从而学习到编码器和解码器之间的注意力。为了适配下游的机器翻译任务,使预训练模型可以学习到不同语言的表示,MASS对不同语言的句子采用共享词汇表和模型参数的方法,利用同一个预训练模型来进行不同语言句子的预训练。通过这种方式,模型既学到了对源语言句子的编码,也学习到了对目标语言句子的生成方法,之后通过使用双语句对来对预训练模型的参数进行微调,模型可以快速收敛到较好的水平。

%----------------------------------------------
\begin{figure}[htp]

\parinterval 这个例子说明$\funp{P}(\seq{y}|\seq{x})$和$\funp{P}(\seq{x}|\seq{y})$直觉上应当存在联系。当然,$\seq{x}$和$\seq{y}$之间是否存在简单的线性变换关系并没有结论,但是上面的例子给出了一种对源语言句子和目标语言句子进行相互转化的思路。实际上,研究人员已经通过一些数学技巧用目标函数来把$\funp{P}(\seq{y}|\seq{x})$和$\funp{P}(\seq{x}|\seq{y})$联系起来,这样训练神经机器翻译系统一次就可以同时得到两个方向的翻译模型,使得训练变得更加高效\upcite{Hassan2018AchievingHP,DBLP:conf/aaai/Zhang0LZC18,DBLP:conf/wmt/SunJXHWW19}。双向联合训练的基本思想是:使用两个方向的翻译模型对单语数据进行解码,之后用解码后的翻译结果与原始的单语数据作为训练语料,通过多次迭代更新两个方向上的机器翻译模型。

\parinterval 图\ref{fig:16-9}给出了一个双向训练的详细流程,其中$M_{x \rightarrow y}^{k}$表示第$k$轮得到的$x$到$y$的翻译模型,$M_{y \rightarrow x}^{k}$表示第$k$轮得到的$y$到$x$的翻译模型。这里只展示了前两轮迭代。在第一次迭代开始之前,首先使用双语数据对两个初始翻译模型进行预训练。为了保持一致性,这里称之为第0 轮迭代。在第一轮迭代中,首先使用这两个翻译模型$M_{x \rightarrow y}^{0}$和$M_{y \rightarrow x}^{0}$ 翻译单语数据$X=\{ x_i \}$和$Y= \{ y_i \}$ 后得到译文$\{\hat{y}_i^{0} \}$和$\{ \hat{x}_i^{0}\}$。进一步,构建伪训练数据集$\{ x_i,\hat{y}_i^{0}\}$和$\{ \hat{x}_i^{0},y_i \}$。然后使用上面的两个伪训练集和原始双语数据混合训练得到模型$M_{x \rightarrow y}^{1}$和$M_{y \rightarrow x}^{1}$并进行参数更新,即用$\{ x_i,\hat{y}_i^{0}\} \bigcup \{ x_i,y_i\}$训练$M_{y \rightarrow x}^{1}$,用$\{ y_i,\hat{x}_i^{0}\} \bigcup \{ y_i,x_i\}$训练$M_{x \rightarrow y}^{1}$。第二轮迭代继续重复上述过程,使用更新参数后的翻译模型$M_{x \rightarrow y}^{1}$和$M_{y \rightarrow x}^{1}$ 得到新的伪数据集$\{ x_i,\hat{y}_i^{1}\}$和$\{ \hat{x}_i^{1},y_i \}$。然后,进一步得到翻译模型$M_{x \rightarrow y}^{2}$和$M_{y \rightarrow x}^{2}$。这种方式本质上也是一种自学习的过程,通过逐步生成更好的伪数据来提升模型质量。
%---------------------------------------------- %----------------------------------------------
\begin{figure}[h] \begin{figure}[h]
...@@ -296,7 +296,7 @@ ...@@ -296,7 +296,7 @@
%---------------------------------------------------------------------------------------- %----------------------------------------------------------------------------------------
\subsection{对偶学习} \subsection{对偶学习}
\parinterval 对称,也许是人类最喜欢的美,其始终贯穿在整个人类文明的诞生与发展之中。古语“夫美者,上下、内外、大小、远近皆无害焉,故曰美”描述的即是这样的美。在人工智能的任务中,也存在着这样的对称结构,比如机器翻译中英译汉和汉译英、图像处理中的图像标注和图像生成以及语音处理中的语音识别和文字合成等。利用这些任务的对称性质(也称对偶性),可以使互为对偶的两个任务获得更有效的反馈,从而使对应的模型相互学习、相互提高。目前,对偶学习的思想已经广泛应用于低资源机器翻译领域,它不仅能够提升在有限双语资源下的翻译模型性能({\small\bfnew{有监督对偶学习}},Dual Supervised Learning\index{Dual Supervised Learning}\upcite{DBLP:conf/icml/XiaQCBYL17,DBLP:conf/acl/SuHC19,DBLP:journals/ejasmp/RadzikowskiNWY19},而且能够利用未标注的单语数据来进行学习({\small\bfnew{无监督对偶学习}},Dual Unsupervised Learning\index{Dual Unsupervised Learning}\upcite{qin2020dual,DBLP:conf/iccv/YiZTG17,DBLP:journals/access/DuRZH20}。下面将一一展开讨论。 \parinterval 对称,也许是人类最喜欢的美,其始终贯穿在整个人类文明的诞生与发展之中。古语“夫美者,上下、内外、大小、远近皆无害焉,故曰美”描述的即是这样的美。在人工智能的任务中,也存在着这样的对称结构,比如机器翻译中英译汉和汉译英、图像处理中的图像标注和图像生成以及语音处理中的语音识别和语音合成等。利用这些任务的对称性质(也称对偶性),可以使互为对偶的两个任务获得更有效的反馈,从而使对应的模型相互学习、相互提高。目前,对偶学习的思想已经广泛应用于低资源机器翻译领域,它不仅能够提升在有限双语资源下的翻译模型性能({\small\bfnew{有监督对偶学习}},Dual Supervised Learning\index{Dual Supervised Learning}\upcite{DBLP:conf/icml/XiaQCBYL17,DBLP:conf/acl/SuHC19,DBLP:journals/ejasmp/RadzikowskiNWY19},而且能够利用未标注的单语数据来进行学习({\small\bfnew{无监督对偶学习}},Dual Unsupervised Learning\index{Dual Unsupervised Learning}\upcite{qin2020dual,DBLP:conf/iccv/YiZTG17,DBLP:journals/access/DuRZH20}。下面将一一展开讨论。
%---------------------------------------------------------------------------------------- %----------------------------------------------------------------------------------------
% NEW SUB-SUB-SECTION % NEW SUB-SUB-SECTION
...@@ -329,7 +329,7 @@ ...@@ -329,7 +329,7 @@
%---------------------------------------------------------------------------------------- %----------------------------------------------------------------------------------------
\subsubsection{2. 无监督对偶学习} \subsubsection{2. 无监督对偶学习}
\parinterval 如上一节所述,有监督的对偶学习需要使用双语数据来训练两个翻译模型。幸运的是,存在大量的单语数据可供使用。因此,如何使用这些单语数据来提升翻译模型的性能是一个关键问题。 \parinterval 有监督的对偶学习需要使用双语数据来训练两个翻译模型,但是有些低资源语言仅有少量双语数据可用于训练。幸运的是,存在大量的单语数据可供使用。因此,如何使用这些单语数据来提升翻译模型的性能也是一个关键问题。
\parinterval 无监督对偶学习提供了一个解决问题的思路\upcite{qin2020dual}。假设目前有两个比较弱的翻译模型,一个原始任务模型$f$将源语言句子$\seq{x}$翻译成目标语言句子$\seq{y}$,一个对偶任务模型$g$将目标语言句子$\seq{y}$翻译成源语言句子$\seq{x}$。翻译模型可由有限的双语训练或者使用无监督机器翻译的方法得到。如图\ref{fig:16-10}所示,无监督对偶学习的做法是,先通过原始任务模型$f$将一个源语言单语句子$x$翻译为目标语言句子$y$,由于没有参考译文,无法判断$y$的正确性。但通过语言模型,可以判断这个句子是否通顺、符合语法规范,这些信息可用来评估翻译模型$f$的翻译流畅性。随后,再通过对偶任务模型$g$将目标语言句子$y$翻译为源语言句子$x^{'}$。如果模型$f$$g$的翻译性能较好,那么$x^{'}$$x$会十分相似。通过计算二者的{\small\bfnew{重构损失}}\index{重构损失}(Reconstruction Loss)\index{Reconstruction Loss},就可以优化模型$f$$g$的参数。这个过程可以多次迭代,从大量的无标注单语数据上不断提升性能。 \parinterval 无监督对偶学习提供了一个解决问题的思路\upcite{qin2020dual}。假设目前有两个比较弱的翻译模型,一个原始任务模型$f$将源语言句子$\seq{x}$翻译成目标语言句子$\seq{y}$,一个对偶任务模型$g$将目标语言句子$\seq{y}$翻译成源语言句子$\seq{x}$。翻译模型可由有限的双语训练或者使用无监督机器翻译的方法得到。如图\ref{fig:16-10}所示,无监督对偶学习的做法是,先通过原始任务模型$f$将一个源语言单语句子$x$翻译为目标语言句子$y$,由于没有参考译文,无法判断$y$的正确性。但通过语言模型,可以判断这个句子是否通顺、符合语法规范,这些信息可用来评估翻译模型$f$的翻译流畅性。随后,再通过对偶任务模型$g$将目标语言句子$y$翻译为源语言句子$x^{'}$。如果模型$f$$g$的翻译性能较好,那么$x^{'}$$x$会十分相似。通过计算二者的{\small\bfnew{重构损失}}\index{重构损失}(Reconstruction Loss)\index{Reconstruction Loss},就可以优化模型$f$$g$的参数。这个过程可以多次迭代,从大量的无标注单语数据上不断提升性能。
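\parinterval 下面用一段简化的代码来示意这一循环(其中翻译模型$f$、$g$与语言模型均以最简单的函数形式出现,损失的计算也做了高度简化,仅用于说明思路):
\begin{verbatim}
def dual_learning_step(f, g, lm_y, x, alpha=0.5):
    # f: 源语言->目标语言的翻译函数;g: 目标语言->源语言的翻译函数
    # lm_y: 目标语言语言模型,返回[0,1]之间的流畅度得分(越大越流畅)
    y = f(x)                          # 前向翻译:x -> y(没有参考译文)
    fluency = lm_y(y)                 # 语言模型评价 y 是否通顺
    x_rec = g(y)                      # 反向翻译:y -> x'
    # 重构损失:这里用词级别的不匹配率近似代替真实的似然损失
    src, rec = x.split(), x_rec.split()
    mismatch = sum(a != b for a, b in zip(src, rec)) + abs(len(src) - len(rec))
    rec_loss = mismatch / max(len(src), 1)
    # 总损失可同时用于更新 f 和 g 的参数,多轮迭代后不断提升
    return alpha * (1.0 - fluency) + (1.0 - alpha) * rec_loss

# 用法示意:用恒等函数和常数语言模型代替真实模型,仅作演示
loss = dual_learning_step(lambda s: s, lambda s: s, lambda s: 0.8, "一 个 源语言 句子")
print(loss)   # 0.1
\end{verbatim}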
...@@ -576,7 +576,7 @@ ...@@ -576,7 +576,7 @@
\parinterval 在得到映射$\mathbi{W}$之后,对于$\mathbi{X}$中的任意一个单词$x_{i}$,通过$\mathbi{W} \mathbi{E}({x}_{i})$将其映射到空间$\mathbi{Y}$中($\mathbi{E}({x}_{i})$表示的是单词$x_{i}$的词嵌入向量),然后在$\mathbi{Y}$中找到该点的最近邻点$y_{j}$,于是$y_{j}$就是$x_{i}$的翻译词,重复该过程即可归纳出种子词典$D$,第一阶段结束。事实上,由于第一阶段缺乏监督信号,得到的种子词典$D$会包含大量的噪声,因此需要进行进一步的微调。 \parinterval 在得到映射$\mathbi{W}$之后,对于$\mathbi{X}$中的任意一个单词$x_{i}$,通过$\mathbi{W} \mathbi{E}({x}_{i})$将其映射到空间$\mathbi{Y}$中($\mathbi{E}({x}_{i})$表示的是单词$x_{i}$的词嵌入向量),然后在$\mathbi{Y}$中找到该点的最近邻点$y_{j}$,于是$y_{j}$就是$x_{i}$的翻译词,重复该过程即可归纳出种子词典$D$,第一阶段结束。事实上,由于第一阶段缺乏监督信号,得到的种子词典$D$会包含大量的噪声,因此需要进行进一步的微调。
\parinterval 微调的原理普遍基于普氏分析\upcite{DBLP:journals/corr/MikolovLS13}。假设现在有一个种子词典$D=\left\{x_{i}, y_{i}\right\}$,其中${i \in\{1, \ldots, n\}}$,和两个单语词嵌入$\mathbi{X}$和$\mathbi{Y}$,那么就可以将$D$作为{\small\bfnew{映射锚点}}\index{映射锚点}(Anchor\index{Anchor})学习一个转移矩阵$\mathbi{W}$,使得$\mathbi{W} \mathbi{X}$与$\mathbi{Y}$这两个空间尽可能相近,此外通过对$\mathbi{W}$施加正交约束可以显著提高性能\upcite{DBLP:conf/naacl/XingWLL15},于是这个优化问题就转变成了{\small\bfnew{普鲁克问题}}\index{普鲁克问题}(Procrustes Problem\index{Procrustes Problem})\upcite{DBLP:conf/iclr/SmithTHH17},可以通过{\small\bfnew{奇异值分解}}\index{奇异值分解}(Singular Value Decomposition,SVD\index{Singular Value Decomposition,SVD})来获得近似解: \parinterval 微调的原理普遍基于普氏分析\upcite{DBLP:journals/corr/MikolovLS13}。假设现在有一个种子词典$D=\left\{x_{i}, y_{i}\right\}$,其中${i \in\{1, \ldots, n\}}$,和两个单语词嵌入$\mathbi{X}$和$\mathbi{Y}$,那么就可以将$D$作为{\small\bfnew{映射锚点}}\index{映射锚点}(Anchor\index{Anchor})学习一个转移矩阵$\mathbi{W}$,使得$\mathbi{W} \mathbi{X}$与$\mathbi{Y}$这两个空间尽可能相近,此外通过对$\mathbi{W}$施加正交约束可以显著提高性能\upcite{DBLP:conf/naacl/XingWLL15},于是这个优化问题就转变成了{\small\bfnew{普鲁克问题}}\index{普鲁克问题}(Procrustes Problem\index{Procrustes Problem})\upcite{DBLP:conf/iclr/SmithTHH17},可以通过{\small\bfnew{奇异值分解}}\index{奇异值分解}(Singular Value Decomposition,SVD\index{Singular Value Decomposition})来获得近似解:
\begin{eqnarray} \begin{eqnarray}
\mathbi{W}^{\star} & = &\underset{\mathbi{W} \in O_{d}(\mathbb{R})}{\operatorname{argmin}}\|\mathbi{W} \mathbi{X}'- \mathbi{Y}' \|_{\mathrm{F}} \nonumber \\ \mathbi{W}^{\star} & = &\underset{\mathbi{W} \in O_{d}(\mathbb{R})}{\operatorname{argmin}}\|\mathbi{W} \mathbi{X}'- \mathbi{Y}' \|_{\mathrm{F}} \nonumber \\
......
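\parinterval 上述种子词典归纳与基于SVD的微调过程可以用下面的numpy示意代码串联起来(仅为说明原理的简化实现:词嵌入为随机构造,初始映射用单位矩阵代替前一阶段得到的映射,变量名均为假设):
\begin{verbatim}
import numpy as np

def induce_seed_dict(W, X, Y):
    # 第一阶段:将X中每个词向量经W映射后,在Y空间取最近邻作为翻译词
    # X, Y 的形状均为(词数, 维度),返回 [(i, j), ...] 形式的种子词典
    mapped = X @ W.T
    Xn = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sim = Xn @ Yn.T
    return [(i, int(np.argmax(sim[i]))) for i in range(X.shape[0])]

def procrustes(X_anchor, Y_anchor):
    # 第二阶段:在正交约束下求解 min ||W X' - Y'||_F,
    # 近似解为 W* = U V^T,其中 U S V^T = SVD(Y' X'^T)
    U, _, Vt = np.linalg.svd(Y_anchor @ X_anchor.T)
    return U @ Vt

# 用法示意:随机构造两个“单语”词嵌入空间,并迭代两轮
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(100, 32)), rng.normal(size=(100, 32))
W = np.eye(32)    # 仅为演示,实际应为前一阶段(如对抗训练)得到的初始映射
for _ in range(2):
    seed = induce_seed_dict(W, X, Y)
    X_anchor = np.stack([X[i] for i, _ in seed], axis=1)  # 维度 x 词数
    Y_anchor = np.stack([Y[j] for _, j in seed], axis=1)
    W = procrustes(X_anchor, Y_anchor)
\end{verbatim}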
...@@ -34,8 +34,8 @@ ...@@ -34,8 +34,8 @@
%---------------------------------------------- %----------------------------------------------
\begin{figure}[htp] \begin{figure}[htp]
\centering \centering
\subfigure[机器翻译系统被看作一个黑盒] {\input{./Chapter3/Figures/figure-mt-system-as-a-black-box} } \subfigure[\small{机器翻译系统被看作一个黑盒}] {\input{./Chapter3/Figures/figure-mt-system-as-a-black-box} }
\subfigure[机器翻译系统 = 前/后处理 + 核心引擎] {\input{./Chapter3/Figures/figure-mt=language-analysis+translation-engine}} \subfigure[\small{机器翻译系统 = 前/后处理 + 核心引擎}] {\input{./Chapter3/Figures/figure-mt=language-analysis+translation-engine}}
\caption{机器翻译系统的结构} \caption{机器翻译系统的结构}
\label{fig:3.1-1} \label{fig:3.1-1}
\end{figure} \end{figure}
...@@ -253,8 +253,8 @@ $计算这种切分的概率值。 ...@@ -253,8 +253,8 @@ $计算这种切分的概率值。
%---------------------------------------------- %----------------------------------------------
\begin{figure}[htp] \begin{figure}[htp]
\centering \centering
\subfigure[BIO格式标注命名实体] {\input{./Chapter3/Figures/figure-labeling-named-entities-in-bio-format} } \subfigure[\small{BIO格式标注命名实体}] {\input{./Chapter3/Figures/figure-labeling-named-entities-in-bio-format} }
\subfigure[BIOES格式标注命名实体] {\input{./Chapter3/Figures/figure-labeling-named-entities-in-bioes-format}} \subfigure[\small{BIOES格式标注命名实体}] {\input{./Chapter3/Figures/figure-labeling-named-entities-in-bioes-format}}
\caption{BIO和BIOES格式对比} \caption{BIO和BIOES格式对比}
\label{fig:3.3-1} \label{fig:3.3-1}
\end{figure} \end{figure}
...@@ -514,7 +514,7 @@ Z(\seq{x})&=&\sum_{\seq{y}}\exp(\sum_{i=1}^m\sum_{j=1}^k\lambda_{j}F_{j}(y_{i-1} ...@@ -514,7 +514,7 @@ Z(\seq{x})&=&\sum_{\seq{y}}\exp(\sum_{i=1}^m\sum_{j=1}^k\lambda_{j}F_{j}(y_{i-1}
\begin{figure}[htp] \begin{figure}[htp]
\centering \centering
\begin{tabular}{l l l} \begin{tabular}{l l l}
\subfigure[HMM处理序列标注]{\input{./Chapter3/Figures/figure-process-sequence-labeling-by-hmm}} & \subfigure[CRF处理序列标注]{\input{./Chapter3/Figures/figure-process-sequence-labeling-by-crf}} & \subfigure[分类模型处理序列标注]{\input{./Chapter3/Figures/figure-process-sequence-labeling-by-classfication}} \subfigure[\small{HMM处理序列标注}]{\input{./Chapter3/Figures/figure-process-sequence-labeling-by-hmm}} & \subfigure[\small{CRF处理序列标注}]{\input{./Chapter3/Figures/figure-process-sequence-labeling-by-crf}} & \subfigure[\small{分类模型处理序列标注}]{\input{./Chapter3/Figures/figure-process-sequence-labeling-by-classfication}}
\end{tabular} \end{tabular}
\caption{HMM、CRF、分类算法三种方法对比} \caption{HMM、CRF、分类算法三种方法对比}
\label{fig:3.3-7} \label{fig:3.3-7}
......
...@@ -149,7 +149,7 @@ ...@@ -149,7 +149,7 @@
\item {\small\sffamily\bfseries{根据冲突次数进行排序}}\upcite{DBLP:conf/wmt/Lopez12}。第一种排序策略中存在冲突现象:例如在每次两两比较中,系统${S}_j$胜过系统${S}_k$ 的次数比系统${S}_j$不敌系统${S}_k$的次数多,若待评价系统仅有系统${S}_j$${S}_k$,显然系统${S}_j$的排名高于系统${S}_k$。但当待评价系统很多时,可能系统${S}_j$在所有比较中获胜的次数低于系统${S}_k$,此时就出现了总体排序与局部排序不一致的冲突。因此,有研究者提出,能够与局部排序冲突最少的总体排序才是最合理的。令$O$表示一个对若干个系统的排序,该排序所对应的冲突定义为: \item {\small\sffamily\bfseries{根据冲突次数进行排序}}\upcite{DBLP:conf/wmt/Lopez12}。第一种排序策略中存在冲突现象:例如在每次两两比较中,系统${S}_j$胜过系统${S}_k$ 的次数比系统${S}_j$不敌系统${S}_k$的次数多,若待评价系统仅有系统${S}_j$${S}_k$,显然系统${S}_j$的排名高于系统${S}_k$。但当待评价系统很多时,可能系统${S}_j$在所有比较中获胜的次数低于系统${S}_k$,此时就出现了总体排序与局部排序不一致的冲突。因此,有研究者提出,能够与局部排序冲突最少的总体排序才是最合理的。令$O$表示一个对若干个系统的排序,该排序所对应的冲突定义为:
\begin{eqnarray} \begin{eqnarray}
\textrm{conflict}(O) &=& \sum\limits_{{{S}_j} \in O,{{S}_k} \in O,j \ne k} {{\textrm{max}}(0,\textrm{count}_{\textrm{win}}({{S}_j},{{S}_k}) - \textrm{count}_{\textrm{loss}}({{S}_j},{{S}_k}))} \textrm{conflict}(O) =\sum\limits_{{{S}_j},{{S}_k} \in O,j \ne k} {{\textrm{max}}(0,\textrm{count}_{\textrm{win}}({{S}_j},{{S}_k}) - \textrm{count}_{\textrm{loss}}({{S}_j},{{S}_k}))}
\label{eq:4-1} \label{eq:4-1}
\end{eqnarray} \end{eqnarray}
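\parinterval 作为补充,下面给出上述冲突度计算的一种示意实现(这里采用的解读是:只对总体排序与两两比较结果不一致的系统对累加差值;两两胜负次数为假设的输入):
\begin{verbatim}
def conflict(order, count_win, count_loss):
    # order: 一个总体排序(排名靠前的系统在前)
    # count_win[(j, k)] / count_loss[(j, k)]: 系统j胜过/不敌系统k的次数
    # 解读:若排序认为 S_k 优于 S_j,但 S_j 对 S_k 的净胜次数为正,
    # 则产生冲突,累加 max(0, win - loss)
    total = 0
    for pos_j, sj in enumerate(order):
        for pos_k, sk in enumerate(order):
            if pos_j <= pos_k:        # 只看排序中 S_j 排在 S_k 之后的系统对
                continue
            total += max(0, count_win[(sj, sk)] - count_loss[(sj, sk)])
    return total

# 用法示意:三个系统的两两胜负统计(数字为假设)
order = ["S1", "S2", "S3"]
wins = {("S1", "S2"): 3, ("S2", "S1"): 7, ("S1", "S3"): 8, ("S3", "S1"): 2,
        ("S2", "S3"): 6, ("S3", "S2"): 4}
losses = {(a, b): wins[(b, a)] for (a, b) in wins}
print(conflict(order, wins, losses))   # 输出 4:排序 S1>S2 与 S2 更常获胜相冲突
\end{verbatim}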
...@@ -297,8 +297,8 @@ ...@@ -297,8 +297,8 @@
%---------------------------------------------- %----------------------------------------------
\begin{figure}[htp] \begin{figure}[htp]
\centering \centering
\subfigure[“绝对”匹配词对齐-1]{\input{./Chapter4/Figures/figure-absolute-match-word-alignment-1}} \subfigure[\small{“绝对”匹配词对齐-1}]{\input{./Chapter4/Figures/figure-absolute-match-word-alignment-1}}
\subfigure[“绝对”匹配词对齐-2]{\input{./Chapter4/Figures/figure-absolute-match-word-alignment-2}} \subfigure[\small{“绝对”匹配词对齐-2}]{\input{./Chapter4/Figures/figure-absolute-match-word-alignment-2}}
\caption{“绝对”匹配模型} \caption{“绝对”匹配模型}
\label{fig:4-3} \label{fig:4-3}
\end{figure} \end{figure}
...@@ -490,8 +490,8 @@ His house is on the south bank of the river. ...@@ -490,8 +490,8 @@ His house is on the south bank of the river.
%---------------------------------------------- %----------------------------------------------
\begin{figure}[htp] \begin{figure}[htp]
\centering \centering
\subfigure[英语参考答案集表示]{\input{./Chapter4/Figures/figure-representation-of-english-reference-answer-set}} \subfigure[\small{英语参考答案集表示}]{\input{./Chapter4/Figures/figure-representation-of-english-reference-answer-set}}
\subfigure[捷克语参考答案集表示]{\input{./Chapter4/Figures/figure-representation-of-czech-reference-answer-set}} \subfigure[\small{捷克语参考答案集表示}]{\input{./Chapter4/Figures/figure-representation-of-czech-reference-answer-set}}
\caption{使用HyTER构造的参考答案集} \caption{使用HyTER构造的参考答案集}
\label{fig:4-8} \label{fig:4-8}
\end{figure} \end{figure}
...@@ -815,11 +815,11 @@ d&=&t \frac{s}{\sqrt{n}} ...@@ -815,11 +815,11 @@ d&=&t \frac{s}{\sqrt{n}}
\begin{example} \begin{example}
文档级质量评估任务 文档级质量评估任务
Candidate: A {\red housewife} won the first prize in the supermarket's anniversary 上文信息:A {\red housewife} won the first prize in the supermarket's anniversary
\hspace{5em}celebration. \hspace{5em}celebration.
Reference: A few days ago, {\red he} contacted the News Channel and said that the 机器译文:A few days ago, {\red he} contacted the News Channel and said that the
\hspace{5em}supermarket owner refused to give {\red him} the prize. \hspace{5em}supermarket owner refused to give {\red him} the prize.
\label{eg:4-9} \label{eg:4-9}
......
...@@ -7,7 +7,7 @@ ...@@ -7,7 +7,7 @@
\foreach \y in {1.0,0.5}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(0,\y){\y};} \foreach \y in {1.0,0.5}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(0,\y){\y};}
\draw[color=red ,domain=-1.4:1, line width=1pt]plot(\x,{ln(1+exp(\x))}); \draw[color=red ,domain=-1.4:1, line width=1pt]plot(\x,{ln(1+exp(\x))});
\node[black,anchor=south] at (0,1.4) {\small $y = \ln(1+{\textrm e}^x)$}; \node[black,anchor=south] at (0,1.4) {\small $y = \ln(1+{\textrm e}^x)$};
\node [anchor=south east,inner sep=1pt] (labela) at (0.8,-2) {\footnotesize{(a) Softplus}}; \node [anchor=south east,inner sep=1pt] (labela) at (0.8,-2) {\small{(a) Softplus}};
\end{scope} \end{scope}
%%%------------------------------------------------------------------------------------------------------------ %%%------------------------------------------------------------------------------------------------------------
...@@ -22,7 +22,7 @@ ...@@ -22,7 +22,7 @@
\foreach \y in {0.5,1.0}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(-0.15,\y){\y};} \foreach \y in {0.5,1.0}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(-0.15,\y){\y};}
\draw[color=red,domain=-1.4:1.4, line width=1pt]plot(\x,{1/(1+(exp(-5*\x)))}); \draw[color=red,domain=-1.4:1.4, line width=1pt]plot(\x,{1/(1+(exp(-5*\x)))});
\node[black,anchor=south] at (0,1.4) {\small $y = \frac{1}{1+{\textrm e}^{-x}}$}; \node[black,anchor=south] at (0,1.4) {\small $y = \frac{1}{1+{\textrm e}^{-x}}$};
\node [anchor=south east,inner sep=1pt] (labelb) at (0.8,-2) {\footnotesize{(b) Sigmoid}}; \node [anchor=south east,inner sep=1pt] (labelb) at (0.8,-2) {\small{(b) Sigmoid}};
\end{scope} \end{scope}
%%%------------------------------------------------------------------------------------------------------------ %%%------------------------------------------------------------------------------------------------------------
...@@ -35,7 +35,7 @@ ...@@ -35,7 +35,7 @@
\foreach \y in {-1.0,-0.5,0.5,1.0}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(0,\y){\y};} \foreach \y in {-1.0,-0.5,0.5,1.0}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(0,\y){\y};}
\draw[color=red ,domain=-1.4:1.4, line width=1pt]plot(\x,{tanh(\x)}); \draw[color=red ,domain=-1.4:1.4, line width=1pt]plot(\x,{tanh(\x)});
\node[black,anchor=south] at (0,1.4) {\small $y = \frac{{\textrm e}^{x}-{\textrm e}^{-x}}{{\textrm e}^{x}+{\textrm e}^{-x}}$}; \node[black,anchor=south] at (0,1.4) {\small $y = \frac{{\textrm e}^{x}-{\textrm e}^{-x}}{{\textrm e}^{x}+{\textrm e}^{-x}}$};
\node [anchor=south east,inner sep=1pt] (labelc) at (0.8,-2) {\footnotesize{(c) Tanh}}; \node [anchor=south east,inner sep=1pt] (labelc) at (0.8,-2) {\small{(c) Tanh}};
\end{scope} \end{scope}
%%%------------------------------------------------------------------------------------------------------------ %%%------------------------------------------------------------------------------------------------------------
...@@ -47,7 +47,7 @@ ...@@ -47,7 +47,7 @@
\foreach \y in {0.5,1.0}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(0,\y){\y};} \foreach \y in {0.5,1.0}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(0,\y){\y};}
\draw[color=red ,domain=-1.4:1.4, line width=1pt]plot(\x,{max(\x,0)}); \draw[color=red ,domain=-1.4:1.4, line width=1pt]plot(\x,{max(\x,0)});
\node[black,anchor=south] at (0,1.4) {\small $y =\max (0, x)$}; \node[black,anchor=south] at (0,1.4) {\small $y =\max (0, x)$};
\node [anchor=south east,inner sep=1pt] (labeld) at (0.8,-2) {\footnotesize{(d) ReLU}}; \node [anchor=south east,inner sep=1pt] (labeld) at (0.8,-2) {\small{(d) ReLU}};
\end{scope} \end{scope}
%%%------------------------------------------------------------------------------------------------------------ %%%------------------------------------------------------------------------------------------------------------
...@@ -58,7 +58,7 @@ ...@@ -58,7 +58,7 @@
\foreach \y in {0.5,1.0}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(-0.15,\y){\y};} \foreach \y in {0.5,1.0}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(-0.15,\y){\y};}
\draw[color=red ,domain=-1.4:1.4, line width=1pt]plot(\x,{exp(-1*((\x)^2))}); \draw[color=red ,domain=-1.4:1.4, line width=1pt]plot(\x,{exp(-1*((\x)^2))});
\node[black,anchor=south] at (0,1.4) {\small $y =e^{-x^2}$}; \node[black,anchor=south] at (0,1.4) {\small $y =e^{-x^2}$};
\node [anchor=south east,inner sep=1pt] (labele) at (0.8,-2) {\footnotesize{(e) Gaussian}}; \node [anchor=south east,inner sep=1pt] (labele) at (0.8,-2) {\small{(e) Gaussian}};
\end{scope} \end{scope}
%%%------------------------------------------------------------------------------------------------------------ %%%------------------------------------------------------------------------------------------------------------
...@@ -69,7 +69,7 @@ ...@@ -69,7 +69,7 @@
\foreach \y in {0.5,1.0}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(0,\y){\y};} \foreach \y in {0.5,1.0}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(0,\y){\y};}
\draw[color=red ,domain=-1:1, line width=1pt]plot(\x,\x); \draw[color=red ,domain=-1:1, line width=1pt]plot(\x,\x);
\node[black,anchor=south] at (0,1.4) {\small $y =x$}; \node[black,anchor=south] at (0,1.4) {\small $y =x$};
\node [anchor=south east,inner sep=1pt] (labelf) at (0.8,-2) {\footnotesize{(f) Identity}}; \node [anchor=south east,inner sep=1pt] (labelf) at (0.8,-2) {\small{(f) Identity}};
\end{scope} \end{scope}
\end{tikzpicture} \end{tikzpicture}
%%%------------------------------------------------------------------------------------------------------------ %%%------------------------------------------------------------------------------------------------------------
\ No newline at end of file
...@@ -8,7 +8,7 @@ ...@@ -8,7 +8,7 @@
\draw [-] (-0.05,1) -- (0.05,1); \draw [-] (-0.05,1) -- (0.05,1);
\node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}}; \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
\node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}}; \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
\node [anchor=south east,inner sep=1pt] (labela) at (0.2,-0.5) {\footnotesize{(a)}}; \node [anchor=south east,inner sep=1pt] (labela) at (0.2,-0.5) {\small{(a)}};
} }
{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {{\scriptsize{\ $w_{11}=100$}}\\[-0ex] \scriptsize{\ $b_1=0$}};} {\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {{\scriptsize{\ $w_{11}=100$}}\\[-0ex] \scriptsize{\ $b_1=0$}};}
{\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0,0) -- (0,1) -- (1.5,1);} {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0,0) -- (0,1) -- (1.5,1);}
...@@ -21,7 +21,7 @@ ...@@ -21,7 +21,7 @@
\draw [-] (-0.05,1) -- (0.05,1); \draw [-] (-0.05,1) -- (0.05,1);
\node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}}; \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
\node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}}; \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
\node [anchor=south east,inner sep=1pt] (labelb) at (0.2,-0.5) {\footnotesize{(b)}}; \node [anchor=south east,inner sep=1pt] (labelb) at (0.2,-0.5) {\small{(b)}};
} }
{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {\scriptsize{\ $w_{11}=100$}\\[-0ex] {\scriptsize{\ $b_1=-2$}}};} {\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {\scriptsize{\ $w_{11}=100$}\\[-0ex] {\scriptsize{\ $b_1=-2$}}};}
{\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0.25,0) -- (0.25,1) -- (1.5,1);} {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0.25,0) -- (0.25,1) -- (1.5,1);}
...@@ -34,7 +34,7 @@ ...@@ -34,7 +34,7 @@
\draw [-] (-0.05,1) -- (0.05,1); \draw [-] (-0.05,1) -- (0.05,1);
\node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}}; \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
\node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}}; \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
\node [anchor=south east,inner sep=1pt] (labelc) at (0.2,-0.5) {\footnotesize{(c)}}; \node [anchor=south east,inner sep=1pt] (labelc) at (0.2,-0.5) {\small{(c)}};
} }
{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {\scriptsize{\ $w_{11}=100$}\\[-0ex] {\scriptsize{\ $b_1=-4$}}};} {\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {\scriptsize{\ $w_{11}=100$}\\[-0ex] {\scriptsize{\ $b_1=-4$}}};}
{\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0.5,0) -- (0.5,1) -- (1.5,1);} {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0.5,0) -- (0.5,1) -- (1.5,1);}
......
...@@ -9,7 +9,7 @@ ...@@ -9,7 +9,7 @@
\addtocounter{mycount1}{1}; \addtocounter{mycount1}{1};
} }
\node [anchor=south] (varlabel) at (0,0.6) {$\mathbi{s}$}; \node [anchor=south] (varlabel) at (0,0.6) {$\mathbi{s}$};
\node [anchor=north] (labelc) at (0,-0.7) {\footnotesize{(a)}}; \node [anchor=north] (labelc) at (0,-0.7) {\small{(a)}};
\end{scope} \end{scope}
\begin{scope}[xshift=2.1in] \begin{scope}[xshift=2.1in]
...@@ -21,7 +21,7 @@ ...@@ -21,7 +21,7 @@
\addtocounter{mycount1}{1}; \addtocounter{mycount1}{1};
} }
\node [anchor=south] (varlabel) at (0,0.1) {$\mathbi{b}$}; \node [anchor=south] (varlabel) at (0,0.1) {$\mathbi{b}$};
\node [anchor=north] (labelc) at (0,-0.7) {\footnotesize{(b)}}; \node [anchor=north] (labelc) at (0,-0.7) {\small{(b)}};
\end{scope} \end{scope}
...@@ -51,7 +51,7 @@ ...@@ -51,7 +51,7 @@
} }
\node [anchor=center] (plabel) at (-4.5em,0) {\huge{$\mathbf{+}$}}; \node [anchor=center] (plabel) at (-4.5em,0) {\huge{$\mathbf{+}$}};
\node [anchor=south] (varlabel) at (0,0.6) {$\mathbi{b}$}; \node [anchor=south] (varlabel) at (0,0.6) {$\mathbi{b}$};
\node [anchor=north] (labelc) at (0,-0.7) {\footnotesize{(c)}}; \node [anchor=north] (labelc) at (0,-0.7) {\small{(c)}};
\end{scope} \end{scope}
\begin{scope}[yshift=-1in,xshift=3in] \begin{scope}[yshift=-1in,xshift=3in]
\setcounter{mycount1}{2} \setcounter{mycount1}{2}
......
...@@ -10,7 +10,7 @@ ...@@ -10,7 +10,7 @@
\draw [-,ublue] (n10.west) -- (n10.east); \draw [-,ublue] (n10.west) -- (n10.east);
\draw [-,ublue] (n11.west) -- (n11.east); \draw [-,ublue] (n11.west) -- (n11.east);
\node [anchor=north] (x1) at ([yshift=-6em]n11.south) {$x_1$}; \node [anchor=north] (x1) at ([yshift=-6em]n11.south) {$x_1$};
\node [anchor=north] (labela) at ([xshift=3.5em,yshift=-0.5em]x1.south) {\footnotesize{(a) 拟合一小段函数}}; \node [anchor=north] (labela) at ([xshift=3.5em,yshift=-0.5em]x1.south) {\small{(a) 拟合一小段函数}};
\node [anchor=north] (b) at ([yshift=-6em]n10.south) {$\mathbi{b}$}; \node [anchor=north] (b) at ([yshift=-6em]n10.south) {$\mathbi{b}$};
{ {
\draw [->,thick,red] (b.north) -- ([yshift=-0.1em]n10.south); \draw [->,thick,red] (b.north) -- ([yshift=-0.1em]n10.south);
...@@ -92,7 +92,7 @@ ...@@ -92,7 +92,7 @@
\draw [-,ublue] (n10.west) -- (n10.east); \draw [-,ublue] (n10.west) -- (n10.east);
\draw [-,ublue] (n11.west) -- (n11.east); \draw [-,ublue] (n11.west) -- (n11.east);
\node [anchor=north] (x1) at ([yshift=-6em]n11.south) {$x_1$}; \node [anchor=north] (x1) at ([yshift=-6em]n11.south) {$x_1$};
\node [anchor=north] (labelb) at ([xshift=6em,yshift=-0.5em]x1.south) {\footnotesize{(b) 拟合更大一段函数}}; \node [anchor=north] (labelb) at ([xshift=6em,yshift=-0.5em]x1.south) {\small{(b) 拟合更大一段函数}};
\node [anchor=north] (b) at ([yshift=-6em]n10.south) {$\mathbi{b}$}; \node [anchor=north] (b) at ([yshift=-6em]n10.south) {$\mathbi{b}$};
{ {
\draw [->,thick,red] (b.north) -- ([yshift=-0.1em]n10.south); \draw [->,thick,red] (b.north) -- ([yshift=-0.1em]n10.south);
......
...@@ -32,13 +32,13 @@ ...@@ -32,13 +32,13 @@
} }
{ {
\draw[->,very thick,red] ([xshift=-0.5em,yshift=2pt]processor2.north) -- ([xshift=-0.5em,yshift=-2pt]serverbox.south) node [pos=0.5,align=right,xshift=-2em] (pushlabel) {\footnotesize{$\frac{\partial J}{\partial{\bm \theta}}$}};; \draw[->,very thick,red!70] ([xshift=-0.5em,yshift=2pt]processor2.north) -- ([xshift=-0.5em,yshift=-2pt]serverbox.south) node [pos=0.5,align=right,xshift=-2em] (pushlabel) {\footnotesize{$\frac{\partial J}{\partial{\bm \theta}}$}};;
\draw[<-,very thick,blue] ([xshift=0.5em,yshift=2pt]processor2.north) -- ([xshift=0.5em,yshift=-2pt]serverbox.south) node [pos=0.5,align=left,xshift=2.2em] (fetchlabel) {\footnotesize{${\bm \theta}_{\textrm{new}}$}};;; \draw[<-,very thick,ublue!90] ([xshift=0.5em,yshift=2pt]processor2.north) -- ([xshift=0.5em,yshift=-2pt]serverbox.south) node [pos=0.5,align=left,xshift=2.2em] (fetchlabel) {\footnotesize{${\bm \theta}_{\textrm{new}}$}};;;
\draw[->,very thick,red] ([xshift=-0.5em,yshift=2pt]processor3.north) -- \draw[->,very thick,red!70] ([xshift=-0.5em,yshift=2pt]processor3.north) --
([xshift=3em,yshift=-2pt]serverbox.south); ([xshift=3em,yshift=-2pt]serverbox.south);
\draw[<-,very thick,blue] ([xshift=0.5em,yshift=2pt]processor3.north) -- ([xshift=4em,yshift=-2pt]serverbox.south) node [pos=0.49,align=left,xshift=2.2em] (fetchlabel) {\footnotesize{Fetch($\cdot$)}}; \draw[<-,very thick,ublue!90] ([xshift=0.5em,yshift=2pt]processor3.north) -- ([xshift=4em,yshift=-2pt]serverbox.south) node [pos=0.49,align=left,xshift=2.2em] (fetchlabel) {\footnotesize{Fetch($\cdot$)}};
\draw[->,very thick,red] ([xshift=-0.5em,yshift=2pt]processor1.north) -- ([xshift=-4em,yshift=-2pt]serverbox.south) node [pos=0.5,align=right,xshift=-2em] (pushlabel) {\footnotesize{Push($\cdot$)}}; \draw[->,very thick,red!70] ([xshift=-0.5em,yshift=2pt]processor1.north) -- ([xshift=-4em,yshift=-2pt]serverbox.south) node [pos=0.5,align=right,xshift=-2em] (pushlabel) {\footnotesize{Push($\cdot$)}};
\draw[<-,very thick,blue] ([xshift=0.5em,yshift=2pt]processor1.north) -- ([xshift=-3em,yshift=-2pt]serverbox.south); \draw[<-,very thick,ublue!90] ([xshift=0.5em,yshift=2pt]processor1.north) -- ([xshift=-3em,yshift=-2pt]serverbox.south);
} }
%%%%%%%%%%% %%%%%%%%%%%
...@@ -110,13 +110,13 @@ ...@@ -110,13 +110,13 @@
} }
{ {
\draw[->,very thick,red] ([xshift=-0.5em,yshift=2pt]processor2.north) -- ([xshift=-0.5em,yshift=-2pt]serverbox.south) node [pos=0.5,align=right,xshift=-2em] (pushlabel) {\footnotesize{$\frac{\partial J}{\partial {\bm \theta}}$}};; \draw[->,very thick,red!70] ([xshift=-0.5em,yshift=2pt]processor2.north) -- ([xshift=-0.5em,yshift=-2pt]serverbox.south) node [pos=0.5,align=right,xshift=-2em] (pushlabel) {\footnotesize{$\frac{\partial J}{\partial {\bm \theta}}$}};;
\draw[<-,very thick,blue] ([xshift=0.5em,yshift=2pt]processor2.north) -- ([xshift=0.5em,yshift=-2pt]serverbox.south) node [pos=0.5,align=left,xshift=2.2em] (fetchlabel) {\footnotesize{${\bm \theta}_{\textrm{new}}$}};;; \draw[<-,very thick,ublue] ([xshift=0.5em,yshift=2pt]processor2.north) -- ([xshift=0.5em,yshift=-2pt]serverbox.south) node [pos=0.5,align=left,xshift=2.2em] (fetchlabel) {\footnotesize{${\bm \theta}_{\textrm{new}}$}};;;
\draw[->,very thick,red] ([xshift=-0.5em,yshift=2pt]processor3.north) -- \draw[->,very thick,red!70] ([xshift=-0.5em,yshift=2pt]processor3.north) --
([xshift=3em,yshift=-2pt]serverbox.south); ([xshift=3em,yshift=-2pt]serverbox.south);
\draw[<-,very thick,blue] ([xshift=0.5em,yshift=2pt]processor3.north) -- ([xshift=4em,yshift=-2pt]serverbox.south) node [pos=0.49,align=left,xshift=2.2em] (fetchlabel) {\footnotesize{Fetch($\cdot$)}}; \draw[<-,very thick,ublue] ([xshift=0.5em,yshift=2pt]processor3.north) -- ([xshift=4em,yshift=-2pt]serverbox.south) node [pos=0.49,align=left,xshift=2.2em] (fetchlabel) {\footnotesize{Fetch($\cdot$)}};
\draw[->,very thick,red] ([xshift=-0.5em,yshift=2pt]processor1.north) -- ([xshift=-4em,yshift=-2pt]serverbox.south) node [pos=0.5,align=right,xshift=-2em] (pushlabel) {\footnotesize{Push($\cdot$)}}; \draw[->,very thick,red!70] ([xshift=-0.5em,yshift=2pt]processor1.north) -- ([xshift=-4em,yshift=-2pt]serverbox.south) node [pos=0.5,align=right,xshift=-2em] (pushlabel) {\footnotesize{Push($\cdot$)}};
\draw[<-,very thick,blue] ([xshift=0.5em,yshift=2pt]processor1.north) -- ([xshift=-3em,yshift=-2pt]serverbox.south); \draw[<-,very thick,ublue] ([xshift=0.5em,yshift=2pt]processor1.north) -- ([xshift=-3em,yshift=-2pt]serverbox.south);
} }
%%%%%%%%%%% %%%%%%%%%%%
......
...@@ -10,7 +10,7 @@ ...@@ -10,7 +10,7 @@
\draw [->,thick] (-2.2,0) -- (2.2,0); \draw [->,thick] (-2.2,0) -- (2.2,0);
\draw [->,thick] (0,0) -- (0,2); \draw [->,thick] (0,0) -- (0,2);
\draw [-] (-0.05,1) -- (0.05,1); \draw [-] (-0.05,1) -- (0.05,1);
\node [anchor=north,inner sep=1pt] (labelb) at (0,-0.2) {\footnotesize{(b)}}; \node [anchor=north,inner sep=1pt] (labelb) at (0,-0.2) {\small{(b)}};
} }
{ {
\draw [->,thick] (-2.2,0) -- (2.2,0); \draw [->,thick] (-2.2,0) -- (2.2,0);
...@@ -35,7 +35,7 @@ ...@@ -35,7 +35,7 @@
\draw [-] (-0.05,1) -- (0.05,1); \draw [-] (-0.05,1) -- (0.05,1);
\node [anchor=east,inner sep=1pt] (label1) at (0,1.18) {\tiny{1}}; \node [anchor=east,inner sep=1pt] (label1) at (0,1.18) {\tiny{1}};
\node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}}; \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
\node [anchor=north,inner sep=1pt] (labela) at (0,-0.2) {\footnotesize{(a)}}; \node [anchor=north,inner sep=1pt] (labela) at (0,-0.2) {\small{(a)}};
} }
{ {
\draw [->,thick] (-2.2,0) -- (2.2,0); \draw [->,thick] (-2.2,0) -- (2.2,0);
......
\begin{tikzpicture} \begin{tikzpicture}
\begin{scope} \begin{scope}
\tikzstyle{rnnnode} = [draw,inner sep=5pt,minimum width=4em,minimum height=1.5em,fill=green!30!white,blur shadow={shadow xshift=1pt,shadow yshift=-1pt}] \tikzstyle{rnnnode} = [draw,inner sep=5pt,minimum width=4em,minimum height=1.5em,fill=green!20!white,blur shadow={shadow xshift=1pt,shadow yshift=-1pt}]
{ {
\node [anchor=west,rnnnode] (node11) at (0,0) {\scriptsize{RNN Cell}}; \node [anchor=west,rnnnode] (node11) at (0,0) {\scriptsize{RNN Cell}};
\node [anchor=west,rnnnode] (node12) at ([xshift=2em]node11.east) {\scriptsize{RNN Cell}}; \node [anchor=west,rnnnode] (node12) at ([xshift=2em]node11.east) {\scriptsize{RNN Cell}};
\node [anchor=west,rnnnode] (node13) at ([xshift=2em]node12.east) {\scriptsize{RNN Cell}}; \node [anchor=west,rnnnode] (node13) at ([xshift=2em]node12.east) {\scriptsize{RNN Cell}};
\node [anchor=west,rnnnode] (node14) at ([xshift=2em]node13.east) {\scriptsize{RNN Cell}}; \node [anchor=west,rnnnode] (node14) at ([xshift=2em]node13.east) {\scriptsize{RNN Cell}};
} }
\node [anchor=north,rnnnode,fill=red!30!white] (e1) at ([yshift=-1.2em]node11.south) {\tiny{${\mathbi{e}}_1={\mathbi{o}}_1{\mathbi{C}}$}}; \node [anchor=north,rnnnode,fill=red!20!white] (e1) at ([yshift=-1.2em]node11.south) {\tiny{${\mathbi{e}}_1={\mathbi{o}}_1{\mathbi{C}}$}};
\node [anchor=north,rnnnode,fill=red!30!white] (e2) at ([yshift=-1.2em]node12.south) {\tiny{${\mathbi{e}}_2={\mathbi{o}}_2{\mathbi{C}}$}}; \node [anchor=north,rnnnode,fill=red!20!white] (e2) at ([yshift=-1.2em]node12.south) {\tiny{${\mathbi{e}}_2={\mathbi{o}}_2{\mathbi{C}}$}};
\node [anchor=north,rnnnode,fill=red!30!white] (e3) at ([yshift=-1.2em]node13.south) {\tiny{${\mathbi{e}}_3={\mathbi{o}}_3{\mathbi{C}}$}}; \node [anchor=north,rnnnode,fill=red!20!white] (e3) at ([yshift=-1.2em]node13.south) {\tiny{${\mathbi{e}}_3={\mathbi{o}}_3{\mathbi{C}}$}};
\node [anchor=north,rnnnode,fill=red!30!white] (e4) at ([yshift=-1.2em]node14.south) {\tiny{${\mathbi{e}}_4={\mathbi{o}}_4{\mathbi{C}}$}}; \node [anchor=north,rnnnode,fill=red!20!white] (e4) at ([yshift=-1.2em]node14.south) {\tiny{${\mathbi{e}}_4={\mathbi{o}}_4{\mathbi{C}}$}};
\node [anchor=north] (w1) at ([yshift=-1em]e1.south) {\footnotesize{${\mathbi{o}}_1$}}; \node [anchor=north] (w1) at ([yshift=-1em]e1.south) {\footnotesize{${\mathbi{o}}_1$}};
\node [anchor=north] (w2) at ([yshift=-1em]e2.south) {\footnotesize{${\mathbi{o}}_2$}}; \node [anchor=north] (w2) at ([yshift=-1em]e2.south) {\footnotesize{${\mathbi{o}}_2$}};
\node [anchor=north] (w3) at ([yshift=-1em]e3.south) {\footnotesize{${\mathbi{o}}_3$}}; \node [anchor=north] (w3) at ([yshift=-1em]e3.south) {\footnotesize{${\mathbi{o}}_3$}};
...@@ -32,10 +32,10 @@ ...@@ -32,10 +32,10 @@
\node [anchor=south,rnnnode] (node23) at ([yshift=1.5em]node13.north) {\scriptsize{RNN Cell}}; \node [anchor=south,rnnnode] (node23) at ([yshift=1.5em]node13.north) {\scriptsize{RNN Cell}};
\node [anchor=south,rnnnode] (node24) at ([yshift=1.5em]node14.north) {\scriptsize{RNN Cell}}; \node [anchor=south,rnnnode] (node24) at ([yshift=1.5em]node14.north) {\scriptsize{RNN Cell}};
\node [anchor=south,rnnnode,fill=blue!30!white] (node31) at ([yshift=1.5em]node21.north) {\scriptsize{Softmax($\cdot$)}}; \node [anchor=south,rnnnode,fill=blue!20!white] (node31) at ([yshift=1.5em]node21.north) {\scriptsize{Softmax($\cdot$)}};
\node [anchor=south,rnnnode,fill=blue!30!white] (node32) at ([yshift=1.5em]node22.north) {\scriptsize{Softmax($\cdot$)}}; \node [anchor=south,rnnnode,fill=blue!20!white] (node32) at ([yshift=1.5em]node22.north) {\scriptsize{Softmax($\cdot$)}};
\node [anchor=south,rnnnode,fill=blue!30!white] (node33) at ([yshift=1.5em]node23.north) {\scriptsize{Softmax($\cdot$)}}; \node [anchor=south,rnnnode,fill=blue!20!white] (node33) at ([yshift=1.5em]node23.north) {\scriptsize{Softmax($\cdot$)}};
\node [anchor=south,rnnnode,fill=blue!30!white] (node34) at ([yshift=1.5em]node24.north) {\scriptsize{Softmax($\cdot$)}}; \node [anchor=south,rnnnode,fill=blue!20!white] (node34) at ([yshift=1.5em]node24.north) {\scriptsize{Softmax($\cdot$)}};
} }
{ {
......
\begin{tikzpicture} \begin{tikzpicture}
\begin{scope} \begin{scope}
\tikzstyle{rnnnode} = [draw,inner sep=5pt,minimum width=4em,minimum height=1.5em,fill=green!30!white,blur shadow={shadow xshift=1pt,shadow yshift=-1pt}] \tikzstyle{rnnnode} = [draw,inner sep=5pt,minimum width=4em,minimum height=1.5em,fill=green!20!white,blur shadow={shadow xshift=1pt,shadow yshift=-1pt}]
\node [anchor=west,rnnnode] (node11) at (0,0) {\scriptsize{RNN Cell}}; \node [anchor=west,rnnnode] (node11) at (0,0) {\scriptsize{RNN Cell}};
\node [anchor=west,rnnnode] (node12) at ([xshift=2em]node11.east) {\scriptsize{RNN Cell}}; \node [anchor=west,rnnnode] (node12) at ([xshift=2em]node11.east) {\scriptsize{RNN Cell}};
\node [anchor=west,rnnnode] (node13) at ([xshift=2em]node12.east) {\scriptsize{RNN Cell}}; \node [anchor=west,rnnnode] (node13) at ([xshift=2em]node12.east) {\scriptsize{RNN Cell}};
\node [anchor=west,rnnnode] (node14) at ([xshift=2em]node13.east) {\scriptsize{RNN Cell}}; \node [anchor=west,rnnnode] (node14) at ([xshift=2em]node13.east) {\scriptsize{RNN Cell}};
\node [anchor=north,rnnnode,fill=red!30!white] (e1) at ([yshift=-1.2em]node11.south) {\scriptsize{embedding}}; \node [anchor=north,rnnnode,fill=red!20!white] (e1) at ([yshift=-1.2em]node11.south) {\scriptsize{embedding}};
\node [anchor=north,rnnnode,fill=red!30!white] (e2) at ([yshift=-1.2em]node12.south) {\scriptsize{embedding}}; \node [anchor=north,rnnnode,fill=red!20!white] (e2) at ([yshift=-1.2em]node12.south) {\scriptsize{embedding}};
\node [anchor=north,rnnnode,fill=red!30!white] (e3) at ([yshift=-1.2em]node13.south) {\scriptsize{embedding}}; \node [anchor=north,rnnnode,fill=red!20!white] (e3) at ([yshift=-1.2em]node13.south) {\scriptsize{embedding}};
\node [anchor=north,rnnnode,fill=red!30!white] (e4) at ([yshift=-1.2em]node14.south) {\scriptsize{embedding}}; \node [anchor=north,rnnnode,fill=red!20!white] (e4) at ([yshift=-1.2em]node14.south) {\scriptsize{embedding}};
\node [anchor=north] (w1) at ([yshift=-1em]e1.south) {\footnotesize{亚伦}}; \node [anchor=north] (w1) at ([yshift=-1em]e1.south) {\footnotesize{亚伦}};
\node [anchor=north] (w2) at ([yshift=-1em]e2.south) {\footnotesize{任职}}; \node [anchor=north] (w2) at ([yshift=-1em]e2.south) {\footnotesize{任职}};
\node [anchor=north] (w3) at ([yshift=-1em]e3.south) {\footnotesize{}}; \node [anchor=north] (w3) at ([yshift=-1em]e3.south) {\footnotesize{}};
......
...@@ -7,7 +7,7 @@ ...@@ -7,7 +7,7 @@
\node [fill=green!15,inner sep=0pt,minimum height=0.49cm,minimum width=0.49cm](vector1) at (\x,0.25) {$\number\value{mycount1}$}; \node [fill=green!15,inner sep=0pt,minimum height=0.49cm,minimum width=0.49cm](vector1) at (\x,0.25) {$\number\value{mycount1}$};
\addtocounter{mycount1}{1}; \addtocounter{mycount1}{1};
} }
\node [anchor=north] (labela) at ([xshift=-1.2em,yshift=-0em]vector1.south) {\footnotesize{(a) }}; \node [anchor=north] (labela) at ([xshift=-1.2em,yshift=-0em]vector1.south) {\small{(a) }};
\end{scope} \end{scope}
\begin{scope}[xshift=1.2in] \begin{scope}[xshift=1.2in]
...@@ -21,7 +21,7 @@ ...@@ -21,7 +21,7 @@
\node [fill=red!15,inner sep=0pt,minimum height=0.49cm,minimum width=0.49cm] at (\x,0.25) {$\number\value{mycount2}$}; \node [fill=red!15,inner sep=0pt,minimum height=0.49cm,minimum width=0.49cm] at (\x,0.25) {$\number\value{mycount2}$};
\addtocounter{mycount2}{1}; \addtocounter{mycount2}{1};
} }
\node [anchor=north] (labelb) at ([xshift=0.3em,yshift=-0em]vector2.south) {\footnotesize{(b) }}; \node [anchor=north] (labelb) at ([xshift=0.3em,yshift=-0em]vector2.south) {\small{(b) }};
\end{scope} \end{scope}
\begin{scope}[yshift=-0.6in] \begin{scope}[yshift=-0.6in]
...@@ -47,7 +47,7 @@ ...@@ -47,7 +47,7 @@
\draw[decorate,thick,decoration={brace,mirror,raise=0.2em}] (3.05,-0.2) -- (6,-0.2); \draw[decorate,thick,decoration={brace,mirror,raise=0.2em}] (3.05,-0.2) -- (6,-0.2);
\node [anchor=north] (subtensor1) at (1.5,-0.4) {\footnotesize{$3 \times 2$ sub-tensor}}; \node [anchor=north] (subtensor1) at (1.5,-0.4) {\footnotesize{$3 \times 2$ sub-tensor}};
\node [anchor=north] (subtensor1) at (4.5,-0.4) {\footnotesize{$3 \times 2$ sub-tensor}}; \node [anchor=north] (subtensor1) at (4.5,-0.4) {\footnotesize{$3 \times 2$ sub-tensor}};
\node [anchor=north] (labelc) at (3,-0.8) {\footnotesize{(c)}}; \node [anchor=north] (labelc) at (3,-0.8) {\small{(c)}};
\end{scope} \end{scope}
\end{tikzpicture} \end{tikzpicture}
......
...@@ -40,7 +40,7 @@ ...@@ -40,7 +40,7 @@
\draw[-,ublue](0,-0.8)..controls(0.5,-0.8) and (0.6,-0.85)..(0.6,-0.9)..controls(0.6,-0.93)and (0.5,-0.91)..(0.3,-0.88)..controls(0.2,-0.87)and (0.1,-0.86)..(0,-0.86)..controls(-0.1,-0.86)and(-0.2,-0.87)..(-0.3,-0.88)..controls(-0.5,-0.91) and(-0.6,-0.93) ..(-0.6,-0.9)..controls(-0.6,-0.85)and (-0.5,-0.8)..(0,-0.8); \draw[-,ublue](0,-0.8)..controls(0.5,-0.8) and (0.6,-0.85)..(0.6,-0.9)..controls(0.6,-0.93)and (0.5,-0.91)..(0.3,-0.88)..controls(0.2,-0.87)and (0.1,-0.86)..(0,-0.86)..controls(-0.1,-0.86)and(-0.2,-0.87)..(-0.3,-0.88)..controls(-0.5,-0.91) and(-0.6,-0.93) ..(-0.6,-0.9)..controls(-0.6,-0.85)and (-0.5,-0.8)..(0,-0.8);
\node [anchor=north] (labela) at (0,-2.7) {\footnotesize{(a)梯度下降算法中的``锯齿''现象}}; \node [anchor=north] (labela) at (0,-2.7) {\small{(a)梯度下降算法中的``锯齿''现象}};
\end{scope} \end{scope}
...@@ -78,7 +78,7 @@ ...@@ -78,7 +78,7 @@
\draw[-,ublue](0,-0.8)..controls(0.5,-0.8) and (0.6,-0.85)..(0.6,-0.9)..controls(0.6,-0.93)and (0.5,-0.91)..(0.3,-0.88)..controls(0.2,-0.87)and (0.1,-0.86)..(0,-0.86)..controls(-0.1,-0.86)and(-0.2,-0.87)..(-0.3,-0.88)..controls(-0.5,-0.91) and(-0.6,-0.93) ..(-0.6,-0.9)..controls(-0.6,-0.85)and (-0.5,-0.8)..(0,-0.8); \draw[-,ublue](0,-0.8)..controls(0.5,-0.8) and (0.6,-0.85)..(0.6,-0.9)..controls(0.6,-0.93)and (0.5,-0.91)..(0.3,-0.88)..controls(0.2,-0.87)and (0.1,-0.86)..(0,-0.86)..controls(-0.1,-0.86)and(-0.2,-0.87)..(-0.3,-0.88)..controls(-0.5,-0.91) and(-0.6,-0.93) ..(-0.6,-0.9)..controls(-0.6,-0.85)and (-0.5,-0.8)..(0,-0.8);
\node [anchor=north] (labelb) at (0,-3) {\footnotesize{(b)Momentum梯度下降算法更加``平滑''地更新}}; \node [anchor=north] (labelb) at (0,-3) {\small{(b)Momentum梯度下降算法更加``平滑''地更新}};
\end{scope} \end{scope}
\end{tikzpicture} \end{tikzpicture}
......
%%%------------------------------------------------------------------------------------------------------------ %%%------------------------------------------------------------------------------------------------------------
\begin{tikzpicture} \begin{tikzpicture}
\tikzstyle{neuron} = [rectangle,draw,thick,fill=red!30,red!35,minimum height=2em,minimum width=2em,font=\small] \tikzstyle{neuron} = [rectangle,draw,thick,fill=red!25,red!30,minimum height=2em,minimum width=2em,font=\small]
\node[neuron,anchor=north] (a1) at (0,0) {}; \node[neuron,anchor=north] (a1) at (0,0) {};
\draw[->,thick] ([xshift=-2em,yshift=0em]a1.south) to ([xshift=3em,yshift=0em]a1.south); \draw[->,thick] ([xshift=-2em,yshift=0em]a1.south) to ([xshift=3em,yshift=0em]a1.south);
\draw[->,thick] ([xshift=0em,yshift=-4em]a1.west) to ([xshift=0em,yshift=2em]a1.west); \draw[->,thick] ([xshift=0em,yshift=-4em]a1.west) to ([xshift=0em,yshift=2em]a1.west);
...@@ -11,7 +11,7 @@ ...@@ -11,7 +11,7 @@
\node [anchor=west] (x) at ([xshift=-0.7em,yshift=1em]a1.south) {\Large{$\textbf{F}$}}; \node [anchor=west] (x) at ([xshift=-0.7em,yshift=1em]a1.south) {\Large{$\textbf{F}$}};
{ {
\tikzstyle{neuron} = [rectangle,draw,thick,fill=red!30,red!35,minimum height=2em,minimum width=2em,font=\small] \tikzstyle{neuron} = [rectangle,draw,thick,fill=red!25,red!30,minimum height=2em,minimum width=2em,font=\small]
\node[neuron,anchor=north] (a2) at ([xshift=10em,yshift=0em]a1.south) {}; \node[neuron,anchor=north] (a2) at ([xshift=10em,yshift=0em]a1.south) {};
\draw[->,thick] ([xshift=-2em,yshift=0em]a2.north) to ([xshift=3em,yshift=0em]a2.north); \draw[->,thick] ([xshift=-2em,yshift=0em]a2.north) to ([xshift=3em,yshift=0em]a2.north);
\draw[->,thick] ([xshift=0em,yshift=-2em]a2.west) to ([xshift=0em,yshift=4em]a2.west); \draw[->,thick] ([xshift=0em,yshift=-2em]a2.west) to ([xshift=0em,yshift=4em]a2.west);
...@@ -33,7 +33,7 @@ ...@@ -33,7 +33,7 @@
} }
{ {
\tikzstyle{neuron} = [rectangle,draw,thick,fill=red!30,red!35,minimum height=2em,minimum width=2em,font=\small] \tikzstyle{neuron} = [rectangle,draw,thick,fill=red!25,red!30,minimum height=2em,minimum width=2em,font=\small]
\node[neuron,anchor=north] (a3) at ([xshift=11em,yshift=2.05em]a2.south) {}; \node[neuron,anchor=north] (a3) at ([xshift=11em,yshift=2.05em]a2.south) {};
\draw[->,thick] ([xshift=-3em,yshift=0em]a3.north) to ([xshift=2em,yshift=0em]a3.north); \draw[->,thick] ([xshift=-3em,yshift=0em]a3.north) to ([xshift=2em,yshift=0em]a3.north);
\draw[->,thick] ([xshift=-1em,yshift=-2em]a3.west) to ([xshift=-1em,yshift=4em]a3.west); \draw[->,thick] ([xshift=-1em,yshift=-2em]a3.west) to ([xshift=-1em,yshift=4em]a3.west);
......
...@@ -8,7 +8,7 @@ ...@@ -8,7 +8,7 @@
\draw [-] (-0.05,1) -- (0.05,1); \draw [-] (-0.05,1) -- (0.05,1);
\node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}}; \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
\node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}}; \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
\node [anchor=south east,inner sep=1pt] (labela) at (0.2,-0.5) {\footnotesize{(a)}}; \node [anchor=south east,inner sep=1pt] (labela) at (0.2,-0.5) {\small{(a)}};
} }
{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {\scriptsize{\ $w_{11}=100$}\\[-0ex] {\scriptsize{\ $b_1=-4$}}};} {\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {\scriptsize{\ $w_{11}=100$}\\[-0ex] {\scriptsize{\ $b_1=-4$}}};}
{\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0.5,0) -- (0.5,1) -- (1.5,1);} {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0.5,0) -- (0.5,1) -- (1.5,1);}
...@@ -21,7 +21,7 @@ ...@@ -21,7 +21,7 @@
\draw [-] (-0.05,1) -- (0.05,1); \draw [-] (-0.05,1) -- (0.05,1);
\node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}}; \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
\node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}}; \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
\node [anchor=south east,inner sep=1pt] (labelb) at (0.2,-0.5) {\footnotesize{(b)}}; \node [anchor=south east,inner sep=1pt] (labelb) at (0.2,-0.5) {\small{(b)}};
} }
{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {{\scriptsize{\ $w'_{11}=0.9$}}};} {\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {{\scriptsize{\ $w'_{11}=0.9$}}};}
{\draw [-,very thick,ublue,rounded corners=0.1em] (-1.8,0) -- (0.5,0) -- (0.5,0.9) -- (1.8,0.9);} {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.8,0) -- (0.5,0) -- (0.5,0.9) -- (1.8,0.9);}
...@@ -35,7 +35,7 @@ ...@@ -35,7 +35,7 @@
\draw [-] (-0.05,1) -- (0.05,1); \draw [-] (-0.05,1) -- (0.05,1);
\node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}}; \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
\node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}}; \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
\node [anchor=south east,inner sep=1pt] (labelc) at (0.2,-0.5) {\footnotesize{(c)}}; \node [anchor=south east,inner sep=1pt] (labelc) at (0.2,-0.5) {\small{(c)}};
} }
{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {{\scriptsize{\ $w'_{11}=0.7$}}};} {\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {{\scriptsize{\ $w'_{11}=0.7$}}};}
{\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0.5,0) -- (0.5,0.7) -- (1.5,0.7);} {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0.5,0) -- (0.5,0.7) -- (1.5,0.7);}
......
...@@ -8,7 +8,7 @@ ...@@ -8,7 +8,7 @@
\draw [-] (-0.05,1) -- (0.05,1); \draw [-] (-0.05,1) -- (0.05,1);
\node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}}; \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
\node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}}; \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
\node [anchor=south east,inner sep=1pt] (labela) at (0.2,-0.5) {\footnotesize{(a)}}; \node [anchor=south east,inner sep=1pt] (labela) at (0.2,-0.5) {\small{(a)}};
} }
{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {\scriptsize{\ $w_{11}=1$}\\[-0ex] \scriptsize{\ $b_1=0$}};} {\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {\scriptsize{\ $w_{11}=1$}\\[-0ex] \scriptsize{\ $b_1=0$}};}
{\draw [-,very thick,ublue,domain=-1.5:1.5,samples=100] plot (\x,{1/(1+exp(-2*\x))});} {\draw [-,very thick,ublue,domain=-1.5:1.5,samples=100] plot (\x,{1/(1+exp(-2*\x))});}
...@@ -21,7 +21,7 @@ ...@@ -21,7 +21,7 @@
\draw [-] (-0.05,1) -- (0.05,1); \draw [-] (-0.05,1) -- (0.05,1);
\node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}}; \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
\node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}}; \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
\node [anchor=south east,inner sep=1pt] (labelb) at (0.2,-0.5) {\footnotesize{(b)}}; \node [anchor=south east,inner sep=1pt] (labelb) at (0.2,-0.5) {\small{(b)}};
} }
{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {{\scriptsize{\ $w_{11}=10$}}\\[-0ex] \scriptsize{\ $b_1=0$}};} {\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {{\scriptsize{\ $w_{11}=10$}}\\[-0ex] \scriptsize{\ $b_1=0$}};}
{\draw [-,very thick,ublue,domain=-1.5:1.5,samples=100] plot (\x,{1/(1+exp(-4*\x))});} {\draw [-,very thick,ublue,domain=-1.5:1.5,samples=100] plot (\x,{1/(1+exp(-4*\x))});}
...@@ -34,7 +34,7 @@ ...@@ -34,7 +34,7 @@
\draw [-] (-0.05,1) -- (0.05,1); \draw [-] (-0.05,1) -- (0.05,1);
\node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}}; \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
\node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}}; \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
\node [anchor=south east,inner sep=1pt] (labelc) at (0.2,-0.5) {\footnotesize{(c)}}; \node [anchor=south east,inner sep=1pt] (labelc) at (0.2,-0.5) {\small{(c)}};
} }
{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {{\scriptsize{\ $w_{11}=100$}}\\[-0ex] \scriptsize{\ $b_1=0$}};} {\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {{\scriptsize{\ $w_{11}=100$}}\\[-0ex] \scriptsize{\ $b_1=0$}};}
{\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0,0) -- (0,1) -- (1.5,1);} {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0,0) -- (0,1) -- (1.5,1);}
......
% !Mode:: "TeX:UTF-8" % !Mode:: "TeX:UTF-8"
% !TEX encoding = UTF-8 Unicode % !TEX encoding = UTF-8 Unicode
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% chapter 1------------------------------------------------------ %%%%% chapter 1------------------------------------------------------
...@@ -836,7 +836,6 @@ ...@@ -836,7 +836,6 @@
%%%%% chapter 2------------------------------------------------------ %%%%% chapter 2------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% chapter 3------------------------------------------------------ %%%%% chapter 3------------------------------------------------------
...@@ -872,10 +871,10 @@ ...@@ -872,10 +871,10 @@
publisher={IEEE Conference on Artificial Intelligence Application}, publisher={IEEE Conference on Artificial Intelligence Application},
} }
@article{张小衡1997中文机构名称的识别与分析, @inproceedings{张小衡1997中文机构名称的识别与分析,
title={中文机构名称的识别与分析}, title={中文机构名称的识别与分析},
author={张小衡 and 王玲玲}, author={张小衡 and 王玲玲},
journal={中文信息学报}, publisher={中文信息学报},
volume={11}, volume={11},
number={4}, number={4},
pages={22-33}, pages={22-33},
...@@ -894,50 +893,50 @@ ...@@ -894,50 +893,50 @@
year = {2016}, year = {2016},
} }
@article{Baum1966Statistical, @inproceedings{Baum1966Statistical,
title={Statistical Inference for Probabilistic Functions of Finite State Markov Chains}, title={Statistical Inference for Probabilistic Functions of Finite State Markov Chains},
author={Baum, Leonard E and Petrie, Ted}, author={Baum, Leonard E and Petrie, Ted},
journal={Annals of Mathematical Stats}, publisher={Annals of Mathematical Stats},
volume={37}, volume={37},
number={6}, number={6},
pages={1554-1563}, pages={1554-1563},
year={1966}, year={1966},
} }
@article{baum1970maximization, @inproceedings{baum1970maximization,
title={A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains}, title={A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains},
author={Baum, Leonard E and Petrie, Ted and Soules, George and Weiss, Norman}, author={Baum, Leonard E and Petrie, Ted and Soules, George and Weiss, Norman},
journal={Annals of Mathematical Stats}, publisher={Annals of Mathematical Stats},
volume={41}, volume={41},
number={1}, number={1},
pages={164--171}, pages={164--171},
year={1970}, year={1970},
} }
@article{1977Maximum, @inproceedings{1977Maximum,
title={Maximum likelihood from incomplete data via the EM algorithm}, title={Maximum likelihood from incomplete data via the EM algorithm},
author={Dempster, Arthur P and Laird, Nan M and Rubin, Donald B}, author={Dempster, Arthur P and Laird, Nan M and Rubin, Donald B},
journal={Journal of the Royal Statistical Society: Series B (Methodological)}, publisher={Journal of the Royal Statistical Society: Series B (Methodological)},
volume={39}, volume={39},
number={1}, number={1},
pages={1--22}, pages={1--22},
year={1977} year={1977}
} }
@article{1967Error, @inproceedings{1967Error,
title={Error bounds for convolutional codes and an asymptotically optimum decoding algorithm}, title={Error bounds for convolutional codes and an asymptotically optimum decoding algorithm},
author={Viterbi, Andrew}, author={Viterbi, Andrew},
journal={IEEE Transactions on Information Theory}, publisher={IEEE Transactions on Information Theory},
volume={13}, volume={13},
number={2}, number={2},
pages={260-269}, pages={260-269},
year={1967}, year={1967},
} }
@article{harrington2013机器学习实战, @inproceedings{harrington2013机器学习实战,
title={机器学习实战}, title={机器学习实战},
author={Harrington, Peter}, author={Harrington, Peter},
journal={人民邮电出版社, 北京}, publisher={人民邮电出版社, 北京},
year={2013} year={2013}
} }
...@@ -969,17 +968,17 @@ ...@@ -969,17 +968,17 @@
pages = {92--97}, pages = {92--97},
} }
@article{2015Bidirectional, @inproceedings{2015Bidirectional,
title={Bidirectional LSTM-CRF Models for Sequence Tagging}, title={Bidirectional LSTM-CRF Models for Sequence Tagging},
author={ Huang, Zhiheng and Xu, Wei and Yu, Kai }, author={ Huang, Zhiheng and Xu, Wei and Yu, Kai },
journal={CoRR}, publisher={CoRR},
year={2015}, year={2015},
} }
@article{chiu2016named, @inproceedings{chiu2016named,
title={Named entity recognition with bidirectional LSTM-CNNs}, title={Named entity recognition with bidirectional LSTM-CNNs},
author={Chiu, Jason PC and Nichols, Eric}, author={Chiu, Jason PC and Nichols, Eric},
journal={Transactions of the Association for Computational Linguistics}, publisher={Transactions of the Association for Computational Linguistics},
volume={4}, volume={4},
pages={357--370}, pages={357--370},
year={2016}, year={2016},
...@@ -996,22 +995,22 @@ ...@@ -996,22 +995,22 @@
year = {2018}, year = {2018},
} }
@article{Li2020A, @inproceedings{Li2020A,
title={A Survey on Deep Learning for Named Entity Recognition}, title={A Survey on Deep Learning for Named Entity Recognition},
author={Li, Jing and Sun, Aixin and Han, Jianglei and Li, Chenliang}, author={Li, Jing and Sun, Aixin and Han, Jianglei and Li, Chenliang},
journal={IEEE Transactions on Knowledge and Data Engineering}, publisher={IEEE Transactions on Knowledge and Data Engineering},
volume={PP}, volume={PP},
number={99}, number={99},
pages={1-1}, pages={1-1},
year={2020}, year={2020},
} }
@article{devlin2019bert, @inproceedings{devlin2019bert,
title={Bert: Pre-training of deep bidirectional transformers for language understanding}, title={Bert: Pre-training of deep bidirectional transformers for language understanding},
author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina}, author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
year={2019}, year={2019},
pages = {4171--4186}, pages = {4171--4186},
journal = {Annual Meeting of the Association for Computational Linguistics}, publisher = {Annual Meeting of the Association for Computational Linguistics},
} }
@inproceedings{conneau2019unsupervised, @inproceedings{conneau2019unsupervised,
...@@ -1047,19 +1046,19 @@ ...@@ -1047,19 +1046,19 @@
year = {2016}, year = {2016},
} }
@article{刘挺1998最大概率分词问题及其解法, @inproceedings{刘挺1998最大概率分词问题及其解法,
title={最大概率分词问题及其解法}, title={最大概率分词问题及其解法},
author={刘挺 and 吴岩 and 王开铸}, author={刘挺 and 吴岩 and 王开铸},
journal={哈尔滨工业大学学报}, publisher={哈尔滨工业大学学报},
number={06}, number={06},
pages={37-41}, pages={37-41},
year={1998}, year={1998},
} }
@article{丁洁2010基于最大概率分词算法的中文分词方法研究, @inproceedings{丁洁2010基于最大概率分词算法的中文分词方法研究,
title={基于最大概率分词算法的中文分词方法研究}, title={基于最大概率分词算法的中文分词方法研究},
author={丁洁}, author={丁洁},
journal={科技信息}, publisher={科技信息},
number={21}, number={21},
pages={I0075--I0075}, pages={I0075--I0075},
year={2010} year={2010}
...@@ -1092,10 +1091,10 @@ ...@@ -1092,10 +1091,10 @@
year = {1998}, year = {1998},
} }
@article{1996Hidden, @inproceedings{1996Hidden,
title={Hidden Markov models.}, title={Hidden Markov models.},
author={ Eddy, Sean R }, author={ Eddy, Sean R },
journal={Current Opinion in Structural Biology}, publisher={Current Opinion in Structural Biology},
volume={6}, volume={6},
number={3}, number={3},
pages={361-5}, pages={361-5},
...@@ -1121,20 +1120,20 @@ ...@@ -1121,20 +1120,20 @@
publisher={John Wiley \& Sons} publisher={John Wiley \& Sons}
} }
@article{1998Support, @inproceedings{1998Support,
title={Support vector machines}, title={Support vector machines},
author={Hearst, Marti A. and Dumais, Susan T and Osuna, Edgar and Platt, John and Scholkopf, Bernhard}, author={Hearst, Marti A. and Dumais, Susan T and Osuna, Edgar and Platt, John and Scholkopf, Bernhard},
journal={IEEE Intelligent Systems \& Their Applications}, publisher={IEEE Intelligent Systems \& Their Applications},
volume={13}, volume={13},
number={4}, number={4},
pages={18-28}, pages={18-28},
year={1998}, year={1998},
} }
@article{2011Natural, @inproceedings{2011Natural,
title={Natural Language Processing (almost) from Scratch}, title={Natural Language Processing (almost) from Scratch},
author={ Collobert, Ronan and Weston, Jason and Bottou, Léon and Karlen, Michael and Kavukcuoglu, Koray and Kuksa, Pavel }, author={ Collobert, Ronan and Weston, Jason and Bottou, Léon and Karlen, Michael and Kavukcuoglu, Koray and Kuksa, Pavel },
journal={Journal of Machine Learning Research}, publisher={Journal of Machine Learning Research},
volume={12}, volume={12},
number={1}, number={1},
pages={2493-2537}, pages={2493-2537},
...@@ -1148,10 +1147,10 @@ ...@@ -1148,10 +1147,10 @@
publisher={Cambridge university press} publisher={Cambridge university press}
} }
@inproceedings{berger1996maximum,
title={A maximum entropy approach to natural language processing},
author={Berger, Adam and Della Pietra, Stephen A and Della Pietra, Vincent J},
publisher={Computational linguistics},
volume={22},
number={1},
pages={39--71},
...@@ -1161,7 +1160,7 @@
@book{mitchell1996m,
title={Machine Learning},
author={Mitchell, Tom},
publisher={McGraw Hill},
year={1996}
}
...@@ -1183,10 +1182,10 @@
publisher={Springer}
}
@inproceedings{bellman1966dynamic,
title={Dynamic programming},
author={Bellman, Richard},
publisher={Science},
volume={153},
number={3731},
pages={34--37},
...@@ -1195,7 +1194,6 @@
%%%%% chapter 3------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% chapter 4------------------------------------------------------
@inproceedings{DBLP:conf/acl/PapineniRWZ02,
...@@ -1208,7 +1206,7 @@
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2002}
}
@inproceedings{DBLP:journals/mt/ChurchH93,
title={Good applications for crummy machine translation},
author={Church, Kenneth W and Hovy, Eduard H},
volume={8},
...@@ -1294,10 +1292,10 @@
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2014}
}
@inproceedings{DBLP:journals/mt/Shiwen93,
author = {Shiwen Yu},
title = {Automatic evaluation of output quality for Machine Translation systems},
publisher = {Mach. Transl.},
volume = {8},
number = {1-2},
pages = {117--126},
...@@ -1400,20 +1398,20 @@
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2013},
}
@inproceedings{DBLP:journals/corr/MatsuoKS17,
author = {Junki Matsuo and
Mamoru Komachi and
Katsuhito Sudoh},
title = {Word-Alignment-Based Segment-Level Machine Translation Evaluation
using Word Embeddings},
publisher = {CoRR},
volume = {abs/1704.00380},
year = {2017}
}
@inproceedings{DBLP:journals/csl/GuzmanJMN17,
title={Machine translation evaluation with neural networks},
author={Guzm{\'a}n, Francisco and Joty, Shafiq and M{\`a}rquez, Llu{\'\i}s and Nakov, Preslav},
publisher={Computer Speech \& Language},
volume={45},
pages={180--200},
year={2017}
...@@ -1485,10 +1483,10 @@
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:journals/mt/BiciciGG13,
title={Predicting sentence translation quality using extrinsic and language independent features},
author={Bi{\c{c}}ici, Ergun and Groves, Declan and van Genabith, Josef},
publisher={Machine Translation},
volume={27},
number={3-4},
pages={171--192},
...@@ -1536,7 +1534,7 @@
Greg Corrado and
Jeffrey Dean},
title = {Efficient Estimation of Word Representations in Vector Space},
publisher = {arXiv preprint arXiv:1301.3781},
year = {2013}
}
@inproceedings{DBLP:conf/icml/LeM14,
...@@ -1588,10 +1586,10 @@
author={Radford, Alec and Narasimhan, Karthik and Salimans, Tim and Sutskever, Ilya},
year={2018}
}
@inproceedings{DBLP:journals/mtcl/Carroll66,
author = {John B. Carroll},
title = {An experiment in evaluating the quality of translations},
publisher = {Mech. Transl. Comput. Linguistics},
volume = {9},
number = {3-4},
pages = {55--66},
...@@ -1610,14 +1608,14 @@
pages={224--231},
year={2003}
}
@inproceedings{DBLP:journals/mt/PrzybockiPBS09,
author = {Mark A. Przybocki and
Kay Peterson and
Sebastien Bronsart and
Gregory A. Sanders},
title = {The {NIST} 2008 Metrics for machine translation challenge - overview,
methodology, metrics, and results},
publisher = {Machine Translation},
volume = {23},
number = {2-3},
pages = {71--103},
...@@ -1703,7 +1701,7 @@
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2007}
}
@inproceedings{DBLP:journals/mt/PadoCGJM09,
author = {Sebastian Pad{\'{o}} and
Daniel M. Cer and
Michel Galley and
...@@ -1711,7 +1709,7 @@
Christopher D. Manning},
title = {Measuring machine translation quality as semantic equivalence: {A}
metric based on entailment features},
publisher = {Machine Translation},
volume = {23},
number = {2-3},
pages = {181--193},
...@@ -1796,7 +1794,7 @@
publisher={European Association for Machine Translation},
year={2011}
}
@inproceedings{DBLP:journals/mt/CostaLLCC15,
author = {{\^{A}}ngela Costa and
Wang Ling and
Tiago Lu{\'{\i}}s and
...@@ -1804,7 +1802,7 @@
Lu{\'{\i}}sa Coheur},
title = {A linguistically motivated taxonomy for Machine Translation error
analysis},
publisher = {Machine Translation},
volume = {29},
number = {2},
pages = {127--161},
...@@ -1862,10 +1860,10 @@
publisher={Proceedings of the Ninth Machine Translation Summit. New Orleans},
year={2003}
}
@inproceedings{pearson1920notes,
title={Notes on the history of correlation},
author={Pearson, Karl},
publisher={Biometrika},
volume={13},
number={1},
pages={25--45},
...@@ -1879,10 +1877,10 @@
pages={71--78},
year={2003}
}
@inproceedings{finch2004using,
title={Using a paraphraser to improve machine translation evaluation},
author={Finch, Andrew and Akiba, Yasuhiro and Sumita, Eiichiro},
publisher={International Joint Conference on Natural Language Processing},
year={2004}
}
@inproceedings{DBLP:conf/coling/HamonM08,
...@@ -1955,7 +1953,7 @@ publisher={Annual Meeting of the Association for Computational Linguistics},
pages={148--155},
year={2001}
}
@inproceedings{albrecht2008regression,
title={Regression for machine translation evaluation at the sentence level},
author={Albrecht, Joshua S and Hwa, Rebecca},
volume={22},
...@@ -2187,7 +2185,7 @@ year = {2012}
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2013}
}
@inproceedings{kepler2019unbabel,
title={Unbabel's Participation in the WMT19 Translation Quality Estimation Shared Task},
pages={78--84},
author={Kepler, F{\'a}bio and Tr{\'e}nous, Jonay and Treviso, Marcos and Vera, Miguel and G{\'o}is, Ant{\'o}nio and Farajian, M Amin and Lopes, Ant{\'o}nio V and Martins, Andr{\'e} FT},
...@@ -2220,7 +2218,7 @@ year = {2012}
year={2019},
publisher={Springer Nature}
}
@inproceedings{akaike1974new,
title={A new look at the statistical model identification},
author={Akaike, Hirotugu},
volume={19},
...@@ -3848,17 +3846,16 @@ year = {2012}
}
%%%%% chapter 8------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% chapter 9------------------------------------------------------
@inproceedings{brown1992class,
title={Class-based n-gram models of natural language},
author={Peter F. Brown and
Vincent J. Della Pietra and
Peter V. De Souza and
Jennifer C. Lai and
Robert L. Mercer},
publisher={Computational linguistics},
volume={18},
number={4},
pages={467--479},
...@@ -3874,79 +3871,79 @@ year = {2012}
year={2012}
}
@article{zaremba2014recurrent, @inproceedings{zaremba2014recurrent,
title={Recurrent Neural Network Regularization}, title={Recurrent Neural Network Regularization},
author={Wojciech Zaremba and author={Wojciech Zaremba and
Ilya Sutskever and Ilya Sutskever and
Oriol Vinyals}, Oriol Vinyals},
journal={arXiv: Neural and Evolutionary Computing}, publisher={arXiv: Neural and Evolutionary Computing},
year={2014} year={2014}
} }
@article{zilly2016recurrent, @inproceedings{zilly2016recurrent,
title={Recurrent Highway Networks}, title={Recurrent Highway Networks},
author={Julian G. Zilly and author={Julian G. Zilly and
Rupesh Kumar Srivastava and Rupesh Kumar Srivastava and
Jan Koutn{\'{\i}}k and Jan Koutn{\'{\i}}k and
J{\"{u}}rgen Schmidhuber}, J{\"{u}}rgen Schmidhuber},
journal={International Conference on Machine Learning}, publisher={International Conference on Machine Learning},
year={2016} year={2016}
} }
@article{merity2017regularizing, @inproceedings{merity2017regularizing,
title={Regularizing and optimizing LSTM language models}, title={Regularizing and optimizing LSTM language models},
author={Stephen Merity and author={Stephen Merity and
Nitish Shirish Keskar and Nitish Shirish Keskar and
Richard Socher}, Richard Socher},
journal={International Conference on Learning Representations}, publisher={International Conference on Learning Representations},
year={2017} year={2017}
} }
@article{radford2019language, @inproceedings{radford2019language,
title ={Language models are unsupervised multitask learners}, title ={Language models are unsupervised multitask learners},
author ={Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya}, author ={Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
journal ={OpenAI Blog}, publisher ={OpenAI Blog},
volume ={1}, volume ={1},
number ={8}, number ={8},
pages ={9}, pages ={9},
year ={2019} year ={2019}
} }
@article{baydin2017automatic, @inproceedings{baydin2017automatic,
title ={Automatic differentiation in machine learning: a survey}, title ={Automatic differentiation in machine learning: a survey},
author ={Baydin, At{\i}l{\i}m G{\"u}nes and Pearlmutter, Barak A and Radul, Alexey Andreyevich and Siskind, Jeffrey Mark}, author ={Baydin, At{\i}l{\i}m G{\"u}nes and Pearlmutter, Barak A and Radul, Alexey Andreyevich and Siskind, Jeffrey Mark},
journal ={Journal of Machine Learning Research}, publisher ={Journal of Machine Learning Research},
volume ={18}, volume ={18},
number ={1}, number ={1},
pages ={5595--5637}, pages ={5595--5637},
year ={2017} year ={2017}
} }
@article{qian1999momentum, @inproceedings{qian1999momentum,
author = {Ning Qian}, author = {Ning Qian},
title = {On the momentum term in gradient descent learning algorithms}, title = {On the momentum term in gradient descent learning algorithms},
journal = {Neural Networks}, publisher = {Neural Networks},
volume = {12}, volume = {12},
number = {1}, number = {1},
pages = {145--151}, pages = {145--151},
year = {1999}, year = {1999},
} }
@article{duchi2011adaptive, @inproceedings{duchi2011adaptive,
author = {John C. Duchi and author = {John C. Duchi and
Elad Hazan and Elad Hazan and
Yoram Singer}, Yoram Singer},
title = {Adaptive Subgradient Methods for Online Learning and Stochastic Optimization}, title = {Adaptive Subgradient Methods for Online Learning and Stochastic Optimization},
journal = {Journal of Machine Learning Research}, publisher = {Journal of Machine Learning Research},
volume = {12}, volume = {12},
pages = {2121--2159}, pages = {2121--2159},
year = {2011}, year = {2011},
} }
@article{tieleman2012rmsprop, @inproceedings{tieleman2012rmsprop,
title ={Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude}, title ={Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude},
author ={Tieleman, Tijmen and Hinton, Geoffrey}, author ={Tieleman, Tijmen and Hinton, Geoffrey},
journal ={COURSERA: Neural networks for machine learning}, publisher ={COURSERA: Neural networks for machine learning},
volume ={4}, volume ={4},
number ={2}, number ={2},
pages ={26--31}, pages ={26--31},
...@@ -3972,12 +3969,12 @@ year = {2012} ...@@ -3972,12 +3969,12 @@ year = {2012}
year = {2015} year = {2015}
} }
@article{Ba2016LayerN, @inproceedings{Ba2016LayerN,
author = {Lei Jimmy Ba and author = {Lei Jimmy Ba and
Jamie Ryan Kiros and Jamie Ryan Kiros and
Geoffrey Hinton}, Geoffrey Hinton},
title = {Layer Normalization}, title = {Layer Normalization},
journal = {CoRR}, publisher = {CoRR},
volume = {abs/1607.06450}, volume = {abs/1607.06450},
year = {2016} year = {2016}
} }
...@@ -4034,10 +4031,10 @@ year = {2012} ...@@ -4034,10 +4031,10 @@ year = {2012}
year = {2014} year = {2014}
} }
@article{2011Natural, @inproceedings{2011Natural,
title={Natural Language Processing (almost) from Scratch}, title={Natural Language Processing (almost) from Scratch},
author={ Collobert, Ronan and Weston, Jason and Bottou, Léon and Karlen, Michael and Kavukcuoglu, Koray and Kuksa, Pavel }, author={ Collobert, Ronan and Weston, Jason and Bottou, Léon and Karlen, Michael and Kavukcuoglu, Koray and Kuksa, Pavel },
journal={Journal of Machine Learning Research}, publisher={Journal of Machine Learning Research},
volume={12}, volume={12},
number={1}, number={1},
pages={2493-2537}, pages={2493-2537},
...@@ -4049,7 +4046,7 @@ year = {2012} ...@@ -4049,7 +4046,7 @@ year = {2012}
Caiming Xiong and Caiming Xiong and
Richard Socher}, Richard Socher},
title = {Learned in Translation: Contextualized Word Vectors}, title = {Learned in Translation: Contextualized Word Vectors},
booktitle = {Conference on Neural Information Processing Systems}, publisher = {Conference on Neural Information Processing Systems},
pages = {6294--6305}, pages = {6294--6305},
year = {2017} year = {2017}
} }
...@@ -4069,7 +4066,7 @@ year = {2012} ...@@ -4069,7 +4066,7 @@ year = {2012}
} }
@article{Graves2013HybridSR, @inproceedings{Graves2013HybridSR,
title={Hybrid speech recognition with Deep Bidirectional LSTM}, title={Hybrid speech recognition with Deep Bidirectional LSTM},
author={Alex Graves and author={Alex Graves and
Navdeep Jaitly and Navdeep Jaitly and
...@@ -4115,30 +4112,30 @@ year = {2012} ...@@ -4115,30 +4112,30 @@ year = {2012}
publisher={AAAI Conference on Artificial Intelligence}, publisher={AAAI Conference on Artificial Intelligence},
year={2016} year={2016}
} }
@article{Ahn2016ANK, @inproceedings{Ahn2016ANK,
title={A Neural Knowledge Language Model}, title={A Neural Knowledge Language Model},
author={Sungjin Ahn and author={Sungjin Ahn and
Heeyoul Choi and Heeyoul Choi and
Tanel P{\"a}rnamaa and Tanel P{\"a}rnamaa and
Yoshua Bengio}, Yoshua Bengio},
journal={arXiv preprint arXiv:1608.00318}, publisher={arXiv preprint arXiv:1608.00318},
year={2016} year={2016}
} }
@article{Wang2015LargerContextLM, @inproceedings{Wang2015LargerContextLM,
title={Larger-Context Language Modelling}, title={Larger-Context Language Modelling},
author={Tian Wang and author={Tian Wang and
Kyunghyun Cho}, Kyunghyun Cho},
journal={Annual Meeting of the Association for Computational Linguistics}, publisher={Annual Meeting of the Association for Computational Linguistics},
year={2015} year={2015}
} }
@article{Adel2015SyntacticAS, @inproceedings{Adel2015SyntacticAS,
title={Syntactic and Semantic Features For Code-Switching Factored Language Models}, title={Syntactic and Semantic Features For Code-Switching Factored Language Models},
author={Heike Adel and author={Heike Adel and
Ngoc Vu and Ngoc Vu and
Katrin Kirchhoff and Katrin Kirchhoff and
Dominic Telaar and Dominic Telaar and
Tanja Schultz}, Tanja Schultz},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, publisher={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
year={2015}, year={2015},
volume={23}, volume={23},
pages={431-440} pages={431-440}
...@@ -4164,14 +4161,14 @@ year = {2012} ...@@ -4164,14 +4161,14 @@ year = {2012}
} }
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%% Deep-reading additions and revisions; pending review %%%%%%%%%%%%%%%%%%%
@article{moraffah2020causal, @inproceedings{moraffah2020causal,
title={Causal Interpretability for Machine Learning-Problems, Methods and Evaluation}, title={Causal Interpretability for Machine Learning-Problems, Methods and Evaluation},
author={Raha Moraffah and author={Raha Moraffah and
Mansooreh Karami and Mansooreh Karami and
Ruocheng Guo and Ruocheng Guo and
Adrienne Raglin and Adrienne Raglin and
Huan Liu}, Huan Liu},
journal={ACM SIGKDD Conference on Knowledge Discovery and Data Mining}, publisher={ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
volume={22}, volume={22},
number={1}, number={1},
pages={18--33}, pages={18--33},
...@@ -4209,11 +4206,11 @@ year = {2012} ...@@ -4209,11 +4206,11 @@ year = {2012}
year={2019} year={2019}
} }
@article{currey2018multi, @inproceedings{currey2018multi,
title={Multi-source syntactic neural machine translation}, title={Multi-source syntactic neural machine translation},
author={Anna Currey and author={Anna Currey and
Kenneth Heafield}, Kenneth Heafield},
journal={Conference on Empirical Methods in Natural Language Processing}, publisher={Conference on Empirical Methods in Natural Language Processing},
year={2018} year={2018}
} }
@inproceedings{marevcek2018extracting, @inproceedings{marevcek2018extracting,
...@@ -4224,10 +4221,10 @@ year = {2012} ...@@ -4224,10 +4221,10 @@ year = {2012}
pages={347--349}, pages={347--349},
year={2018} year={2018}
} }
@article{blevins2018deep, @inproceedings{blevins2018deep,
title={Deep rnns encode soft hierarchical syntax}, title={Deep rnns encode soft hierarchical syntax},
author={Blevins, Terra and Levy, Omer and Zettlemoyer, Luke}, author={Blevins, Terra and Levy, Omer and Zettlemoyer, Luke},
journal={Annual Meeting of the Association for Computational Linguistics}, publisher={Annual Meeting of the Association for Computational Linguistics},
year={2018} year={2018}
} }
@inproceedings{Yin2018StructVAETL, @inproceedings{Yin2018StructVAETL,
...@@ -4239,11 +4236,11 @@ year = {2012} ...@@ -4239,11 +4236,11 @@ year = {2012}
publisher={Annual Meeting of the Association for Computational Linguistics}, publisher={Annual Meeting of the Association for Computational Linguistics},
year={2018} year={2018}
} }
@article{Aharoni2017TowardsSN, @inproceedings{Aharoni2017TowardsSN,
title={Towards String-To-Tree Neural Machine Translation}, title={Towards String-To-Tree Neural Machine Translation},
author={Roee Aharoni and author={Roee Aharoni and
Yoav Goldberg}, Yoav Goldberg},
journal={Annual Meeting of the Association for Computational Linguistics}, publisher={Annual Meeting of the Association for Computational Linguistics},
year={2017} year={2017}
} }
...@@ -4257,31 +4254,31 @@ year = {2012} ...@@ -4257,31 +4254,31 @@ year = {2012}
year={2017} year={2017}
} }
@article{KoncelKedziorski2019TextGF, @inproceedings{KoncelKedziorski2019TextGF,
title={Text Generation from Knowledge Graphs with Graph Transformers}, title={Text Generation from Knowledge Graphs with Graph Transformers},
author={Rik Koncel-Kedziorski and author={Rik Koncel-Kedziorski and
Dhanush Bekal and Yi Luan and Dhanush Bekal and Yi Luan and
Mirella Lapata and Mirella Lapata and
Hannaneh Hajishirzi}, Hannaneh Hajishirzi},
journal={Annual Conference of the North American Chapter of the Association for Computational Linguistics}, publisher={Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year={2019} year={2019}
} }
@article{Kovalerchuk2020SurveyOE, @inproceedings{Kovalerchuk2020SurveyOE,
title={Survey of explainable machine learning with visual and granular methods beyond quasi-explanations}, title={Survey of explainable machine learning with visual and granular methods beyond quasi-explanations},
author={Boris Kovalerchuk and author={Boris Kovalerchuk and
Muhammad Ahmad and Muhammad Ahmad and
Ankur Teredesai}, Ankur Teredesai},
journal={ArXiv}, publisher={ArXiv},
year={2020}, year={2020},
volume={abs/2009.10221} volume={abs/2009.10221}
} }
@article{DoshiVelez2017TowardsAR, @inproceedings{DoshiVelez2017TowardsAR,
title={Towards A Rigorous Science of Interpretable Machine Learning}, title={Towards A Rigorous Science of Interpretable Machine Learning},
author={Finale Doshi-Velez and author={Finale Doshi-Velez and
Been Kim}, Been Kim},
journal={arXiv preprint arXiv:1702.08608}, publisher={arXiv preprint arXiv:1702.08608},
year={2017} year={2017}
} }
...@@ -4301,16 +4298,15 @@ year = {2012} ...@@ -4301,16 +4298,15 @@ year = {2012}
year = {2018} year = {2018}
} }
@inproceedings{Zeiler2012ADADELTAAA,
author = {Matthew D. Zeiler},
title = {ADADELTA: An Adaptive Learning Rate Method},
publisher = {arXiv preprint arXiv:1212.5701},
year = {2012}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% chapter 9------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% chapter 10------------------------------------------------------
@inproceedings{vaswani2017attention, @inproceedings{vaswani2017attention,
...@@ -5895,7 +5891,7 @@ author = {Yoshua Bengio and ...@@ -5895,7 +5891,7 @@ author = {Yoshua Bengio and
@inproceedings{garcia-martinez2016factored, @inproceedings{garcia-martinez2016factored,
title={Factored Neural Machine Translation Architectures}, title={Factored Neural Machine Translation Architectures},
author={Mercedes {Garcia-Martinez} and Loïc {Barrault} and Fethi {Bougares}}, author={Mercedes {Garcia-Martinez} and Loïc {Barrault} and Fethi {Bougares}},
booktitle={International Workshop on Spoken Language Translation (IWSLT'16)}, publisher={International Workshop on Spoken Language Translation (IWSLT'16)},
notes={Sourced from Microsoft Academic - https://academic.microsoft.com/paper/2949810612}, notes={Sourced from Microsoft Academic - https://academic.microsoft.com/paper/2949810612},
year={2016} year={2016}
} }
...@@ -6240,7 +6236,6 @@ author = {Yoshua Bengio and
%%%%% chapter 13------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% chapter 14------------------------------------------------------
@inproceedings{Koehn2007Moses, @inproceedings{Koehn2007Moses,
...@@ -6330,13 +6325,13 @@ author = {Yoshua Bengio and ...@@ -6330,13 +6325,13 @@ author = {Yoshua Bengio and
year = {2016} year = {2016}
} }
@article{Stahlberg2018TheUO, @inproceedings{Stahlberg2018TheUO,
title={The University of Cambridge's Machine Translation Systems for WMT18}, title={The University of Cambridge's Machine Translation Systems for WMT18},
author={Felix Stahlberg and author={Felix Stahlberg and
Adri{\`{a}} de Gispert and Adri{\`{a}} de Gispert and
Bill Byrne}, Bill Byrne},
pages = {504--512}, pages = {504--512},
journal = {Annual Meeting of the Association for Computational Linguistics}, publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2018} year = {2018}
} }
...@@ -6353,13 +6348,13 @@ author = {Yoshua Bengio and ...@@ -6353,13 +6348,13 @@ author = {Yoshua Bengio and
year = {2018} year = {2018}
} }
@article{Li2017EnhancedNM, @inproceedings{Li2017EnhancedNM,
title={Enhanced neural machine translation by learning from draft}, title={Enhanced neural machine translation by learning from draft},
author={Aodong Li and author={Aodong Li and
Shiyue Zhang and Shiyue Zhang and
Dong Wang and Dong Wang and
Thomas Fang Zheng}, Thomas Fang Zheng},
journal={IEEE Asia-Pacific Services Computing Conference}, publisher={IEEE Asia-Pacific Services Computing Conference},
year={2017}, year={2017},
pages={1583-1587} pages={1583-1587}
} }
...@@ -6383,11 +6378,11 @@ author = {Yoshua Bengio and ...@@ -6383,11 +6378,11 @@ author = {Yoshua Bengio and
year={2018} year={2018}
} }
@article{Lee2018DeterministicNN, @inproceedings{Lee2018DeterministicNN,
title={Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement}, title={Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement},
author={Jason Lee and Elman Mansimov and Kyunghyun Cho}, author={Jason Lee and Elman Mansimov and Kyunghyun Cho},
pages = {1173--1182}, pages = {1173--1182},
journal = {Conference on Empirical Methods in Natural Language Processing}, publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2018} year = {2018}
} }
...@@ -6407,11 +6402,11 @@ author = {Yoshua Bengio and ...@@ -6407,11 +6402,11 @@ author = {Yoshua Bengio and
year = {2020} year = {2020}
} }
@article{Stahlberg2018AnOS, @inproceedings{Stahlberg2018AnOS,
title={An Operation Sequence Model for Explainable Neural Machine Translation}, title={An Operation Sequence Model for Explainable Neural Machine Translation},
author={Felix Stahlberg and Danielle Saunders and Bill Byrne}, author={Felix Stahlberg and Danielle Saunders and Bill Byrne},
pages = {175--186}, pages = {175--186},
journal = {Conference on Empirical Methods in Natural Language Processing}, publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2018} year = {2018}
} }
...@@ -6423,15 +6418,15 @@ author = {Yoshua Bengio and ...@@ -6423,15 +6418,15 @@ author = {Yoshua Bengio and
year={2019} year={2019}
} }
@article{stling2017NeuralMT, @inproceedings{stling2017NeuralMT,
title={Neural machine translation for low-resource languages}, title={Neural machine translation for low-resource languages},
author={Robert {\"O}stling and J{\"{o}}rg Tiedemann}, author={Robert {\"O}stling and J{\"{o}}rg Tiedemann},
journal={CoRR}, publisher={CoRR},
year={2017}, year={2017},
volume={abs/1708.05729} volume={abs/1708.05729}
} }
@article{Kikuchi2016ControllingOL, @inproceedings{Kikuchi2016ControllingOL,
title={Controlling Output Length in Neural Encoder-Decoders}, title={Controlling Output Length in Neural Encoder-Decoders},
author={Yuta Kikuchi and author={Yuta Kikuchi and
Graham Neubig and Graham Neubig and
...@@ -6439,7 +6434,7 @@ author = {Yoshua Bengio and ...@@ -6439,7 +6434,7 @@ author = {Yoshua Bengio and
Hiroya Takamura and Hiroya Takamura and
Manabu Okumura}, Manabu Okumura},
pages = {1328--1338}, pages = {1328--1338},
journal = {Conference on Empirical Methods in Natural Language Processing}, publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2016} year = {2016}
} }
...@@ -6460,11 +6455,11 @@ author = {Yoshua Bengio and ...@@ -6460,11 +6455,11 @@ author = {Yoshua Bengio and
year = {2018} year = {2018}
} }
@article{Sountsov2016LengthBI, @inproceedings{Sountsov2016LengthBI,
title={Length bias in Encoder Decoder Models and a Case for Global Conditioning}, title={Length bias in Encoder Decoder Models and a Case for Global Conditioning},
author={Pavel Sountsov and Sunita Sarawagi}, author={Pavel Sountsov and Sunita Sarawagi},
pages = {1516--1525}, pages = {1516--1525},
journal = {Conference on Empirical Methods in Natural Language Processing}, publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2016} year = {2016}
} }
...@@ -6534,13 +6529,13 @@ author = {Yoshua Bengio and ...@@ -6534,13 +6529,13 @@ author = {Yoshua Bengio and
year = {2018} year = {2018}
} }
@article{Ma2019LearningTS, @inproceedings{Ma2019LearningTS,
title={Learning to Stop in Structured Prediction for Neural Machine Translation}, title={Learning to Stop in Structured Prediction for Neural Machine Translation},
author={Mingbo Ma and author={Mingbo Ma and
Renjie Zheng and Renjie Zheng and
Liang Huang}, Liang Huang},
pages = {1884--1889}, pages = {1884--1889},
journal = { Annual Conference of the North American Chapter of the Association for Computational Linguistics}, publisher = { Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2019} year = {2019}
} }
...@@ -6612,10 +6607,10 @@ author = {Yoshua Bengio and ...@@ -6612,10 +6607,10 @@ author = {Yoshua Bengio and
year={2013} year={2013}
} }
@article{Li2016MutualIA, @inproceedings{Li2016MutualIA,
title={Mutual Information and Diverse Decoding Improve Neural Machine Translation}, title={Mutual Information and Diverse Decoding Improve Neural Machine Translation},
author={Jiwei Li and Dan Jurafsky}, author={Jiwei Li and Dan Jurafsky},
journal={CoRR}, publisher={CoRR},
year={2016}, year={2016},
volume={abs/1601.00372} volume={abs/1601.00372}
} }
...@@ -6640,19 +6635,19 @@ author = {Yoshua Bengio and ...@@ -6640,19 +6635,19 @@ author = {Yoshua Bengio and
year = {2018} year = {2018}
} }
@article{Shen2019MixtureMF, @inproceedings{Shen2019MixtureMF,
title={Mixture Models for Diverse Machine Translation: Tricks of the Trade}, title={Mixture Models for Diverse Machine Translation: Tricks of the Trade},
author={Tianxiao Shen and Myle Ott and Michael Auli and Marc'Aurelio Ranzato}, author={Tianxiao Shen and Myle Ott and Michael Auli and Marc'Aurelio Ranzato},
pages = {5719--5728}, pages = {5719--5728},
journal = {International Conference on Machine Learning}, publisher = {International Conference on Machine Learning},
year = {2019}, year = {2019},
} }
@article{Wu2020GeneratingDT, @inproceedings{Wu2020GeneratingDT,
title={Generating Diverse Translation from Model Distribution with Dropout}, title={Generating Diverse Translation from Model Distribution with Dropout},
author={Xuanfu Wu and Yang Feng and Chenze Shao}, author={Xuanfu Wu and Yang Feng and Chenze Shao},
pages={1088--1097}, pages={1088--1097},
journal={Annual Meeting of the Association for Computational Linguistics}, publisher={Annual Meeting of the Association for Computational Linguistics},
year={2020} year={2020}
} }
...@@ -6664,7 +6659,7 @@ author = {Yoshua Bengio and ...@@ -6664,7 +6659,7 @@ author = {Yoshua Bengio and
year={2020} year={2020}
} }
@article{Vijayakumar2016DiverseBS, @inproceedings{Vijayakumar2016DiverseBS,
title={Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models}, title={Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models},
author={Ashwin K. Vijayakumar and author={Ashwin K. Vijayakumar and
Michael Cogswell and Michael Cogswell and
...@@ -6673,7 +6668,7 @@ author = {Yoshua Bengio and ...@@ -6673,7 +6668,7 @@ author = {Yoshua Bengio and
Stefan Lee and Stefan Lee and
David J. Crandall and David J. Crandall and
Dhruv Batra}, Dhruv Batra},
journal={CoRR}, publisher={CoRR},
year={2016}, year={2016},
volume={abs/1610.02424} volume={abs/1610.02424}
} }
...@@ -6715,41 +6710,41 @@ author = {Yoshua Bengio and ...@@ -6715,41 +6710,41 @@ author = {Yoshua Bengio and
year={2017} year={2017}
} }
@article{StahlbergNeural, @inproceedings{StahlbergNeural,
title={Neural Machine Translation: A Review}, title={Neural Machine Translation: A Review},
author={Felix Stahlberg}, author={Felix Stahlberg},
journal={Journal of Artificial Intelligence Research}, publisher={Journal of Artificial Intelligence Research},
year={2020}, year={2020},
volume={69}, volume={69},
pages={343-418} pages={343-418}
} }
@article{Ranzato2016SequenceLT, @inproceedings{Ranzato2016SequenceLT,
title={Sequence Level Training with Recurrent Neural Networks}, title={Sequence Level Training with Recurrent Neural Networks},
author={Marc'Aurelio Ranzato and author={Marc'Aurelio Ranzato and
Sumit Chopra and Sumit Chopra and
Michael Auli and Michael Auli and
Wojciech Zaremba}, Wojciech Zaremba},
journal={International Conference on Learning Representations}, publisher={International Conference on Learning Representations},
year={2016} year={2016}
} }
@article{Bengio2015ScheduledSF, @inproceedings{Bengio2015ScheduledSF,
title={Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks}, title={Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks},
author={Samy Bengio and author={Samy Bengio and
Oriol Vinyals and Oriol Vinyals and
Navdeep Jaitly and Navdeep Jaitly and
Noam Shazeer}, Noam Shazeer},
journal = {Conference and Workshop on Neural Information Processing Systems}, publisher = {Conference and Workshop on Neural Information Processing Systems},
pages = {1171--1179}, pages = {1171--1179},
year = {2015} year = {2015}
} }
@article{Zhang2019BridgingTG, @inproceedings{Zhang2019BridgingTG,
title={Bridging the Gap between Training and Inference for Neural Machine Translation}, title={Bridging the Gap between Training and Inference for Neural Machine Translation},
author={Wen Zhang and Yang Feng and Fandong Meng and Di You and Qun Liu}, author={Wen Zhang and Yang Feng and Fandong Meng and Di You and Qun Liu},
pages = {4334--4343}, pages = {4334--4343},
journal = {Annual Meeting of the Association for Computational Linguistics}, publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019} year = {2019}
} }
...@@ -6807,31 +6802,31 @@ author = {Yoshua Bengio and ...@@ -6807,31 +6802,31 @@ author = {Yoshua Bengio and
year = {2012} year = {2012}
} }
@article{Narang2017BlockSparseRN, @inproceedings{Narang2017BlockSparseRN,
title={Block-Sparse Recurrent Neural Networks}, title={Block-Sparse Recurrent Neural Networks},
author={Sharan Narang and Eric Undersander and Gregory Diamos}, author={Sharan Narang and Eric Undersander and Gregory Diamos},
journal={CoRR}, publisher={CoRR},
year={2017}, year={2017},
volume={abs/1711.02782} volume={abs/1711.02782}
} }
@article{Gale2019TheSO, @inproceedings{Gale2019TheSO,
title={The State of Sparsity in Deep Neural Networks}, title={The State of Sparsity in Deep Neural Networks},
author={Trevor Gale and author={Trevor Gale and
Erich Elsen and Erich Elsen and
Sara Hooker}, Sara Hooker},
journal={CoRR}, publisher={CoRR},
year={2019}, year={2019},
volume={abs/1902.09574} volume={abs/1902.09574}
} }
@inproceedings{Michel2019AreSH,
title = {Are Sixteen Heads Really Better than One?},
author = {Paul Michel and
Omer Levy and
Graham Neubig},
publisher = {Conference and Workshop on Neural Information Processing Systems},
pages = {14014--14024},
year = {2019}
}
...@@ -6849,31 +6844,31 @@ author = {Yoshua Bengio and ...@@ -6849,31 +6844,31 @@ author = {Yoshua Bengio and
year = {2019}, year = {2019},
} }
@article{Kitaev2020ReformerTE, @inproceedings{Kitaev2020ReformerTE,
author = {Nikita Kitaev and author = {Nikita Kitaev and
Lukasz Kaiser and Lukasz Kaiser and
Anselm Levskaya}, Anselm Levskaya},
title = {Reformer: The Efficient Transformer}, title = {Reformer: The Efficient Transformer},
journal = {International Conference on Learning Representations}, publisher = {International Conference on Learning Representations},
year = {2020} year = {2020}
} }
@inproceedings{Katharopoulos2020TransformersAR,
title={Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention},
author={Angelos Katharopoulos and Apoorv Vyas and Nikolaos Pappas and Fran{\c{c}}ois Fleuret},
publisher={CoRR},
year={2020},
volume={abs/2006.16236}
}
@article{xiao2011language, @inproceedings{xiao2011language,
title ={Language Modeling for Syntax-Based Machine Translation Using Tree Substitution Grammars: A Case Study on Chinese-English Translation}, title ={Language Modeling for Syntax-Based Machine Translation Using Tree Substitution Grammars: A Case Study on Chinese-English Translation},
author ={Xiao, Tong and Zhu, Jingbo and Zhu, Muhua}, author ={Xiao, Tong and Zhu, Jingbo and Zhu, Muhua},
volume ={10}, volume ={10},
number ={4}, number ={4},
pages ={1--29}, pages ={1--29},
year ={2011}, year ={2011},
journal ={ACM Transactions on Asian Language Information Processing (TALIP)} publisher ={ACM Transactions on Asian Language Information Processing (TALIP)}
} }
@inproceedings{Li2009VariationalDF, @inproceedings{Li2009VariationalDF,
...@@ -6886,30 +6881,30 @@ author = {Yoshua Bengio and ...@@ -6886,30 +6881,30 @@ author = {Yoshua Bengio and
year={2009} year={2009}
} }
@article{Bastings2019ModelingLS, @inproceedings{Bastings2019ModelingLS,
title={Modeling Latent Sentence Structure in Neural Machine Translation}, title={Modeling Latent Sentence Structure in Neural Machine Translation},
author={Jasmijn Bastings and author={Jasmijn Bastings and
Wilker Aziz and Wilker Aziz and
Ivan Titov and Ivan Titov and
Khalil Sima'an}, Khalil Sima'an},
journal = {CoRR}, publisher = {CoRR},
volume = {abs/1901.06436}, volume = {abs/1901.06436},
year = {2019} year = {2019}
} }
@article{Shah2018GenerativeNM, @inproceedings{Shah2018GenerativeNM,
title={Generative Neural Machine Translation}, title={Generative Neural Machine Translation},
author={Harshil Shah and author={Harshil Shah and
David Barber}, David Barber},
journal={Conference and Workshop on Neural Information Processing Systems}, publisher={Conference and Workshop on Neural Information Processing Systems},
pages={1353--1362}, pages={1353--1362},
year={2018} year={2018}
} }
@article{Su2018VariationalRN, @inproceedings{Su2018VariationalRN,
title={Variational Recurrent Neural Machine Translation}, title={Variational Recurrent Neural Machine Translation},
author={Jinsong Su and Shan Wu and Deyi Xiong and Yaojie Lu and Xianpei Han and Biao Zhang}, author={Jinsong Su and Shan Wu and Deyi Xiong and Yaojie Lu and Xianpei Han and Biao Zhang},
journal={AAAI Conference on Artificial Intelligence}, publisher={AAAI Conference on Artificial Intelligence},
pages={5488--5495}, pages={5488--5495},
year={2018} year={2018}
} }
...@@ -6948,15 +6943,15 @@ author = {Yoshua Bengio and ...@@ -6948,15 +6943,15 @@ author = {Yoshua Bengio and
year={2019} year={2019}
} }
@article{Akoury2019SyntacticallyST, @inproceedings{Akoury2019SyntacticallyST,
title={Syntactically Supervised Transformers for Faster Neural Machine Translation}, title={Syntactically Supervised Transformers for Faster Neural Machine Translation},
author={Nader Akoury and Kalpesh Krishna and Mohit Iyyer}, author={Nader Akoury and Kalpesh Krishna and Mohit Iyyer},
pages = {1269--1281}, pages = {1269--1281},
journal = {Annual Meeting of the Association for Computational Linguistics}, publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}, year = {2019},
} }
@article{Guo2020FineTuningBC, @inproceedings{Guo2020FineTuningBC,
title={Fine-Tuning by Curriculum Learning for Non-Autoregressive Neural Machine Translation}, title={Fine-Tuning by Curriculum Learning for Non-Autoregressive Neural Machine Translation},
author={Junliang Guo and author={Junliang Guo and
Xu Tan and Xu Tan and
...@@ -6965,7 +6960,7 @@ author = {Yoshua Bengio and ...@@ -6965,7 +6960,7 @@ author = {Yoshua Bengio and
Enhong Chen and Enhong Chen and
Tie-Yan Liu}, Tie-Yan Liu},
pages = {7839--7846}, pages = {7839--7846},
journal = {AAAI Conference on Artificial Intelligence}, publisher = {AAAI Conference on Artificial Intelligence},
year = {2020} year = {2020}
} }
...@@ -6977,7 +6972,7 @@ author = {Yoshua Bengio and ...@@ -6977,7 +6972,7 @@ author = {Yoshua Bengio and
year={2020} year={2020}
} }
@article{Liu2020FastBERTAS, @inproceedings{Liu2020FastBERTAS,
title={FastBERT: a Self-distilling BERT with Adaptive Inference Time}, title={FastBERT: a Self-distilling BERT with Adaptive Inference Time},
author={Weijie Liu and author={Weijie Liu and
Peng Zhou and Peng Zhou and
...@@ -6986,25 +6981,25 @@ author = {Yoshua Bengio and ...@@ -6986,25 +6981,25 @@ author = {Yoshua Bengio and
Haotang Deng and Haotang Deng and
Qi Ju}, Qi Ju},
pages = {6035--6044}, pages = {6035--6044},
journal = {Annual Meeting of the Association for Computational Linguistics}, publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020} year = {2020}
} }
@article{Elbayad2020DepthAdaptiveT, @inproceedings{Elbayad2020DepthAdaptiveT,
title={Depth-Adaptive Transformer}, title={Depth-Adaptive Transformer},
author={Maha Elbayad and author={Maha Elbayad and
Jiatao Gu and Jiatao Gu and
Edouard Grave and Edouard Grave and
Michael Auli}, Michael Auli},
journal={International Conference on Learning Representations}, publisher={International Conference on Learning Representations},
year={2020} year={2020}
} }
@article{Lan2020ALBERTAL, @inproceedings{Lan2020ALBERTAL,
title={ALBERT: A Lite BERT for Self-supervised Learning of Language Representations}, title={ALBERT: A Lite BERT for Self-supervised Learning of Language Representations},
author={Zhenzhong Lan and Mingda Chen and Sebastian Goodman and Kevin Gimpel and Piyush Sharma and Radu Soricut}, author={Zhenzhong Lan and Mingda Chen and Sebastian Goodman and Kevin Gimpel and Piyush Sharma and Radu Soricut},
journal={International Conference on Learning Representations}, publisher={International Conference on Learning Representations},
year={2020} year={2020}
} }
...@@ -7019,46 +7014,46 @@ author = {Yoshua Bengio and ...@@ -7019,46 +7014,46 @@ author = {Yoshua Bengio and
year={2015} year={2015}
} }
@article{Lee2019SNIPSN, @inproceedings{Lee2019SNIPSN,
author = {Namhoon Lee and author = {Namhoon Lee and
Thalaiyasingam Ajanthan and Thalaiyasingam Ajanthan and
Philip H. S. Torr}, Philip H. S. Torr},
title = {Snip: single-Shot Network Pruning based on Connection sensitivity}, title = {Snip: single-Shot Network Pruning based on Connection sensitivity},
journal = {International Conference on Learning Representations}, publisher = {International Conference on Learning Representations},
year = {2019}, year = {2019},
} }
@article{Frankle2019TheLT, @inproceedings{Frankle2019TheLT,
title={The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks}, title={The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks},
author={Jonathan Frankle and Michael Carbin}, author={Jonathan Frankle and Michael Carbin},
journal={International Conference on Learning Representations}, publisher={International Conference on Learning Representations},
year={2019} year={2019}
} }
@article{Brix2020SuccessfullyAT, @inproceedings{Brix2020SuccessfullyAT,
author = {Christopher Brix and author = {Christopher Brix and
Parnia Bahar and Parnia Bahar and
Hermann Ney}, Hermann Ney},
title = {Successfully Applying the Stabilized Lottery Ticket Hypothesis to title = {Successfully Applying the Stabilized Lottery Ticket Hypothesis to
the Transformer Architecture}, the Transformer Architecture},
pages = {3909--3915}, pages = {3909--3915},
journal = {Annual Meeting of the Association for Computational Linguistics}, publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2020}, year = {2020},
} }
@article{Liu2019RethinkingTV, @inproceedings{Liu2019RethinkingTV,
title={Rethinking the Value of Network Pruning}, title={Rethinking the Value of Network Pruning},
author={Zhuang Liu and author={Zhuang Liu and
Mingjie Sun and Mingjie Sun and
Tinghui Zhou and Tinghui Zhou and
Gao Huang and Gao Huang and
Trevor Darrell}, Trevor Darrell},
journal={ArXiv}, publisher={ArXiv},
year={2019}, year={2019},
volume={abs/1810.05270} volume={abs/1810.05270}
} }
@article{Liu2017LearningEC, @inproceedings{Liu2017LearningEC,
author = {Zhuang Liu and author = {Zhuang Liu and
Jianguo Li and Jianguo Li and
Zhiqiang Shen and Zhiqiang Shen and
...@@ -7067,7 +7062,7 @@ author = {Zhuang Liu and ...@@ -7067,7 +7062,7 @@ author = {Zhuang Liu and
Changshui Zhang}, Changshui Zhang},
title = {Learning Efficient Convolutional Networks through Network Slimming}, title = {Learning Efficient Convolutional Networks through Network Slimming},
pages = {2755--2763}, pages = {2755--2763},
journal = {{IEEE} International Conference on Computer Vision}, publisher = {{IEEE} International Conference on Computer Vision},
year = {2017} year = {2017}
} }
...@@ -7082,34 +7077,34 @@ author = {Zhuang Liu and ...@@ -7082,34 +7077,34 @@ author = {Zhuang Liu and
year={2018} year={2018}
} }
@inproceedings{Hubara2017QuantizedNN,
title={Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations},
author={Itay Hubara and Matthieu Courbariaux and Daniel Soudry and Ran El-Yaniv and Yoshua Bengio},
publisher={Journal of Machine Learning Research},
year={2017},
volume={18},
pages={187:1-187:30}
}
@article{DBLP:journals/corr/HintonVD15, @inproceedings{DBLP:journals/corr/HintonVD15,
author = {Geoffrey E. Hinton and author = {Geoffrey E. Hinton and
Oriol Vinyals and Oriol Vinyals and
Jeffrey Dean}, Jeffrey Dean},
title = {Distilling the Knowledge in a Neural Network}, title = {Distilling the Knowledge in a Neural Network},
journal = {CoRR}, publisher = {CoRR},
volume = {abs/1503.02531}, volume = {abs/1503.02531},
year = {2015} year = {2015}
} }
@article{Munim2019SequencelevelKD, @inproceedings{Munim2019SequencelevelKD,
title={Sequence-level Knowledge Distillation for Model Compression of Attention-based Sequence-to-sequence Speech Recognition}, title={Sequence-level Knowledge Distillation for Model Compression of Attention-based Sequence-to-sequence Speech Recognition},
author={Raden Mu'az Mun'im and Nakamasa Inoue and Koichi Shinoda}, author={Raden Mu'az Mun'im and Nakamasa Inoue and Koichi Shinoda},
journal={{IEEE} International Conference on Acoustics, Speech and Signal Processing}, publisher={{IEEE} International Conference on Acoustics, Speech and Signal Processing},
year={2019}, year={2019},
pages={6151-6155} pages={6151-6155}
} }
@article{Tang2019DistillingTK, @inproceedings{Tang2019DistillingTK,
author = {Raphael Tang and author = {Raphael Tang and
Yao Lu and Yao Lu and
Linqing Liu and Linqing Liu and
...@@ -7118,7 +7113,7 @@ author = {Zhuang Liu and ...@@ -7118,7 +7113,7 @@ author = {Zhuang Liu and
Jimmy Lin}, Jimmy Lin},
title = {Distilling Task-Specific Knowledge from {BERT} into Simple Neural title = {Distilling Task-Specific Knowledge from {BERT} into Simple Neural
Networks}, Networks},
journal = {CoRR}, publisher = {CoRR},
volume = {abs/1903.12136}, volume = {abs/1903.12136},
year = {2019} year = {2019}
} }
...@@ -7138,13 +7133,13 @@ author = {Zhuang Liu and ...@@ -7138,13 +7133,13 @@ author = {Zhuang Liu and
year={2020} year={2020}
} }
@article{Ghazvininejad2020AlignedCE, @inproceedings{Ghazvininejad2020AlignedCE,
author = {Marjan Ghazvininejad and author = {Marjan Ghazvininejad and
Vladimir Karpukhin and Vladimir Karpukhin and
Luke Zettlemoyer and Luke Zettlemoyer and
Omer Levy}, Omer Levy},
title = {Aligned Cross Entropy for Non-Autoregressive Machine Translation}, title = {Aligned Cross Entropy for Non-Autoregressive Machine Translation},
journal = {CoRR}, publisher = {CoRR},
volume = {abs/2004.01655}, volume = {abs/2004.01655},
year = {2020}, year = {2020},
} }
...@@ -7187,14 +7182,14 @@ author = {Zhuang Liu and ...@@ -7187,14 +7182,14 @@ author = {Zhuang Liu and
year={2019} year={2019}
} }
@article{Ran2019GuidingNN, @inproceedings{Ran2019GuidingNN,
author = {Qiu Ran and author = {Qiu Ran and
Yankai Lin and Yankai Lin and
Peng Li and Peng Li and
Jie Zhou}, Jie Zhou},
title = {Guiding Non-Autoregressive Neural Machine Translation Decoding with title = {Guiding Non-Autoregressive Neural Machine Translation Decoding with
Reordering Information}, Reordering Information},
journal = {CoRR}, publisher = {CoRR},
volume = {abs/1911.02215}, volume = {abs/1911.02215},
year = {2019} year = {2019}
} }
...@@ -7218,10 +7213,10 @@ author = {Zhuang Liu and ...@@ -7218,10 +7213,10 @@ author = {Zhuang Liu and
year = {2018} year = {2018}
} }
@article{Zhou2020UnderstandingKD, @inproceedings{Zhou2020UnderstandingKD,
title={Understanding Knowledge Distillation in Non-autoregressive Machine Translation}, title={Understanding Knowledge Distillation in Non-autoregressive Machine Translation},
author={Chunting Zhou and Graham Neubig and Jiatao Gu}, author={Chunting Zhou and Graham Neubig and Jiatao Gu},
journal={ArXiv}, publisher={ArXiv},
year={2020}, year={2020},
volume={abs/1911.02727} volume={abs/1911.02727}
} }
@@ -7247,11 +7242,11 @@ author = {Zhuang Liu and
  year={2018}
}
-@article{Tu2020ENGINEEI,
+@inproceedings{Tu2020ENGINEEI,
  title={ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation},
  author={Lifu Tu and Richard Yuanzhe Pang and Sam Wiseman and Kevin Gimpel},
  pages={2819--2826},
- journal={Annual Meeting of the Association for Computational Linguistics},
+ publisher={Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}
@@ -7295,10 +7290,10 @@ author = {Zhuang Liu and
  year={2016}
}
-@article{Duan2017OneShotIL,
+@inproceedings{Duan2017OneShotIL,
  title={One-Shot Imitation Learning},
  author={Yan Duan and Marcin Andrychowicz and Bradly C. Stadie and Jonathan Ho and Jonas Schneider and Ilya Sutskever and Pieter Abbeel and Wojciech Zaremba},
- journal={CoRR},
+ publisher={CoRR},
  year={2017},
  volume={abs/1703.07326}
}
@@ -7308,7 +7303,7 @@ author = {Zhuang Liu and
  author={Chunqi Wang and
          Ji Zhang and
          Haiqing Chen},
- booktitle={Conference on Empirical Methods in Natural Language Processing},
+ publisher={Conference on Empirical Methods in Natural Language Processing},
  pages={479--488},
  year={2018}
}
@@ -7321,36 +7316,36 @@ author = {Zhuang Liu and
  year={2019}
}
-@article{Kasai2020NonAutoregressiveMT,
+@inproceedings{Kasai2020NonAutoregressiveMT,
  title={Non-Autoregressive Machine Translation with Disentangled Context Transformer},
  author={Jungo Kasai and J. Cross and Marjan Ghazvininejad and Jiatao Gu},
- journal={arXiv: Computation and Language},
+ publisher={arXiv: Computation and Language},
  year={2020}
}
-@article{Zhou2019SynchronousBN,
+@inproceedings{Zhou2019SynchronousBN,
  title={Synchronous Bidirectional Neural Machine Translation},
  author={Long Zhou and
          Jiajun Zhang and
          Chengqing Zong},
- journal={Transactions of the Association for Computational Linguistics},
+ publisher={Transactions of the Association for Computational Linguistics},
  year={2019},
  volume={7},
  pages={91-105}
}
-@article{devlin2019bert,
+@inproceedings{devlin2019bert,
  title={Bert: Pre-training of deep bidirectional transformers for language understanding},
  author={Devlin Jacob and Chang Ming-Wei and Lee Kenton and Toutanova Kristina},
  year={2019},
  pages = {4171--4186},
- journal = {Annual Meeting of the Association for Computational Linguistics},
+ publisher = {Annual Meeting of the Association for Computational Linguistics},
}
@inproceedings{Feng2016ImprovingAM,
  title={Improving Attention Modeling with Implicit Distortion and Fertility for Machine Translation},
  author={Shi Feng and Shujie Liu and Nan Yang and Mu Li and Ming Zhou and Kenny Q. Zhu},
- booktitle={International Conference on Computational Linguistics},
+ publisher={International Conference on Computational Linguistics},
  pages={3082--3092},
  year={2016}
}
@@ -7366,7 +7361,7 @@ author = {Zhuang Liu and
  year = {2016}
}
-@article{Wu2016GooglesNM,
+@inproceedings{Wu2016GooglesNM,
  title={Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation},
  author = {Yonghui Wu and
            Mike Schuster and
@@ -7399,7 +7394,7 @@ author = {Zhuang Liu and
            Greg Corrado and
            Macduff Hughes and
            Jeffrey Dean},
- journal = {CoRR},
+ publisher = {CoRR},
  year={2016},
  volume={abs/1609.08144}
}
@@ -7417,10 +7412,10 @@ author = {Zhuang Liu and
  year = {2018}
}
-@article{Peris2017InteractiveNM,
+@inproceedings{Peris2017InteractiveNM,
  title={Interactive neural machine translation},
  author={{\'A}lvaro Peris and Miguel Domingo and F. Casacuberta},
- journal={Computer Speech and Language},
+ publisher={Computer Speech and Language},
  year={2017},
  volume={45},
  pages={201-220}
@@ -7434,10 +7429,10 @@ author = {Zhuang Liu and
  year={2018}
}
-@article{Xiao2016ALA,
+@inproceedings{Xiao2016ALA,
  title={A Loss-Augmented Approach to Training Syntactic Machine Translation Systems},
  author={Tong Xiao and Derek F. Wong and Jingbo Zhu},
- journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
+ publisher={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2016},
  volume={24},
  pages={2069-2083}
@@ -7454,9 +7449,9 @@ author = {Zhuang Liu and
  year = {2015}
}
-@article{61115,
+@inproceedings{61115,
  author={Jianhua Lin},
- journal={IEEE Transactions on Information Theory},
+ publisher={IEEE Transactions on Information Theory},
  title={Divergence measures based on the Shannon entropy},
  year={1991},
  volume={37},
@@ -7531,7 +7526,7 @@ author = {Zhuang Liu and
  year = {2018}
}
-@article{DBLP:journals/corr/abs-1906-00532,
+@inproceedings{DBLP:journals/corr/abs-1906-00532,
  author = {Aishwarya Bhandare and
            Vamsi Sripathi and
            Deepthi Karkada and
@@ -7541,7 +7536,7 @@ author = {Zhuang Liu and
            Vikram Saletore},
  title = {Efficient 8-Bit Quantization of Transformer Neural Machine Language
           Translation Model},
- journal = {CoRR},
+ publisher = {CoRR},
  volume = {abs/1906.00532},
  year = {2019}
}
@@ -7563,12 +7558,12 @@ author = {Zhuang Liu and
  year = {2018}
}
-@article{DBLP:journals/corr/abs-1910-10485,
+@inproceedings{DBLP:journals/corr/abs-1910-10485,
  author = {Gabriele Prato and
            Ella Charlaix and
            Mehdi Rezagholizadeh},
  title = {Fully Quantized Transformer for Improved Translation},
- journal = {CoRR},
+ publisher = {CoRR},
  volume = {abs/1910.10485},
  year = {2019}
}
@@ -7585,12 +7580,12 @@ author = {Zhuang Liu and
  year = {2016}
}
-@article{DBLP:journals/jcss/FreundS97,
+@inproceedings{DBLP:journals/jcss/FreundS97,
  author = {Yoav Freund and
            Robert E. Schapire},
  title = {A Decision-Theoretic Generalization of On-Line Learning and an Application
           to Boosting},
- journal = {Journal of Computer and System Sciences},
+ publisher = {Journal of Computer and System Sciences},
  volume = {55},
  number = {1},
  pages = {119--139},
@@ -7654,20 +7649,20 @@ author = {Zhuang Liu and
  year = {2009}
}
-@article{DBLP:journals/corr/LiMJ16,
+@inproceedings{DBLP:journals/corr/LiMJ16,
  author = {Jiwei Li and
            Will Monroe and
            Dan Jurafsky},
  title = {A Simple, Fast Diverse Decoding Algorithm for Neural Generation},
- journal = {CoRR},
+ publisher = {CoRR},
  volume = {abs/1611.08562},
  year = {2016}
}
-@article{xiao2013bagging,
+@inproceedings{xiao2013bagging,
  title ={Bagging and boosting statistical machine translation systems},
  author ={Tong Xiao and Jingbo Zhu and Tongran Liu },
- journal ={Artificial Intelligence},
+ publisher ={Artificial Intelligence},
  volume ={195},
  pages ={496--527},
  year ={2013}
@@ -7743,14 +7738,14 @@ author = {Zhuang Liu and
  year = {2020}
}
-@article{DBLP:journals/corr/abs-2002-02925,
+@inproceedings{DBLP:journals/corr/abs-2002-02925,
  author = {Canwen Xu and
            Wangchunshu Zhou and
            Tao Ge and
            Furu Wei and
            Ming Zhou},
  title = {BERT-of-Theseus: Compressing {BERT} by Progressive Module Replacing},
- journal = {Conference on Empirical Methods in Natural Language Processing},
+ publisher = {Conference on Empirical Methods in Natural Language Processing},
  year = {2020}
}
@@ -7758,35 +7753,35 @@ author = {Zhuang Liu and
  author = {Alexei Baevski and
            Michael Auli},
  title = {Adaptive Input Representations for Neural Language Modeling},
- journal = {arXiv preprint arXiv:1809.10853},
+ publisher = {arXiv preprint arXiv:1809.10853},
  year = {2019}
}
-@article{DBLP:journals/corr/abs-2006-04768,
+@inproceedings{DBLP:journals/corr/abs-2006-04768,
  author = {Sinong Wang and
            Belinda Z. Li and
            Madian Khabsa and
            Han Fang and
            Hao Ma},
  title = {Linformer: Self-Attention with Linear Complexity},
- journal = {CoRR},
+ publisher = {CoRR},
  volume = {abs/2006.04768},
  year = {2020}
}
-@article{DBLP:journals/corr/abs-1911-12385,
+@inproceedings{DBLP:journals/corr/abs-1911-12385,
  author = {Sachin Mehta and
            Rik Koncel-Kedziorski and
            Mohammad Rastegari and
            Hannaneh Hajishirzi},
  title = {DeFINE: DEep Factorized INput Word Embeddings for Neural Sequence
           Modeling},
- journal = {CoRR},
+ publisher = {CoRR},
  volume = {abs/1911.12385},
  year = {2019}
}
-@article{DBLP:journals/corr/abs-1906-09777,
+@inproceedings{DBLP:journals/corr/abs-1906-09777,
  author = {Xindian Ma and
            Peng Zhang and
            Shuai Zhang and
@@ -7795,7 +7790,7 @@ author = {Zhuang Liu and
            Dawei Song and
            Ming Zhou},
  title = {A Tensorized Transformer for Language Modeling},
- journal = {CoRR},
+ publisher = {CoRR},
  volume = {abs/1906.09777},
  year = {2019}
}
@@ -7806,12 +7801,12 @@ author = {Zhuang Liu and
            Ruslan Salakhutdinov and
            Quoc V. Le},
  title = {Mixtape: Breaking the Softmax Bottleneck Efficiently},
- booktitle = {Conference on Neural Information Processing Systems},
+ publisher = {Conference on Neural Information Processing Systems},
  pages = {15922--15930},
  year = {2019}
}
-@article{DBLP:journals/corr/abs-2006-10369,
+@inproceedings{DBLP:journals/corr/abs-2006-10369,
  author = {Jungo Kasai and
            Nikolaos Pappas and
            Hao Peng and
@@ -7819,7 +7814,7 @@ author = {Zhuang Liu and
            Noah A. Smith},
  title = {Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff
           in Machine Translation},
- journal = {CoRR},
+ publisher = {CoRR},
  volume = {abs/2006.10369},
  year = {2020}
}
@@ -7839,13 +7834,13 @@ author = {Zhuang Liu and
  year = {2020}
}
-@article{DBLP:journals/corr/abs-2010-02416,
+@inproceedings{DBLP:journals/corr/abs-2010-02416,
  author = {Yi-Te Hsu and
            Sarthak Garg and
            Yi-Hsiu Liao and
            Ilya Chatsviorkin},
  title = {Efficient Inference For Neural Machine Translation},
- journal = {CoRR},
+ publisher = {CoRR},
  volume = {abs/2010.02416},
  year = {2020}
}
@@ -7895,14 +7890,14 @@ author = {Zhuang Liu and
  year={2018}
}
-@article{Bi2019MultiagentLF,
+@inproceedings{Bi2019MultiagentLF,
  title={Multi-agent Learning for Neural Machine Translation},
  author={Tianchi Bi and
          Hao Xiong and
          Zhongjun He and
          Hua Wu and
          Haifeng Wang},
- journal={arXiv preprint arXiv:1909.01101},
+ publisher={arXiv preprint arXiv:1909.01101},
  year={2019}
}
@@ -7961,10 +7956,10 @@ author = {Zhuang Liu and
  year={2003}
}
-@article{Gage1994ANA,
+@inproceedings{Gage1994ANA,
  title={A new algorithm for data compression},
  author={P. Gage},
- journal={The C Users Journal archive},
+ publisher={The C Users Journal archive},
  year={1994},
  volume={12},
  pages={23-38}
@@ -7977,10 +7972,10 @@ author = {Zhuang Liu and
  year={2011}
}
-@article{Kazimi2017CoverageFC,
+@inproceedings{Kazimi2017CoverageFC,
  title={Coverage for Character Based Neural Machine Translation},
  author={M. Kazimi and Marta R. Costa-juss{\`a}},
- journal={arXiv preprint arXiv:1810.02340},
+ publisher={arXiv preprint arXiv:1810.02340},
  year={2017},
  volume={59},
  pages={99-106}
@@ -8022,7 +8017,6 @@ author = {Zhuang Liu and
}
%%%%% chapter 14------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% chapter 15------------------------------------------------------
@@ -8364,7 +8358,7 @@ author = {Zhuang Liu and
@inproceedings{Real2019AgingEF,
  title={Aging Evolution for Image Classifier Architecture Search},
  author={Esteban Real and Alok Aggarwal and Yanping Huang and Quoc V. Le },
- booktitle={AAAI Conference on Artificial Intelligence},
+ publisher={AAAI Conference on Artificial Intelligence},
  year={2019}
}