wording (sec 13)

Commit 154466b9 authored Jan 04, 2021 by xiaotong
--- a/Chapter13/Figures/figure-framework-of-Adversarial-Neural-machine-translation.tex
+++ b/Chapter13/Figures/figure-framework-of-Adversarial-Neural-machine-translation.tex
@@ -4,25 +4,25 @@

 \begin{tikzpicture}

-\tikzstyle{rnnnode} = [draw,inner sep=2pt,minimum width=4em,minimum height=2em,rounded corners=1pt,fill=yellow!20]
-\tikzstyle{snode} = [draw,inner sep=2pt,minimum width=4em,minimum height=2em,rounded corners=1pt,fill=red!20]
-\tikzstyle{wode} = [inner sep=0pt,minimum width=4em,minimum height=2em,rounded corners=0pt]
-
-\node [anchor=west,wode] (n1) at (0,0) {${y}_1,{y}_2,\ldots,{y}_n$};
-\node [anchor=north west,wode] (n2) at ([xshift=1em,yshift=0.5em]n1.south east) {${x}_1,{x}_2,\ldots,{x}_m$};
-\node [anchor=south west,rnnnode] (n3) at ([xshift=8em,yshift=0.5em]n2.north east) {generative model G};
-\node [anchor=south east,wode] (n4) at ([xshift=-2em,yshift=0em]n3.north west) {$\tilde{{y}}_{1},\tilde{{y}}_{2},...,\tilde{{y}}_{J}$};
-\node [anchor=south,snode] (n5) at ([xshift=0em,yshift=6em]n2.north) {discriminator network D};
+\tikzstyle{rnnnode} = [draw,inner sep=4pt,minimum width=2em,minimum height=2em,rounded corners=1pt,fill=yellow!20]
+\tikzstyle{snode} = [draw,inner sep=4pt,minimum width=2em,minimum height=2em,rounded corners=1pt,fill=red!20]
+\tikzstyle{wode} = [inner sep=0pt,minimum width=2em,minimum height=2em,rounded corners=0pt]
+
+\node [anchor=west,wode] (n1) at (0,0) {$y$};
+\node [anchor=north west,wode] (n2) at ([xshift=3em,yshift=-2.0em]n1.south east) {$x$};
+\node [anchor=south west,rnnnode] (n3) at ([xshift=8em,yshift=0.5em]n2.north east) {generative model $G$};
+\node [anchor=south east,wode] (n4) at ([xshift=-2em,yshift=0em]n3.north west) {$\tilde{y}$};
+\node [anchor=south,snode] (n5) at ([xshift=0em,yshift=6em]n2.north) {discriminator network $D$};
 \node [anchor=west,align=left,font=\small] (n6) at ([xshift=15em,yshift=-3em]n5.east) {Reward signal generated\\from $(\seq{x},\seq{\tilde{y}})$};


-\draw [->,thick] ([xshift=0em,yshift=0em]n1.north)--([xshift=0em,yshift=0em]n5.south);
-\draw [->,thick] ([xshift=0em,yshift=0em]n2.north)--([xshift=0em,yshift=0em]n5.south);
-\draw [->,thick] ([xshift=0em,yshift=0em]n4.west)--([xshift=0em,yshift=0em]n5.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n1.north)--([xshift=-0.3em,yshift=-0.1em]n5.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n2.north)--([xshift=0em,yshift=-0.1em]n5.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.west)--([xshift=0.3em,yshift=-0.1em]n5.south);
 \draw [->,thick] ([xshift=0em,yshift=0em]n3.north)--([xshift=0em,yshift=1em]n3.north)--([xshift=0em,yshift=0em]n4.east);


-\draw [->,thick] ([xshift=0em,yshift=0em]n5.east) --  ([xshift=14.5em,yshift=0em]n5.east) --  ([xshift=1em,yshift=0em]n3.east) --  ([xshift=0em,yshift=0em]n3.east);
+\draw [->,thick] ([xshift=0em,yshift=0em]n5.east) --  ([xshift=13.3em,yshift=0em]n5.east) --  ([xshift=1em,yshift=0em]n3.east) --  ([xshift=0em,yshift=0em]n3.east);

 \draw [->,thick] ([xshift=0em,yshift=0em]n2.east) --  ([xshift=0em,yshift=-1.5em]n3.south) --  ([xshift=0em,yshift=0em]n3.south);


--- a/Chapter13/Figures/figure-of-scheduling-sampling-method.tex
+++ b/Chapter13/Figures/figure-of-scheduling-sampling-method.tex
@@ -31,9 +31,9 @@
 \node [anchor=south,inner sep=2pt] (st2) at (n8.north) {\scriptsize{\textbf{[step $j$]}}};
 \node [anchor=south,inner sep=2pt] (st3) at (n14.north) {\scriptsize{\textbf{[step $1$]}}};

-\node [anchor=north,font=\tiny,rotate=90] (e1) at ([xshift=-2.7em,yshift=-1.1em]n3.south) {${(1-\epsilon_i)}^2$};
+\node [anchor=north,font=\tiny,rotate=90] (e1) at ([xshift=-2.7em,yshift=-1.1em]n3.south) {${1-\epsilon_i}$};
 %\node [anchor=north,font=\scriptsize] (e2) at ([xshift=2em,yshift=-0.1em]n3.south) {$\funp{P}=\epsilon_i$};
-%\node [anchor=north,font=\scriptsize] (e3) at ([xshift=-2em,yshift=-1em]n4.south) {$\funp{P}={(1-\epsilon_i)}^2$};
+%\node [anchor=north,font=\scriptsize] (e3) at ([xshift=-2em,yshift=-1em]n4.south) {$\funp{P}={1-\epsilon_i}$};
 \node [anchor=north,font=\tiny,rotate=90] (e4) at ([xshift=1.5em,yshift=-1.2em]n4.south) {$\epsilon_i$};

 %\node [anchor=south east,font=\small] (l1) at ([xshift=-1em,yshift=0.5em]n5.north west) {Loss};
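
Note on the label change in this hunk: at each position, scheduled sampling either feeds the reference token or the model's own prediction. These are complementary events, so their probabilities must sum to one, which is why $1-\epsilon_i$ rather than ${(1-\epsilon_i)}^2$ is the consistent label. A minimal Python sketch of that per-position choice follows. The function names, the linear decay schedule, and the string tokens are illustrative assumptions and are not taken from the book or from this commit.

```python
import random


def choose_previous_token(gold_prev, model_prev, epsilon, rng):
    """Pick the decoder input for the next step.

    Returns the reference token with probability epsilon and the
    model's own previous prediction with probability 1 - epsilon.
    """
    return gold_prev if rng.random() < epsilon else model_prev


def linear_decay(batch_index, total_batches, floor=0.0):
    """Hypothetical schedule: epsilon_i falls linearly from 1 towards `floor`."""
    return max(floor, 1.0 - batch_index / total_batches)


# Empirical check that the two branches are complementary.
rng = random.Random(0)
epsilon = 0.75
draws = [choose_previous_token("gold", "model", epsilon, rng) for _ in range(10000)]
gold_fraction = draws.count("gold") / len(draws)
model_fraction = draws.count("model") / len(draws)
```

Here `gold_fraction` should land close to `epsilon` and `model_fraction` close to `1 - epsilon`. Early batches (with `epsilon_i` near 1) behave like Teacher-forcing, and later batches increasingly expose the model to its own predictions.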

--- a/Chapter13/chapter13.tex
+++ b/Chapter13/chapter13.tex
@@ -459,7 +459,7 @@ Loss_{\textrm{robust}}(\theta_{\textrm{mt}}) &=&  \frac{1}{N}\sum_{(\mathbi{x},\

 \subsection{Non-Teacher-forcing Methods}

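-\parinterval Teacher-forcing requires the model's predictions to match the reference answers exactly. It is a deep learning training strategy that is widely used in sequence processing tasks ({\color{red} deep learning}). Taking sequence generation as an example, Teacher-forcing does not feed the model's output at the previous time step as the input at the next time step during training. Instead, it feeds the reference answer at the previous time step from the training data. Clearly, this leads to the exposure bias problem. To address it, non-Teacher-forcing methods can be used, mainly scheduled sampling and generative adversarial networks.
+\parinterval Teacher-forcing requires the model's predictions to match the reference answers exactly. It is a deep learning training strategy that is widely used in sequence processing tasks ({\color{red} deep learning}). Taking sequence generation as an example, Teacher-forcing does not feed the model's output at the previous time step as the input at the next time step during training. Instead, it feeds the reference answer at the previous time step from the training data. Clearly, this leads to the exposure bias problem. To address it, non-Teacher-forcing methods can be used. For example, beam search can be used during training so that the training process simulates the behavior at inference time. Specifically, non-Teacher-forcing methods can be implemented with scheduled sampling and generative adversarial networks.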

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -467,9 +467,7 @@ Loss_{\textrm{robust}}(\theta_{\textrm{mt}}) &=&  \frac{1}{N}\sum_{(\mathbi{x},\

 \subsubsection{1. Scheduled Sampling}

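-\parinterval The exposure bias problem can generally be alleviated with heuristic search methods such as beam search. That is, the training process can simulate the behavior at inference time.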
-
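-\parinterval For a target sequence $\seq{y}=\{{y}_1,{y}_2,\ldots,{y}_n\}$, when predicting the $j$-th word ${y}_j$, the main difference between training and inference is that training uses the reference answers $\{{y}_{1},...,{y}_{j-1}\}$, whereas inference uses the model's own predictions $\{\tilde{{y}}_{1},...,\tilde{{y}}_{j-1}\}$. In this case, a {\small\bfnew{scheduled sampling}}\index{Scheduled Sampling} mechanism ({\color{red} Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks}) can be adopted, which randomly decides during training whether to use ${y}_{j-1}$ or $\tilde{{y}}_{j-1}$. Suppose training uses mini-batch stochastic gradient descent. In the $i$-th batch, the prediction at each position of the sequence uses the reference answer with probability $\epsilon_i$, or the model's own prediction with probability ${(1-\epsilon_i)}^2$. For a particular position $j$ in the sequence, a word can be sampled according to the model's word prediction probabilities and, under the scheduling strategy controlled by $\epsilon_i$, serve as input alongside ${y}_{j-1}$. This process is shown in Figure \ref{fig:13-22}, and it can be combined well with beam search.
+\parinterval For a target sequence $\seq{y}=\{{y}_1,\ldots,{y}_n\}$, when predicting the $j$-th word ${y}_j$, the main difference between training and inference is that training uses the reference answers $\{{y}_{1},...,{y}_{j-1}\}$, whereas inference uses the model's own predictions $\{\tilde{{y}}_{1},...,\tilde{{y}}_{j-1}\}$. In this case, a {\small\bfnew{scheduled sampling}}\index{Scheduled Sampling} mechanism ({\color{red} Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks}) can be adopted, which randomly decides during training whether to use ${y}_{j-1}$ or $\tilde{{y}}_{j-1}$. Suppose training uses mini-batch stochastic gradient descent. In the $i$-th batch, the prediction at each position of the sequence uses the reference answer with probability $\epsilon_i$, or the model's own prediction with probability ${1-\epsilon_i}$. For a particular position $j$ in the sequence, a word can be sampled according to the model's word prediction probabilities and, under the scheduling strategy controlled by $\epsilon_i$, serve as input alongside ${y}_{j-1}$. This process is shown in Figure \ref{fig:13-22}, and it can be combined well with beam search.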

 %----------------------------------------------
 \begin{figure}[htp]
@@ -501,9 +499,9 @@ Translation}），该策略认为学习应该循序渐进，从一种状态逐

 \subsubsection{2. Generative Adversarial Networks}

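-\parinterval Scheduled sampling addresses exposure bias by feeding the model's past predictions as input and using them to predict the output at the next time step. However, if the predictions contain errors, using those erroneous results to predict the rest of the sequence also causes problems. The core question is how to tell whether a model prediction is good or bad, so that predictions can be used effectively in training: a good result can be used to train the model, and a bad one is not used. Generative adversarial networks are such a technique. They introduce an additional model (the discriminator) to evaluate the outputs of the original model (the generator), and train both models simultaneously based on the evaluation.
+\parinterval Scheduled sampling addresses exposure bias by using the model's predictions at the first $j-1$ steps as input to predict the output at step $j$. However, if the predictions contain errors, using those erroneous results to predict the rest of the sequence also causes problems. The core question is how to tell whether a model prediction is good or bad, so that predictions can be used effectively in training: a good result can be used to train the model, and a bad one is not used. Generative adversarial networks are such a technique. They introduce an additional model (the discriminator) to evaluate the outputs of the original model (the generator), and train both models simultaneously based on the evaluation.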

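-\parinterval Generative adversarial networks were already mentioned in Section \ref{sec:adversarial-examples} and are discussed in more detail here. In machine translation, the architecture based on adversarial neural networks is called {\small\bfnew{adversarial neural machine translation}}\index{Adversarial-NMT} (Adversarial-NMT) ({\color{red} Adversarial Neural Machine Translation}). Consider a bilingual training sentence pair $(\seq{x},\seq{y})=(\{{x}_1,{x}_2,\ldots,{x}_m\},\{{y}_1,{y}_2,\ldots,{y}_n\})$, where ${x}_i$ is the representation vector of the $i$-th word in the source sentence and ${y}_j$ is the representation vector of the $j$-th word in the target sentence. Let $\tilde{\seq{y}} = \{\tilde{{y}}_{1},\tilde{{y}}_{2},...,\tilde{{y}}_{J}\}$ denote the translation of the source sentence $\seq{x}$ produced by the neural machine translation system. The overall framework of adversarial neural machine translation is then shown in Figure \ref{fig:13-23}. The yellow part is the neural machine translation model $G$, which maps the source sentence $\seq{x}$ to a target sentence. The red part is the adversarial network $D$, which predicts whether a target sentence is a true translation of the source sentence $\seq{x}$. $G$ and $D$ compete with each other: translations $\tilde{\seq{y}}$ are generated to train $D$, and a reward signal is generated to train $G$ via policy gradient.
+\parinterval Generative adversarial networks were already mentioned in Section \ref{sec:adversarial-examples} and are discussed in more detail here. In machine translation, the architecture based on adversarial neural networks is called {\small\bfnew{adversarial neural machine translation}}\index{Adversarial-NMT} (Adversarial-NMT) ({\color{red} Adversarial Neural Machine Translation}). Here, let $(\seq{x},\seq{y})$ denote a training sample, and let $\tilde{\seq{y}}$ denote the translation of the source sentence $\seq{x}$ produced by the neural machine translation system. The overall framework of adversarial neural machine translation is then shown in Figure \ref{fig:13-23}. The yellow part is the neural machine translation model $G$, which translates the source sentence $\seq{x}$ into the target sentence $\tilde{\seq{y}}$. The red part is the adversarial network $D$, whose role is to judge whether a target sentence is a true translation of the source sentence $\seq{x}$. $G$ and $D$ compete with each other: translations $\tilde{\seq{y}}$ are generated to train $D$, and a reward signal is generated to train $G$ via policy gradient.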

 %----------------------------------------------
 \begin{figure}[htp]
@@ -516,8 +514,6 @@ Translation}），该策略认为学习应该循序渐进，从一种状态逐

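 \parinterval In fact, the training objective of adversarial neural machine translation is to force $\tilde{\seq{y}}$ to be similar to $\seq{y}$. Ideally, $\tilde{\seq{y}}$ is so similar to the human-annotated answer $\seq{y}$ that even humans cannot tell whether $\tilde{\seq{y}}$ was produced by a machine or by a human.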

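-\parinterval To achieve this goal, an additional adversarial network needs to be introduced\upcite{DBLP:conf/nips/GoodfellowPMXWOCB14}. In this framework, the original neural machine translation model is the generator network, and its training is assisted by the adversarial network. The goal of the adversarial network is to distinguish translations produced by the neural machine translation model from human translations, while the goal of the generator network is to produce high-quality translations that fool the adversarial network. The generator network and the adversarial network act as opponents and are trained jointly with the policy gradient method. So that the two networks improve each other's performance, the adversary's discriminative ability can be improved by learning from human-produced positive examples and negative examples taken from the neural machine translation model. In this way, the output of neural machine translation can be brought as close as possible to the reference answers.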
-
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------
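
The generator update described in the Adversarial-NMT paragraph of this commit ($D$ produces a reward for $(\seq{x},\tilde{\seq{y}})$, and $G$ is trained by policy gradient) can be sketched in a toy setting. The Python sketch below is not the book's or the paper's implementation. It assumes a fixed set of three candidate translations, a fixed hypothetical reward table standing in for $D$, and a plain REINFORCE update on a softmax policy standing in for $G$. Real adversarial training also updates $D$, which is omitted here.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator_reward(candidate_index):
    """Hypothetical stand-in for D(x, y~): a fixed score per candidate."""
    return [0.9, 0.2, 0.1][candidate_index]


def train_generator(steps=2000, lr=0.1, seed=0):
    """REINFORCE on a toy generator: a softmax policy over 3 candidates."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        # G samples a translation y~ from its current distribution.
        sampled = rng.choices(range(3), weights=probs)[0]
        # D scores the sampled translation; the score is the reward.
        reward = discriminator_reward(sampled)
        # Gradient of log p(sampled) w.r.t. logit k is onehot_k - p_k.
        for k in range(3):
            grad_log_prob = (1.0 if k == sampled else 0.0) - probs[k]
            logits[k] += lr * reward * grad_log_prob
    return softmax(logits)


final_probs = train_generator()
```

Starting from a uniform policy, the probability mass in `final_probs` should shift towards candidate 0, the one the stand-in discriminator rates highest. This is the sense in which the reward signal steers $G$ towards outputs that $D$ judges to be real translations.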