13.4

Commit 91a3564b authored Jan 01, 2021 by 单韦乔
--- a/Chapter13/Figures/figure-exposure-bias.tex
+++ b/Chapter13/Figures/figure-exposure-bias.tex
+
+
+%------------------------------------------------------------
+
+\begin{tikzpicture}
+
+\tikzstyle{rnnnode} = [draw,inner sep=2pt,minimum width=4em,minimum height=2em,rounded corners=1pt,fill=red!20]
+\tikzstyle{snode} = [draw,inner sep=2pt,minimum width=4em,minimum height=2em,rounded corners=1pt,fill=blue!20]
+\tikzstyle{ynode} = [inner sep=2pt,minimum width=4em,minimum height=2em,rounded corners=1pt]
+
+
+\node [anchor=west,rnnnode] (n1) at (0,0) {$\mathbi{h}_{1}$};
+\node [anchor=west] (n2) at ([xshift=3em,yshift=0em]n1.east) {$\cdots$};
+\node [anchor=west,rnnnode] (n3) at ([xshift=3em,yshift=0em]n2.east) {$\mathbi{h}_{j-1}$};
+\node [anchor=west,rnnnode] (n4) at ([xshift=3em,yshift=0em]n3.east) {$\mathbi{h}_{j}$};
+\node [anchor=south,snode] (n5) at ([xshift=0em,yshift=1em]n3.north) {Softmax};
+\node [anchor=south,ynode] (n6) at ([xshift=0em,yshift=1em]n5.north) {$\tilde{\mathbi{y}}_{j-1}$};
+\node [anchor=south,snode] (n7) at ([xshift=0em,yshift=1em]n4.north) {Softmax};
+\node [anchor=south,ynode] (n8) at ([xshift=0em,yshift=1em]n7.north) {$\tilde{\mathbi{y}}_{j}$};
+
+\node [anchor=north] (x1) at ([xshift=0em,yshift=-1em]n1.south) {$\seq{x}$};
+\node [anchor=north,font=\small,align=left] (x2) at ([xshift=-4em,yshift=-1.7em]n3.south) {采样出\\的$\tilde{\mathbi{y}}_{j-2}$};
+\node [anchor=north,font=\small,align=left] (x3) at ([xshift=2em,yshift=-2.5em]n3.south) {真实答\\案$\mathbi{y}_{j-2}$};
+\node [anchor=north,font=\small,align=left] (x4) at ([xshift=2em,yshift=-2.5em]n4.south) {真实答\\案$\mathbi{y}_{j-1}$};
+
+\node [anchor=south,inner sep=2pt] (st1) at (n6.north) {\scriptsize{\textbf{[step $j-1$]}}};
+\node [anchor=south,inner sep=2pt] (st2) at (n8.north) {\scriptsize{\textbf{[step $j$]}}};
+
+\node [anchor=north,font=\scriptsize] (e1) at ([xshift=-3em,yshift=-0em]n3.south) {$\funp{P}=1-\epsilon_i$};
+\node [anchor=north,font=\scriptsize] (e2) at ([xshift=2em,yshift=-0.1em]n3.south) {$\funp{P}=\epsilon_i$};
+\node [anchor=north,font=\scriptsize] (e3) at ([xshift=-2em,yshift=-1em]n4.south) {$\funp{P}=1-\epsilon_i$};
+\node [anchor=north,font=\scriptsize] (e4) at ([xshift=2em,yshift=-0.1em]n4.south) {$\funp{P}=\epsilon_i$};
+
+\node [anchor=south east,font=\small] (l1) at ([xshift=-1em,yshift=0.5em]n5.north west) {Loss};
+\node [anchor=south west,font=\small] (l2) at ([xshift=1em,yshift=0.5em]n7.north east) {Loss};
+
+\draw [->,thick] ([xshift=0em,yshift=0em]x1.north)--([xshift=0em,yshift=0em]n1.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n1.east)--([xshift=0em,yshift=0em]n2.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n2.east)--([xshift=0em,yshift=0em]n3.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n3.east)--([xshift=0em,yshift=0em]n4.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.east)--([xshift=3em,yshift=0em]n4.east);
+
+
+\draw [->,thick] ([xshift=0em,yshift=0em]n3.north)--([xshift=0em,yshift=0em]n5.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n5.north)--([xshift=0em,yshift=0em]n6.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.north)--([xshift=0em,yshift=0em]n7.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n7.north)--([xshift=0em,yshift=0em]n8.south);
+
+\draw [->,thick] ([xshift=0em,yshift=0em]l1.south) .. controls +(south:1em) and +(west:0.1em) .. ([xshift=0em,yshift=0em]n5.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]l2.south) .. controls +(south:1em) and +(east:0.1em) .. ([xshift=0em,yshift=0em]n7.east);
+
+\draw [->,thick,dotted] ([xshift=0em,yshift=-0.5em]x2.north east) .. controls +(east:1.5em) and +(south:0.2em) .. ([xshift=-0.5em,yshift=0em]n3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]x3.north) .. controls +(north:1em) and +(south:2em) .. ([xshift=0em,yshift=0em]n3.south);
+\draw [->,thick,dotted] ([xshift=0em,yshift=0em]n6.east) .. controls ([xshift=2em,yshift=1em]n6.east) and ([xshift=-2em,yshift=-5em]n4.south west) .. ([xshift=-0.5em,yshift=-0em]n4.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]x4.north) .. controls +(north:1em) and +(south:2em) .. ([xshift=0em,yshift=0em]n4.south);
+
+\end{tikzpicture}
--- a/Chapter13/Figures/figure-of-scheduling-sampling-method.tex
+++ b/Chapter13/Figures/figure-of-scheduling-sampling-method.tex
+
+
+%------------------------------------------------------------
+
+\begin{tikzpicture}
+
+\tikzstyle{rnnnode} = [draw,inner sep=2pt,minimum width=4em,minimum height=2em,rounded corners=1pt,fill=red!20]
+\tikzstyle{snode} = [draw,inner sep=2pt,minimum width=4em,minimum height=2em,rounded corners=1pt,fill=blue!20]
+\tikzstyle{ynode} = [inner sep=2pt,minimum width=4em,minimum height=2em,rounded corners=1pt]
+
+
+\node [anchor=west,rnnnode] (n1) at (0,0) {$\mathbi{h}_{1}$};
+\node [anchor=west] (n2) at ([xshift=3em,yshift=0em]n1.east) {$\cdots$};
+\node [anchor=west,rnnnode] (n3) at ([xshift=3em,yshift=0em]n2.east) {$\mathbi{h}_{j-1}$};
+\node [anchor=west,rnnnode] (n4) at ([xshift=3em,yshift=0em]n3.east) {$\mathbi{h}_{j}$};
+\node [anchor=south,snode] (n5) at ([xshift=0em,yshift=1em]n3.north) {Softmax};
+\node [anchor=south,ynode] (n6) at ([xshift=0em,yshift=1em]n5.north) {$\tilde{\mathbi{y}}_{j-1}$};
+\node [anchor=south,snode] (n7) at ([xshift=0em,yshift=1em]n4.north) {Softmax};
+\node [anchor=south,ynode] (n8) at ([xshift=0em,yshift=1em]n7.north) {$\tilde{\mathbi{y}}_{j}$};
+
+\node [anchor=north] (x1) at ([xshift=0em,yshift=-1em]n1.south) {$\seq{x}$};
+\node [anchor=north,font=\small,align=left] (x2) at ([xshift=-4em,yshift=-1.7em]n3.south) {采样出\\的$\tilde{\mathbi{y}}_{j-2}$};
+\node [anchor=north,font=\small,align=left] (x3) at ([xshift=2em,yshift=-2.5em]n3.south) {真实答\\案$\mathbi{y}_{j-2}$};
+\node [anchor=north,font=\small,align=left] (x4) at ([xshift=2em,yshift=-2.5em]n4.south) {真实答\\案$\mathbi{y}_{j-1}$};
+
+\node [anchor=south,inner sep=2pt] (st1) at (n6.north) {\scriptsize{\textbf{[step $j-1$]}}};
+\node [anchor=south,inner sep=2pt] (st2) at (n8.north) {\scriptsize{\textbf{[step $j$]}}};
+
+\node [anchor=north,font=\scriptsize] (e1) at ([xshift=-3em,yshift=-0em]n3.south) {$\funp{P}=1-\epsilon_i$};
+\node [anchor=north,font=\scriptsize] (e2) at ([xshift=2em,yshift=-0.1em]n3.south) {$\funp{P}=\epsilon_i$};
+\node [anchor=north,font=\scriptsize] (e3) at ([xshift=-2em,yshift=-1em]n4.south) {$\funp{P}=1-\epsilon_i$};
+\node [anchor=north,font=\scriptsize] (e4) at ([xshift=2em,yshift=-0.1em]n4.south) {$\funp{P}=\epsilon_i$};
+
+\node [anchor=south east,font=\small] (l1) at ([xshift=-1em,yshift=0.5em]n5.north west) {Loss};
+\node [anchor=south west,font=\small] (l2) at ([xshift=1em,yshift=0.5em]n7.north east) {Loss};
+
+\draw [->,thick] ([xshift=0em,yshift=0em]x1.north)--([xshift=0em,yshift=0em]n1.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n1.east)--([xshift=0em,yshift=0em]n2.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n2.east)--([xshift=0em,yshift=0em]n3.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n3.east)--([xshift=0em,yshift=0em]n4.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.east)--([xshift=3em,yshift=0em]n4.east);
+
+
+\draw [->,thick] ([xshift=0em,yshift=0em]n3.north)--([xshift=0em,yshift=0em]n5.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n5.north)--([xshift=0em,yshift=0em]n6.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.north)--([xshift=0em,yshift=0em]n7.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n7.north)--([xshift=0em,yshift=0em]n8.south);
+
+\draw [->,thick] ([xshift=0em,yshift=0em]l1.south) .. controls +(south:1em) and +(west:0.1em) .. ([xshift=0em,yshift=0em]n5.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]l2.south) .. controls +(south:1em) and +(east:0.1em) .. ([xshift=0em,yshift=0em]n7.east);
+
+\draw [->,thick,dotted] ([xshift=0em,yshift=-0.5em]x2.north east) .. controls +(east:1.5em) and +(south:0.2em) .. ([xshift=-0.5em,yshift=0em]n3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]x3.north) .. controls +(north:1em) and +(south:2em) .. ([xshift=0em,yshift=0em]n3.south);
+\draw [->,thick,dotted] ([xshift=0em,yshift=0em]n6.east) .. controls ([xshift=2em,yshift=1em]n6.east) and ([xshift=-2em,yshift=-5em]n4.south west) .. ([xshift=-0.5em,yshift=-0em]n4.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]x4.north) .. controls +(north:1em) and +(south:2em) .. ([xshift=0em,yshift=0em]n4.south);
+
+\end{tikzpicture}
--- a/Chapter13/Figures/figure-reinforcement-learning-method-based-on-actor-critic.tex
+++ b/Chapter13/Figures/figure-reinforcement-learning-method-based-on-actor-critic.tex
@@ -12,17 +12,23 @@
 	\node[anchor=south,inner sep=0mm,font=\small] (c1) at ([xshift=0em,yshift=1em]n2.north) {评论家$Q$};
 	\node[anchor=north,inner sep=0mm] (c2) at ([xshift=0em,yshift=-1em]n2.south) {$\mathbi{y}_1,\mathbi{y}_2,\ldots,\mathbi{y}_J$};

-	\node[anchor=west,inner sep=0mm] (n3) at ([xshift=2.1em,yshift=2em]n1.east) {$Q_1,Q_2,\ldots,Q_J$};
-	\node[anchor=west,inner sep=0mm] (n4) at ([xshift=2.9em,yshift=-0.4em]n1.east) {$\hat{\mathbi{y}}_1,\hat{\mathbi{y}}_2,\ldots,\hat{\mathbi{y}}_J$};
-	\node[anchor=west,inner sep=0mm,font=\small] (n5) at ([xshift=3em,yshift=-3em]n1.east) {演员状态};
+%	\node[anchor=west,inner sep=0mm] (n3) at ([xshift=2.1em,yshift=2em]n1.east) {$Q_1,Q_2,\ldots,Q_J$};
+%	\node[anchor=west,inner sep=0mm] (n4) at ([xshift=2.9em,yshift=-0.4em]n1.east) {$\hat{\mathbi{y}}_1,\hat{\mathbi{y}}_2,\ldots,\hat{\mathbi{y}}_J$};
+%	\node[anchor=west,inner sep=0mm,font=\small] (n5) at ([xshift=3em,yshift=-3em]n1.east) {演员状态};

 \draw [-,thick] ([xshift=0em,yshift=0em]n1.west) -- ([xshift=0em,yshift=0em]n1.east);
 \draw [-,thick] ([xshift=0em,yshift=0em]n2.west) -- ([xshift=0em,yshift=0em]n2.east);

-\draw [->,thick] ([xshift=0em,yshift=1em]n2.west) -- ([xshift=0em,yshift=1em]n1.east);
-\draw [->,thick] ([xshift=0em,yshift=0.5em]n1.east) -- ([xshift=0em,yshift=0.5em]n2.west);
+%\draw [->,thick] ([xshift=0em,yshift=1em]n2.west) -- ([xshift=0em,yshift=1em]n1.east);
+%\draw [->,thick] ([xshift=0em,yshift=0.5em]n1.east) -- ([xshift=0em,yshift=0.5em]n2.west);

-\draw [->,dotted,very thick] ([xshift=0em,yshift=0em]n1.east)  .. controls ([xshift=3em,yshift=-1em]n1.-90) and ([xshift=-3em,yshift=-1em]n2.-90) .. (n2.west);
+%\draw [->,dotted,very thick] ([xshift=0em,yshift=0em]n1.east)  .. controls ([xshift=3em,yshift=-1em]n1.-90) and ([xshift=-3em,yshift=-1em]n2.-90) .. (n2.west);
+
+	\node[anchor=west,inner sep=0mm] (n3) at ([xshift=2.1em,yshift=1em]n1.east) {$Q_1,Q_2,\ldots,Q_J$};
+	\node[anchor=west,inner sep=0mm] (n4) at ([xshift=2.9em,yshift=-1em]n1.east) {$\tilde{\mathbi{y}}_1,\tilde{\mathbi{y}}_2,\ldots,\tilde{\mathbi{y}}_J$};
+
+\draw [->,thick] ([xshift=0em,yshift=0.2em]n2.west) -- ([xshift=0em,yshift=0.2em]n1.east);
+\draw [->,thick] ([xshift=0em,yshift=-0.2em]n1.east) -- ([xshift=0em,yshift=-0.2em]n2.west);


 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter13/Figures/figure-word-root.tex
+++ b/Chapter13/Figures/figure-word-root.tex
@@ -11,8 +11,8 @@
 \node[anchor = north] (new_root) at ([yshift = -1.5em]newer.south) {new};
 \draw [->] ([yshift=0.2em]do_root.north) .. controls +(north:0.4) and +(south:0.6) ..(do.south);
 \draw [->] (do_root.north) -- (does.south);
-\draw [->] ([yshift=0.2em]do_root.north) .. controls +(north:0.4) and +(south:0.6) ..([yshift=0.1em]doing.south);
+\draw [->] ([yshift=0.2em]do_root.north) .. controls +(north:0.4) and +(south:0.6) ..(doing.south);
 \draw [->] ([yshift=0.2em]new_root.north) .. controls +(north:0.4) and +(south:0.6) ..(new.south);
 \draw [->] (new_root.north) -- (newer.south);
-\draw [->] ([yshift=0.2em]new_root.north) .. controls +(north:0.4) and +(south:0.6) ..([yshift=0.08em]newest.south);
+\draw [->] ([yshift=0.2em]new_root.north) .. controls +(north:0.4) and +(south:0.6) ..(newest.south);
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter13/chapter13.tex
+++ b/Chapter13/chapter13.tex
@@ -352,7 +352,7 @@ R(\mathbf{w}) & = & (\big| |\mathbf{w}| {\big|}_2)^2 \\

 \parinterval 在图像识别领域，研究人员就发现，对于输入图像的细小扰动，如像素变化等，会使模型以高置信度给出错误的预测\upcite{DBLP:conf/cvpr/NguyenYC15,DBLP:journals/corr/SzegedyZSBEGF13,DBLP:journals/corr/GoodfellowSS14}，但是这种扰动并不会造成人类的错误判断。也就是说，样本中的微小变化“欺骗”了图像识别系统，但是“欺骗”不了人类。这种现象背后的原因有很多，一种可能的原因是：系统并没有理解图像，而是在拟合数据，因此拟合能力越强，反而对数据中的微小变化更加敏感。从统计学习的角度看，既然新的数据中可能会有扰动，那更好的学习方式就是在训练中显性地把这种扰动建模出来，让模型对输入的细微变化表现得更加健壮。

-\parinterval 这种在原样本上增加一些难以察觉的扰动从而使模型得到错误判断的样本，被称为对抗样本。对于输入$\mathbi{x}$和输出$\mathbi{y}$，对抗样本形式上可以被描述为：
+\parinterval 这种在原样本上增加一些难以察觉的扰动从而使模型得到错误判断的样本，被称为对抗样本。对于模型的输入$\mathbi{x}$和输出$\mathbi{y}$，对抗样本形式上可以被描述为：
 \begin{eqnarray}
 \funp{C}(\mathbi{x}) &=& \mathbi{y}
 \label{eq:13-6}\\
@@ -362,7 +362,7 @@ R(\mathbf{w}) & = & (\big| |\mathbf{w}| {\big|}_2)^2 \\
 \label{eq:13-8}
 \end{eqnarray}

-\noindent 其中，$(\mathbi{x},\mathbi{y})$为原样本{\red（样本和上面的输入输出是不是不太统一？）}，$(\mathbi{x}',\mathbi{y})$为输入中含有扰动的对抗样本，函数$\funp{C}(\cdot)$为模型。公式\eqref{eq:13-8}中$\funp{R}(\mathbi{x},\mathbi{x}')$表示扰动后的输入$\mathbi{x}'$和原输入$\mathbi{x}$之间的距离，$\varepsilon$表示扰动的受限范围。当模型对包含噪声的数据容易给出较差的结果时，往往意味着该模型的抗干扰能力差，因此可以利用对抗样本检测现有模型的健壮性\upcite{DBLP:conf/emnlp/JiaL17}。同时，采用类似数据增强的方式将对抗样本混合至训练数据中，能够帮助模型学习到更普适的特征使模型得到稳定的输出，这种方式也被称为对抗训练\upcite{DBLP:journals/corr/GoodfellowSS14,DBLP:conf/emnlp/BekoulisDDD18,DBLP:conf/naacl/YasunagaKR18}。
+\noindent 其中，$(\mathbi{x}',\mathbi{y})$为输入中含有扰动的对抗样本，函数$\funp{C}(\cdot)$为模型。公式\eqref{eq:13-8}中$\funp{R}(\mathbi{x},\mathbi{x}')$表示扰动后的输入$\mathbi{x}'$和原输入$\mathbi{x}$之间的距离，$\varepsilon$表示扰动的受限范围。当模型对包含噪声的数据容易给出较差的结果时，往往意味着该模型的抗干扰能力差，因此可以利用对抗样本检测现有模型的健壮性\upcite{DBLP:conf/emnlp/JiaL17}。同时，采用类似数据增强的方式将对抗样本混合至训练数据中，能够帮助模型学习到更普适的特征使模型得到稳定的输出，这种方式也被称为对抗训练\upcite{DBLP:journals/corr/GoodfellowSS14,DBLP:conf/emnlp/BekoulisDDD18,DBLP:conf/naacl/YasunagaKR18}。

 \parinterval 通过对抗样本训练来提升模型健壮性的首要问题是：如何生成对抗样本。通过当前模型$\funp{C}$和样本$(\mathbi{x},\mathbi{y})$，生成对抗样本的过程，被称为{\small\bfnew{对抗攻击}}\index{对抗攻击}（Adversarial Attack）\index{Adversarial Attack}。对抗攻击可以被分为两种，分别是黑盒攻击和白盒攻击。在白盒攻击中，攻击算法可以访问模型的完整信息，包括模型结构、网络参数、损失函数、激活函数、输入和输出数据等。而黑盒攻击不需要知道神经网络的详细信息，仅仅通过访问模型的输入和输出就可以达到攻击目的，{\red 因此通常依赖启发式方法来生成对抗样本（Adversarial Examples for Evaluating Reading Comprehension Systems）}。由于神经网络对模型内部的参数干预度有限，其本身便是一个黑盒模型，并且黑盒攻击只需要在输入部分引入攻击信号，因此在神经网络的相关应用中黑盒攻击方法更加实用。

@@ -425,13 +425,13 @@ R(\mathbf{w}) & = & (\big| |\mathbf{w}| {\big|}_2)^2 \\
 \vspace{0.5em}
 \item 此外还可以使用基于梯度的方法来生成对抗样本。例如，可以利用替换词与原始单词词向量之间的差值，以及候选词的梯度之间的相似度来生成对抗样本\upcite{DBLP:conf/acl/ChengJM19}，具体的计算方式如下：
 \begin{eqnarray}
-{\mathbi{x}'}_i &=& \arg\max_{\mathbi{x}\in V}\textrm{sim}(\funp{e}(\mathbi{x})-\funp{e}(\mathbi{x}_i),\mathbi{g}_{\mathbi{x}_i})
+{{x}'}_i &=& \arg\max_{{x}\in V}\textrm{sim}(\funp{e}({x})-\funp{e}({x}_i),\mathbi{g}_{{x}_i})
 \label{eq:13-9} \\
-\mathbi{g}_{\mathbi{x}_i} &=&  \bigtriangledown_{\funp{e}(\mathbi{x}_i)} - \log \funp{P}(\mathbi{y}|\mathbi{x};\theta)
+\mathbi{g}_{{x}_i} &=&  \bigtriangledown_{\funp{e}({x}_i)} - \log \funp{P}(\mathbi{y}|\mathbi{x};\theta)
 \label{eq:13-10}
 \end{eqnarray}

-\noindent 其中，$\mathbi{x}_i$为输入中第$i$个词，$\mathbi{g}_{\mathbi{x}_i}$为对应的梯度向量，$\funp{e}(\cdot)$用于获取词向量，$\textrm{sim}(\cdot,\cdot)$是用于评估两个向量之间相似度（距离）的函数，$V$为源语的词表，$\bigtriangledown$表示求梯度操作，因此公式\eqref{eq:13-10}表示求$- \log \funp{P}(\mathbi{y}|\mathbi{x};\theta)$对$\funp{e}(\mathbi{x}_i)$的梯度。由于对词表中所有单词进行枚举的计算成本较大，因此利用语言模型选择最可能的$n$ 个词作为候选，进而缩减匹配范围，并从中采样出源语词进行替换是一种更有效的方式。同时，为了保护模型不受解码器预测误差的影响，此时需要对模型目标端的输入做出同样的调整。与在源语端操作不同的地方是，此时会将公式\eqref{eq:13-10}中的损失替换为$- \log \funp{P}(\mathbi{y}|\mathbi{x}')$。同时，在如何利用语言模型选择候选和采样方面，也做出了相应的调整。在进行对抗性训练时，在原有的训练损失上增加了三个额外的损失，最终的训练目标为：
+\noindent 其中，${x}_i$为输入中第$i$个词，$\mathbi{g}_{{x}_i}$为对应的梯度向量，$\funp{e}(\cdot)$用于获取词向量，$\textrm{sim}(\cdot,\cdot)$是用于评估两个向量之间相似度（距离）的函数，$V$为源语的词表，$\bigtriangledown$表示求梯度操作，因此公式\eqref{eq:13-10}表示求$- \log \funp{P}(\mathbi{y}|\mathbi{x};\theta)$对$\funp{e}({x}_i)$的梯度。由于对词表中所有单词进行枚举的计算成本较大，因此利用语言模型选择最可能的$n$ 个词作为候选，进而缩减匹配范围，并从中采样出源语词进行替换是一种更有效的方式。同时，为了保护模型不受解码器预测误差的影响，此时需要对模型目标端的输入做出同样的调整。与在源语端操作不同的地方是，此时会将公式\eqref{eq:13-10}中的损失替换为$- \log \funp{P}(\mathbi{y}|\mathbi{x}')$。同时，在如何利用语言模型选择候选和采样方面，也做出了相应的调整。在进行对抗性训练时，在原有的训练损失上增加了三个额外的损失，最终的训练目标为：
 \begin{eqnarray}
 Loss(\theta_{\textrm{mt}},\theta_{\textrm{lm}}^{\mathbi{x}},\theta_{\textrm{lm}}^{\mathbi{y}}) &=& Loss_{\textrm{clean}}(\theta_{\textrm{mt}}) + Loss_{\textrm{lm}}(\theta_{\textrm{lm}}^{\mathbi{x}}) + \nonumber \\
 & & Loss_{\textrm{robust}}(\theta_{\textrm{mt}}) + Loss_{\textrm{lm}}(\theta_{\textrm{lm}}^{\mathbi{y}})
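Note on equations \eqref{eq:13-9} and \eqref{eq:13-10} in the hunk above: the substitution rule can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the cited implementation. It assumes cosine similarity for $\textrm{sim}(\cdot,\cdot)$, a toy embedding matrix, a gradient vector $\mathbi{g}_{x_i}$ that has already been computed elsewhere, and a candidate list that a language model has already shortlisted. The names `cosine_sim`, `pick_substitute` and their parameters are hypothetical.

```python
import numpy as np

def cosine_sim(rows, vec, eps=1e-12):
    """Cosine similarity between each row of `rows` and the vector `vec`."""
    return (rows @ vec) / (np.linalg.norm(rows, axis=-1) * np.linalg.norm(vec) + eps)

def pick_substitute(embeddings, word_id, grad, candidates):
    """Return the candidate x maximizing sim(e(x) - e(x_i), g_{x_i}).

    embeddings: (|V|, d) matrix, row k is e(k)        (assumed toy lookup table)
    word_id:    id of the original word x_i
    grad:       (d,) gradient of -log P(y|x) w.r.t. e(x_i), computed elsewhere
    candidates: word ids shortlisted by a language model (assumed given)
    """
    # The original word gives a zero difference vector, so drop it.
    cands = [c for c in candidates if c != word_id]
    if not cands:
        return word_id  # nothing to substitute
    diffs = embeddings[cands] - embeddings[word_id]   # (n, d)
    scores = cosine_sim(diffs, grad)                  # (n,)
    return int(cands[int(np.argmax(scores))])
```

Restricting `candidates` to a short list mirrors the point in the paragraph above that enumerating the full vocabulary $V$ is too expensive.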
@@ -478,6 +478,14 @@ Loss_{\textrm{robust}}(\theta_{\textrm{mt}}) &=&  \frac{1}{N}\sum_{(\mathbi{x},\
 \label{fig:13-21}
 \end{figure}
 %----------------------------------------------
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter13/Figures/figure-exposure-bias}
+\caption{曝光偏置问题}
+\label{fig:13-21}
+\end{figure}
+%----------------------------------------------

 \item 训练目标函数与任务评价指标不一致问题：在训练过程中，在训练数据上进行极大似然估计，而在新数据上进行推断的时候，通常使用BLEU等外部评价指标来评价模型的性能。更加理想的情况是，模型应该直接最大化性能评价指标，而不是训练集数据上的似然函数。但是很多情况下，模型性能评价指标不可微分，这使得我们无法直接利用基于梯度的方法来优化模型。在机器翻译任务中，这个问题的一种体现是，训练数据上更低的困惑度不一定能带来BLEU的提升。
 \vspace{0.5em}
@@ -499,12 +507,12 @@ Loss_{\textrm{robust}}(\theta_{\textrm{mt}}) &=&  \frac{1}{N}\sum_{(\mathbi{x},\

 \parinterval 对于曝光偏置问题，一般可以使用束搜索等启发式搜索方法来进行缓解。也就是，训练过程可以模拟推断时的行为。但是即使使用束搜索，最终得到的有效序列数量很小，仍然无法完全解决训练和推断行为不一致的问题。

-\parinterval 对于一个目标序列$\seq{y}=\{\mathbi{y}_1,\mathbi{y}_2,\ldots,\mathbi{y}_n\}$，在预测第$j$个单词$\mathbi{y}_j$时，训练过程与推断过程之间的主要区别在于：训练过程中使用标准答案$\{\mathbi{y}_{1},...,\mathbi{y}_{j-1}\}$，而推断过程使用的是来自模型本身的预测结果$\{\tilde{\mathbi{y}}_{1},...,\tilde{\mathbi{y}}_{j-1}\}$。此时可以采取一种{\small\bfnew{调度采样}}\index{调度采样}（Scheduled Sampling\index{Scheduled Sampling}）机制（{\color{red} Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks}），在训练中随机决定使用$\mathbi{y}_{j-1}$还是$\tilde{\mathbi{y}}_{j-1}$。 假设训练时使用的是基于小批量的随机梯度下降方法，在第$i$ 个批次中，对序列每一个位置进行预测时以概率$\epsilon_i$使用标准答案，或以概率$1-\epsilon_i$使用来自模型本身的预测。具体到序列中的一个位置$j$，可以根据模型预测$\tilde{\mathbi{y}}_{j-1}$ 单词的概率进行采样，在$\epsilon_i$控制的调度策略下，同$\mathbi{y}_{j-1}$一起作为输入。此过程如图\ref{fig:13-22}所示，并且这个过程可以很好地与束搜索融合。{\red 图里的t换成j}
+\parinterval 对于一个目标序列$\seq{y}=\{\mathbi{y}_1,\mathbi{y}_2,\ldots,\mathbi{y}_n\}$，在预测第$j$个单词$\mathbi{y}_j$时，训练过程与推断过程之间的主要区别在于：训练过程中使用标准答案$\{\mathbi{y}_{1},...,\mathbi{y}_{j-1}\}$，而推断过程使用的是来自模型本身的预测结果$\{\tilde{\mathbi{y}}_{1},...,\tilde{\mathbi{y}}_{j-1}\}$。此时可以采取一种{\small\bfnew{调度采样}}\index{调度采样}（Scheduled Sampling\index{Scheduled Sampling}）机制（{\color{red} Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks}），在训练中随机决定使用$\mathbi{y}_{j-1}$还是$\tilde{\mathbi{y}}_{j-1}$。 假设训练时使用的是基于小批量的随机梯度下降方法，在第$i$ 个批次中，对序列每一个位置进行预测时以概率$\epsilon_i$使用标准答案，或以概率$1-\epsilon_i$使用来自模型本身的预测。具体到序列中的一个位置$j$，可以根据模型预测$\tilde{\mathbi{y}}_{j-1}$ 单词的概率进行采样，在$\epsilon_i$控制的调度策略下，同$\mathbi{y}_{j-1}$一起作为输入。此过程如图\ref{fig:13-22}所示，并且这个过程可以很好地与束搜索融合。{\red （这样画行不行，另外h1上面为啥没有生成第一个单词？此外，这是什么模型，哪里是解码器，h1是编码器？）}

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
-\includegraphics[scale=1]{./Chapter13/Figures/figure-of-scheduling-sampling-method.png}
+\input{./Chapter13/Figures/figure-of-scheduling-sampling-method}
 \caption{调度采样方法的示意图}
 \label{fig:13-22}
 \end{figure}
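Note on the scheduled sampling paragraph in the hunk above: the per-position choice between $\mathbi{y}_{j-1}$ and $\tilde{\mathbi{y}}_{j-1}$ is a coin flip with probability $\epsilon_i$, which could be sketched as follows. This is a simplified sketch, not the book's or the cited paper's code. The linear decay in `epsilon_schedule` is only one hypothetical choice of schedule (the text does not fix one), and `mix_decoder_inputs` assumes the model predictions are already available as a list, whereas in real training each $\tilde{\mathbi{y}}_{j-1}$ is produced step by step.

```python
import random

def epsilon_schedule(i, total, floor=0.0):
    """Hypothetical linear decay of eps_i from 1.0 down to `floor` over `total` batches."""
    return max(floor, 1.0 - i / total)

def mix_decoder_inputs(gold_prev, model_prev, eps, rng):
    """Pick the decoder input for every position j.

    With probability eps the gold token y_{j-1} is fed in,
    otherwise the model's own prediction ~y_{j-1} is used.
    """
    return [g if rng.random() < eps else m
            for g, m in zip(gold_prev, model_prev)]

# Usage: early batches mostly see gold tokens, late batches mostly model tokens.
rng = random.Random(0)
gold = ["a", "b", "c", "d"]
pred = ["A", "B", "C", "D"]
early = mix_decoder_inputs(gold, pred, epsilon_schedule(0, 10), rng)
late = mix_decoder_inputs(gold, pred, epsilon_schedule(10, 10), rng)
```

Because the two branches are mutually exclusive, their probabilities are $\epsilon_i$ and $1-\epsilon_i$ and always sum to one.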
@@ -623,57 +631,55 @@ Where did my optimum go?:  An empiricalanalysis of gradient descent optimization

 \subsubsection{2. 基于演员-评论家的强化学习方法}

-\parinterval 基于策略的强化学习是要寻找一个策略$\funp{p}(a|\hat{\mathbi{y}}_{1 \ldots j},\seq{x})$，使得该策略选择的行动$a$未来可以获得的奖励期望（也被称为动作价值函数）最大化（{\color{red} 句子里的专有名词要加黑，加英文}）。这个过程通常用函数$Q$来描述：
-
+\parinterval 基于策略的强化学习是要寻找一个策略$\funp{p}(a|\tilde{\mathbi{y}}_{1 \ldots j},\seq{x})$，使得该策略选择的行动$a$未来可以获得的奖励期望（也被称为动作价值函数）最大化（{\color{red} 句子里的专有名词要加黑，加英文}）。这个过程通常用函数$Q$来描述：
 \begin{eqnarray}
-\funp{Q}(a;\hat{\mathbi{y}}_{1 \ldots j},\seq{y}) & = & \mathbb{E}_{\hat{\mathbi{y}}_{j+1 \ldots J} \sim \funp{p}(a|\hat{\mathbi{y}}_{1 \ldots j} a,\seq{x})}[\funp{r}_j(a;\hat{\mathbi{y}}_{1 \ldots j-1},\seq{y}) + \nonumber \\
-&  & \sum_{i=j+1}^J\funp{r}_i(\hat{\mathbi{y}}_i;\hat{\mathbi{y}}_{1 \ldots i-1}a\hat{\mathbi{y}}_{j+1 \ldots i},\seq{y})]
+\funp{Q}(a;\tilde{\mathbi{y}}_{1 \ldots j},\seq{y}) & = & \mathbb{E}_{\tilde{\mathbi{y}}_{j+1 \ldots J} \sim \funp{p}(a|\tilde{\mathbi{y}}_{1 \ldots j} a,\seq{x})}[\funp{r}_j(a;\tilde{\mathbi{y}}_{1 \ldots j-1},\seq{y}) + \nonumber \\
+&  & \sum_{i=j+1}^J\funp{r}_i(\tilde{\mathbi{y}}_i;\tilde{\mathbi{y}}_{1 \ldots i-1}a\tilde{\mathbi{y}}_{j+1 \ldots i},\seq{y})]
 \label{eq:13-35}
 \end{eqnarray}

-\noindent 其中，$\funp{r}_j(a;\hat{\mathbi{y}}_{1 \ldots j-1},\seq{y})$是$j$时刻做出行动$a$获得的奖励，$\seq{x}$是源语言句子，$\seq{y}$是正确译文，$\hat{\mathbi{y}}_{1 \ldots j}$是策略$\funp{p}$产生的译文的前$j$个词，$J$是生成译文的长度。其（在一个源语句子$x$上的）定义的目标为：{\red 这里的$\hat{y}$仍然是模型生成的译文吧，能否用$\tilde{y}$代替$\hat{y}$。在书里面$\hat{y}$都是表示最优解，不换的话，下面公式中的$\hat{p}$和$\hat{\mathbi{y}}$可能有冲突？}
-
+\noindent 其中，$\funp{r}_j(a;\tilde{\mathbi{y}}_{1 \ldots j-1},\seq{y})$是$j$时刻做出行动$a$获得的奖励，$\seq{x}$是源语言句子，$\seq{y}$是正确译文，$\tilde{\mathbi{y}}_{1 \ldots j}$是策略$\funp{p}$产生的译文的前$j$个词，$J$是生成译文的长度。其（在一个源语句子$x$上的）定义的目标为：
 \begin{eqnarray}
-\hat{p} & = & \max_{\funp{p}}\mathbb{E}_{\hat{\seq{y}} \sim \funp{p}(\hat{\seq{y}} | \seq{x})}\sum_{j=1}^J\sum_{a \in A}\funp{p}(a|\hat{\mathbi{y}}_{1 \ldots j},\seq{x})\funp{Q}(a;\hat{\mathbi{y}}_{1 \ldots j},\seq{y})
+\hat{p} & = & \max_{\funp{p}}\mathbb{E}_{\tilde{\seq{y}} \sim \funp{p}(\tilde{\seq{y}} | \seq{x})}\sum_{j=1}^J\sum_{a \in A}\funp{p}(a|\tilde{\mathbi{y}}_{1 \ldots j},\seq{x})\funp{Q}(a;\tilde{\mathbi{y}}_{1 \ldots j},\seq{y})
 \label{eq:13-15}
 \end{eqnarray}

 \noindent 其中，$A$表示所有可能的行动组成的空间，也就是词表$V$。公式\eqref{eq:13-15}的含义是，计算动作价值函数$\funp{Q}$需要枚举$j$时刻以后所有可能的序列，而可能的序列数目是随着其长度呈指数级增长，因此只能用估计的方法计算$\funp{Q}$的值。基于策略的强化学习方法，如最小风险训练（风险$\vartriangle=-\funp{Q}$）等都使用了采样的方法来估计$\funp{Q}$。尽管采样估计的结果是$\funp{Q}$的无偏估计，但是它的缺点在于估计的方差比较大。而$\funp{Q}$直接关系到梯度更新的大小，不稳定的数值会导致模型更新不稳定，难以优化。

-\parinterval 为了避免采样的开销和随机性带来的不稳定，基于{\small\bfnew{演员-评论家}}\index{演员-评论家}（Actor-critic\index{Actor-critic}）的强化学习方法\upcite{DBLP:conf/iclr/BahdanauBXGLPCB17}引入一个可学习的函数$\hat{\funp{Q}}$，通过函数$\hat{\funp{Q}}$来逼近动作价值函数$\funp{Q}$。但是由于$\hat{\funp{Q}}$是人工设计的一个函数，该函数有着自身的偏置，因此$\hat{\funp{Q}}$不是$\funp{Q}$的一个无偏估计，所以使用$\hat{\funp{Q}}$来指导$\funp{p}$的优化无法到达理论上的最优解。尽管如此，得益于神经网络强大的拟合能力，基于演员-评论家的强化学习方法仍更具优势。
+\parinterval 为了避免采样的开销和随机性带来的不稳定，基于{\small\bfnew{演员-评论家}}\index{演员-评论家}（Actor-critic\index{Actor-critic}）的强化学习方法\upcite{DBLP:conf/iclr/BahdanauBXGLPCB17}引入一个可学习的函数$\tilde{\funp{Q}}$，通过函数$\tilde{\funp{Q}}$来逼近动作价值函数$\funp{Q}$。但是由于$\tilde{\funp{Q}}$是人工设计的一个函数，该函数有着自身的偏置，因此$\tilde{\funp{Q}}$不是$\funp{Q}$的一个无偏估计，所以使用$\tilde{\funp{Q}}$来指导$\funp{p}$的优化无法到达理论上的最优解。尽管如此，得益于神经网络强大的拟合能力，基于演员-评论家的强化学习方法仍更具优势。

-\parinterval 对于基于演员-评论家的强化学习方法，演员就是策略$\funp{p}$，而评论家就是动作价值函数$\funp{Q}$的估计$\hat{\funp{Q}}$。对于演员，它的目标函数如下：
+\parinterval 对于基于演员-评论家的强化学习方法，演员就是策略$\funp{p}$，而评论家就是动作价值函数$\funp{Q}$的估计$\tilde{\funp{Q}}$。对于演员，它的目标函数如下：
 \begin{eqnarray}
-\hat{p} & = & \max_{\funp{p}}\mathbb{E}_{\hat{\seq{y}} \sim \funp{p}(\hat{\seq{y}} | \seq{x})}\sum_{j=1}^J\sum_{a \in A}\funp{p}(a|\hat{\mathbi{y}}_{1 \ldots j},\seq{x})\hat{\funp{Q}}(a;\hat{\mathbi{y}}_{1 \ldots j},\seq{y})
+\hat{p} & = & \max_{\funp{p}}\mathbb{E}_{\tilde{\seq{y}} \sim \funp{p}(\tilde{\seq{y}} | \seq{x})}\sum_{j=1}^J\sum_{a \in A}\funp{p}(a|\tilde{\mathbi{y}}_{1 \ldots j},\seq{x})\tilde{\funp{Q}}(a;\tilde{\mathbi{y}}_{1 \ldots j},\seq{y})
 \label{eq:13-16}
 \end{eqnarray}

-\parinterval 与公式\eqref{eq:13-15}对比可以发现，基于演员-评论家的强化学习方法与基于策略的强化学习方法类似，公式\eqref{eq:13-16}对动作价值函数$\funp{Q}$的估计从采样换成了$\hat{\funp{Q}}$（{\color{red} 公式\eqref{eq:13-15}哪里体现出了采样？而公式\eqref{eq:13-16}哪里体现出不需要采样？}）。对于目标函数里的期望，通常使用采样的方式来进行逼近，例如，选择一定量的$\hat{y}$来计算期望，而不是遍历所有的$\hat{y}$。借助与最小风险训练类似的方法，可以计算对$\funp{p}$的梯度来优化演员。
+\parinterval 与公式\eqref{eq:13-15}对比可以发现，基于演员-评论家的强化学习方法与基于策略的强化学习方法类似，公式\eqref{eq:13-16}对动作价值函数$\funp{Q}$的估计从采样换成了$\tilde{\funp{Q}}$（{\color{red} 公式\eqref{eq:13-15}哪里体现出了采样？而公式\eqref{eq:13-16}哪里体现出不需要采样？}）。对于目标函数里的期望，通常使用采样的方式来进行逼近，例如，选择一定量的$\tilde{y}$来计算期望，而不是遍历所有的$\tilde{y}$。借助与最小风险训练类似的方法，可以计算对$\funp{p}$的梯度来优化演员。

-\parinterval 而对于评论家，它的优化目标并不是那么显而易见。尽管可以通过采样的方式来估计$\funp{Q}$，然后使用该估计作为目标让$\hat{\funp{Q}}$进行拟合，但是这样会导致非常高的（采样）代价，同时可以想象，既然有了一个无偏估计，为什么还要用有偏估计$\hat{\funp{Q}}$呢？
+\parinterval 而对于评论家，它的优化目标并不是那么显而易见。尽管可以通过采样的方式来估计$\funp{Q}$，然后使用该估计作为目标让$\tilde{\funp{Q}}$进行拟合，但是这样会导致非常高的（采样）代价，同时可以想象，既然有了一个无偏估计，为什么还要用有偏估计$\tilde{\funp{Q}}$呢？

 \parinterval 回顾动作价值函数的定义，可以对它做适当的展开，可以得到如下等式：
 \begin{eqnarray}
-\funp{Q}(\hat{\mathbi{y}}_j;\hat{\mathbi{y}}_{1 \ldots j -1},\seq{y}) & = & \funp{r}_j(\hat{\mathbi{y}}_j;\hat{\mathbi{y}}_{1 \ldots j-1},\seq{y}) + \nonumber \\
-&   & \sum_{a \in A}\funp{p}(a|\hat{\mathbi{y}}_{1 \ldots j},\seq{x})\funp{Q}(a;\hat{\mathbi{y}}_{1 \ldots j},\seq{y})
+\funp{Q}(\tilde{\mathbi{y}}_j;\tilde{\mathbi{y}}_{1 \ldots j -1},\seq{y}) & = & \funp{r}_j(\tilde{\mathbi{y}}_j;\tilde{\mathbi{y}}_{1 \ldots j-1},\seq{y}) + \nonumber \\
+&   & \sum_{a \in A}\funp{p}(a|\tilde{\mathbi{y}}_{1 \ldots j},\seq{x})\funp{Q}(a;\tilde{\mathbi{y}}_{1 \ldots j},\seq{y})
 \label{eq:13-17}
 \end{eqnarray}

-\parinterval 这个等式也被称为{\small\bfnew{贝尔曼方程}}\index{贝尔曼方程}（Bellman Equation\index{Bellman Equation}）\upcite{sutton2018reinforcement}。这个等式告诉我们$j-1$时刻的动作价值函数$\funp{Q}(\hat{\mathbi{y}}_j;\hat{\mathbi{y}}_{1 \ldots j-1},\seq{y})$跟下一时刻$j$的动作价值函数$\funp{Q}(a;\hat{\mathbi{y}}_{1 \ldots j},\seq{y})$之间的关系。因此可以很自然地使用等式右部作为等式左部$\funp{Q}(\hat{\mathbi{y}}_j;\hat{\mathbi{y}}_{1 \ldots j-1},\seq{y})$的目标。而由于动作价值函数的输出是数值，通常会选用均方误差来计算目标函数值（{\color{red} 为啥输出是数值，就要用均方误差来计算目标函数？}）。
+\parinterval 这个等式也被称为{\small\bfnew{贝尔曼方程}}\index{贝尔曼方程}（Bellman Equation\index{Bellman Equation}）\upcite{sutton2018reinforcement}。这个等式告诉我们$j-1$时刻的动作价值函数$\funp{Q}(\tilde{\mathbi{y}}_j;\tilde{\mathbi{y}}_{1 \ldots j-1},\seq{y})$跟下一时刻$j$的动作价值函数$\funp{Q}(a;\tilde{\mathbi{y}}_{1 \ldots j},\seq{y})$之间的关系。因此可以很自然地使用等式右部作为等式左部$\funp{Q}(\tilde{\mathbi{y}}_j;\tilde{\mathbi{y}}_{1 \ldots j-1},\seq{y})$的目标。而由于动作价值函数的输出是数值，通常会选用均方误差来计算目标函数值（{\color{red} 为啥输出是数值，就要用均方误差来计算目标函数？}）。

 \parinterval 进一步，可以定义$j$时刻动作价值函数的目标如下：
 \begin{eqnarray}
-\funp{q}_j & = &  \funp{r}_j(\hat{\mathbi{y}}_j;\hat{\mathbi{y}}_{1 \ldots j-1},\seq{y}) + \sum_{a \in A}\funp{p}(a|\hat{\mathbi{y}}_{1 \ldots j},\seq{x})\hat{\funp{Q}}(a;\hat{\mathbi{y}}_{1 \ldots j},\seq{y})
+\funp{q}_j & = &  \funp{r}_j(\tilde{\mathbi{y}}_j;\tilde{\mathbi{y}}_{1 \ldots j-1},\seq{y}) + \sum_{a \in A}\funp{p}(a|\tilde{\mathbi{y}}_{1 \ldots j},\seq{x})\tilde{\funp{Q}}(a;\tilde{\mathbi{y}}_{1 \ldots j},\seq{y})
 \label{eq:13-18}
 \end{eqnarray}

 \parinterval 而评论家对应的目标函数定义如下：
 \begin{eqnarray}
-\hat{\hat{\funp{Q}}} & = & \min_{\hat{\funp{Q}}}\sum_{j=1}^J{(\hat{\funp{Q}}(\hat{\mathbi{y}}_j;\hat{\mathbi{y}}_{1 \ldots j-1},\seq{y}) - \funp{q}_j)}^2
+\hat{\tilde{\funp{Q}}} & = & \min_{\tilde{\funp{Q}}}\sum_{j=1}^J{(\tilde{\funp{Q}}(\tilde{\mathbi{y}}_j;\tilde{\mathbi{y}}_{1 \ldots j-1},\seq{y}) - \funp{q}_j)}^2
 \label{eq:13-19}
 \end{eqnarray}

-\parinterval 最后，通过同时优化演员和评论家直到收敛，获得的演员也就是策略$\funp{p}$就是我们期望的翻译模型。图\ref{fig:13-25}展示了演员和评论家的关系。{\red （演员状态没提到）}
+\parinterval 最后，通过同时优化演员和评论家直到收敛，获得的演员也就是策略$\funp{p}$就是我们期望的翻译模型。图\ref{fig:13-25}展示了演员和评论家的关系。

 %----------------------------------------------
 \begin{figure}[htp]
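Note on equations \eqref{eq:13-18} and \eqref{eq:13-19} in the hunk above: the critic target $\funp{q}_j$ and the squared-error objective can be written out numerically as below. This is a minimal sketch with toy arrays, not the cited actor-critic implementation. It assumes the rewards, the actor's distribution over the action space $A$, and the critic's estimates $\tilde{\funp{Q}}$ have already been computed for one sampled translation, and that the expected future value after the last step $J$ is zero. The function names and array layouts are hypothetical.

```python
import numpy as np

def critic_targets(rewards, policy_probs, q_next):
    """q_j = r_j + sum_a p(a | ~y_{1..j}, x) * ~Q(a; ~y_{1..j}, y).

    rewards:      (J,)     r_j for the sampled word at step j
    policy_probs: (J, |A|) actor distribution over the next action after step j
    q_next:       (J, |A|) critic estimates for those next actions
                           (assumed all zeros in the last row, after step J)
    """
    return rewards + np.sum(policy_probs * q_next, axis=1)

def critic_loss(q_pred, targets):
    """Sum over j of (~Q(~y_j; ~y_{1..j-1}, y) - q_j)^2, with targets held fixed."""
    return float(np.sum((q_pred - targets) ** 2))

# Toy example with J = 2 steps and |A| = 2 actions.
rewards = np.array([0.0, 1.0])
policy_probs = np.array([[0.5, 0.5], [0.5, 0.5]])
q_next = np.array([[2.0, 4.0], [0.0, 0.0]])
targets = critic_targets(rewards, policy_probs, q_next)
loss = critic_loss(np.array([2.5, 1.0]), targets)
```

In a real training loop the targets would typically be treated as constants when differentiating the loss with respect to the critic's parameters, which is the reason for the separate `targets` argument.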
@@ -692,7 +698,7 @@ Where did my optimum go?:  An empiricalanalysis of gradient descent optimization
 \vspace{0.5em}
 \item 优化目标：评论家的优化目标是由自身输出所构造。当模型更新比较快的时候模型的输出变化也会很快，导致构造的优化目标不稳定，影响模型收敛。一个解决方案是在一定更新次数内固定构造优化目标使用的模型，然后再使用比较新的模型来构造后续一定更新次数内的优化目标，如此往复\upcite{DBLP:journals/nature/SilverHMGSDSAPL16}。
 \vspace{0.5em}
-\item 方差惩罚：在机器翻译中使用强化学习方法的一个问题是动作空间过大，这是由词表过大造成的。因为模型只根据被采样到的结果来进行更新，很多动作很难得到更新，因此对不同动作的动作价值函数估计值会有很大差异。通常会引入一个正则项$C_j = \sum_{a \in A}{(\hat{\funp{Q}}(a;\hat{\mathbi{y}}_{1 \ldots j-1},\seq{y}) - \frac{1}{|A|} \sum_{b \in A}\hat{\funp{Q}}(b;\hat{\mathbi{y}}_{1 \ldots j-1},\seq{y}))}^2$来约束不同动作的动作价值函数估计值，使其不会偏离它们的均值太远\upcite{DBLP:conf/icml/ZarembaMJF16}。
+\item 方差惩罚：在机器翻译中使用强化学习方法的一个问题是动作空间过大，这是由词表过大造成的。因为模型只根据被采样到的结果来进行更新，很多动作很难得到更新，因此对不同动作的动作价值函数估计值会有很大差异。通常会引入一个正则项$C_j = \sum_{a \in A}{(\tilde{\funp{Q}}(a;\tilde{\mathbi{y}}_{1 \ldots j-1},\seq{y}) - \frac{1}{|A|} \sum_{b \in A}\tilde{\funp{Q}}(b;\tilde{\mathbi{y}}_{1 \ldots j-1},\seq{y}))}^2$来约束不同动作的动作价值函数估计值，使其不会偏离它们的均值太远\upcite{DBLP:conf/icml/ZarembaMJF16}。
 \vspace{0.5em}
 \item 函数塑形：在机器翻译里面使用强化学习方法另一个问题就是奖励的稀疏性。评价指标如BLEU等只能对完整的句子进行打分，也就是奖励只有在句子结尾有值，而在句子中间只能为0。这种情况意味着模型在生成句子的过程中没有任何信号来指导它的行为，从而大大增加了学习难度。常见的解决方案是进行{\small\bfnew{函数塑形}}\index{函数塑形}（Reward Shaping\index{Reward Shaping}），使得奖励在生成句子的过程中变得稠密，同时也不会改变模型的最优解\upcite{DBLP:conf/icml/NgHR99}。
 \vspace{0.5em}
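Note on the variance penalty and reward shaping items in the hunk above: both can be illustrated with a few lines of NumPy. This is a rough sketch under stated assumptions, not the cited methods' code. `variance_penalty` follows the regularizer $C_j$ as written in the text for a single time step. `shaped_rewards` assumes one common shaping choice, the difference between the scores of consecutive prefixes, which the text itself does not specify. Both function names are hypothetical.

```python
import numpy as np

def variance_penalty(q_values):
    """C_j = sum_a (~Q(a) - mean_b ~Q(b))^2 over all actions at one time step."""
    q = np.asarray(q_values, dtype=float)
    return float(np.sum((q - q.mean()) ** 2))

def shaped_rewards(prefix_scores):
    """Dense per-step rewards from prefix scores (assumed shaping choice).

    prefix_scores[j] is the score of the first j generated words,
    with prefix_scores[0] the score of the empty prefix.
    The step reward is the score gained by adding word j.
    """
    s = np.asarray(prefix_scores, dtype=float)
    return s[1:] - s[:-1]

# The shaped rewards telescope: their sum equals final score minus initial score,
# so the total reward of a complete sentence is unchanged.
steps = shaped_rewards([0.0, 0.2, 0.5, 0.4])
```

The telescoping sum is what keeps the sentence-level return the same while giving the model a signal at every step.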