13章公式

9a59e13c · 单韦乔 · e1c66a7c · 9a59e13c
Commit 9a59e13c authored Jan 20, 2021 by 单韦乔
--- a/Chapter13/chapter13.tex
+++ b/Chapter13/chapter13.tex
@@ -570,14 +570,14 @@ Loss_{\textrm{robust}}(\theta_{\textrm{mt}}) &=&  \frac{1}{N}\sum_{(\mathbi{x},\

 \subsubsection{2. 演员-评论家方法}

-\parinterval 基于策略的强化学习是要寻找一个策略$\funp{p}(a|\hat{{y}}_{1 \ldots j},\seq{x})$，使得该策略选择的行动$a$未来可以获得的奖励期望（也被称为{\small\bfnew{动作价值函数}}\index{动作价值函数}（Action-value Function）\index{Action-value Function}）最大化。这个过程通常用函数$Q$来描述：
+\parinterval 基于策略的强化学习是要寻找一个策略$\funp{p}(a|\hat{{y}}_{1 \ldots j-1},\seq{x})$，使得该策略选择的行动$a$未来可以获得的奖励期望最大化，也被称为{\small\bfnew{动作价值函数}}\index{动作价值函数}（Action-value Function）\index{Action-value Function}最大化。这个过程通常用函数$Q$来描述：
 \begin{eqnarray}
-\funp{Q}(a;\hat{y}_{1 \ldots j},\seq{y}) & = & \mathbb{E}_{\hat{y}_{j+1 \ldots J} \sim \funp{p}(a|\hat{y}_{1 \ldots j} a,\seq{x})}[\funp{r}_j(a;\hat{y}_{1 \ldots j-1},\seq{y}) + \nonumber \\
-&  & \sum_{i=j+1}^J\funp{r}_i(\hat{{y}}_i;\hat{{y}}_{1 \ldots j-1}a\hat{{y}}_{j+1 \ldots i-1},\seq{y})]
+\funp{Q}(a;\hat{y}_{1 \ldots j-1},\seq{y}) & = & \mathbb{E}_{\hat{y}_{j+1 \ldots J} \sim \funp{p}(\cdot|\hat{y}_{1 \ldots j-1} a,\seq{x})}[\funp{r}_j(a;\hat{y}_{1 \ldots j-1},\seq{y}) + \nonumber \\
+&  & \sum_{i=j+1}^J\funp{r}_i(\hat{{y}}_i;\hat{{y}}_{1 \ldots j-1}a\hat{{y}}_{j+1 \ldots i},\seq{y})]
 \label{eq:13-16}
 \end{eqnarray}

-\noindent 其中，$\funp{r}_j(a;\hat{{y}}_{1 \ldots j-1},\seq{y})$是$j$时刻做出行动$a$获得的奖励，$\funp{r}_i(\hat{{y}}_i;\hat{{y}}_{1 \ldots j-1}a\hat{{y}}_{j+1 \ldots i-1},\seq{y})$是在$j$时刻的行动为$a$的前提下，$i$时刻的做出行动$\hat{{y}}_i$获得的奖励，$\seq{x}$是源语言句子，$\seq{y}$是正确译文，$\hat{{y}}_{1 \ldots j}$是策略$\funp{p}$产生的译文的前$j$个词，$J$是生成译文的长度。对于源语句子$x$，最优策略$\hat{p}$可以被定义为：
+\noindent 其中，$\funp{r}_j(a;\hat{{y}}_{1 \ldots j-1},\seq{y})$是$j$时刻做出行动$a$获得的奖励，$\funp{r}_i(\hat{{y}}_i;\hat{{y}}_{1 \ldots j-1}a\hat{{y}}_{j+1 \ldots i},\seq{y})$是在$j$时刻的行动为$a$的前提下，$i$时刻的做出行动$\hat{{y}}_i$获得的奖励，$\seq{x}$是源语言句子，$\seq{y}$是正确译文，$\hat{{y}}_{1 \ldots j-1}$是策略$\funp{p}$产生的译文的前$j-1$个词，$J$是生成译文的长度。对于源语句子$x$，最优策略$\hat{p}$可以被定义为：
 \begin{eqnarray}
 \hat{p} & = & \argmax_{\funp{p}}\mathbb{E}_{\hat{\seq{y}} \sim \funp{p}(\hat{\seq{y}} | \seq{x})}\sum_{j=1}^J\sum_{a \in A}\funp{p}(a|\hat{{y}}_{1 \ldots j},\seq{x})\funp{Q}(a;\hat{{y}}_{1 \ldots j},\seq{y})
 \label{eq:13-17}