Chapter 10 editorial feedback

Commit cb7e9d5b authored Dec 26, 2020 by zengxin
--- a/Chapter10/Figures/figure-calculation-process-of-context-vector-c.tex
+++ b/Chapter10/Figures/figure-calculation-process-of-context-vector-c.tex
@@ -36,7 +36,7 @@
 \node [anchor=north] (enc2) at (h2.south) {\scriptsize{编码器输出}};
 \node [anchor=north] (enc22) at ([yshift=0.5em]enc2.south) {\scriptsize{(位置$2$)}};
 \node [anchor=north] (enc4) at (h4.south) {\scriptsize{编码器输出}};
-\node [anchor=north] (enc42) at ([yshift=0.5em]enc4.south) {\scriptsize{(位置$4$)}};
+\node [anchor=north] (enc42) at ([yshift=0.5em]enc4.south) {\scriptsize{(位置$m$)}};
 {
 \node [anchor=west] (math1) at ([xshift=5em,yshift=1em]th2.east) {$\mathbi{C}_j = \sum_{i} \alpha_{i,j} \mathbi{h}_i $ \ \ \ \ };

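The figure touched in the hunk above labels the formula $\mathbi{C}_j = \sum_{i} \alpha_{i,j} \mathbi{h}_i$, i.e. the context vector for target position $j$ is a weighted sum of the encoder outputs at positions $1$ to $m$. A minimal NumPy sketch of that sum follows; the function name `context_vector` and the toy values are illustrative assumptions, not code from the book or this commit.

```python
import numpy as np

def context_vector(alpha_j, h):
    """Return C_j = sum_i alpha_{i,j} * h_i.

    alpha_j: attention weights over the m source positions, shape (m,)
    h:       encoder outputs, one row per source position, shape (m, d)
    (Both names are hypothetical; they mirror the symbols in the figure.)
    """
    alpha_j = np.asarray(alpha_j, dtype=float)
    h = np.asarray(h, dtype=float)
    if alpha_j.shape[0] != h.shape[0]:
        raise ValueError("need one attention weight per encoder position")
    # A (m,) vector times a (m, d) matrix gives the weighted sum of the rows.
    return alpha_j @ h

# Toy example: m = 3 source positions, hidden size d = 2.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha_j = [0.2, 0.3, 0.5]
c_j = context_vector(alpha_j, h)
```

The sketch takes the weights as given; how they are computed and normalised is outside what this hunk shows.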
--- a/Chapter10/Figures/figure-example-of-context-vector-calculation-process.tex
+++ b/Chapter10/Figures/figure-example-of-context-vector-calculation-process.tex
@@ -38,14 +38,14 @@
    \node[elementnode,minimum size=0.6*\hnode*\c,inner sep=0.1pt,fill=blue] (a\i\j) at (0.5*\hnode*\i-5.4*0.5*\hnode,0.5*\hnode*\j-1.05*\hnode) {};
 %attention score labels
-\node[align=center] (l17) at (a17) {\scriptsize{{\color{white} .4}}};
+\node[align=center,scale=0.8] (l17) at (a17) {\scriptsize{{\color{white} 0.4}}};
-\node[align=center] (l26) at (a06) {\scriptsize{{\color{white} .3}}};
+\node[align=center,scale=0.7] (l26) at (a06) {\tiny{{\color{white} 0.3}}};
-\node[align=center] (l26) at (a16) {\scriptsize{{\color{white} .4}}};
+\node[align=center,scale=0.8] (l26) at (a16) {\scriptsize{{\color{white} 0.4}}};
-\node[align=center] (l17) at (a35) {\scriptsize{{\color{white} .3}}};
+\node[align=center,scale=0.7] (l17) at (a35) {\tiny{{\color{white} 0.3}}};
-\node[align=center] (l17) at (a34) {\tiny{{\color{white} .3}}};
+\node[align=center,scale=0.7] (l17) at (a34) {\tiny{{\color{white} 0.3}}};
-\node[align=center] (l17) at (a23) {\small{{\color{white} .8}}};
+\node[align=center] (l17) at (a23) {\small{{\color{white} 0.8}}};
-\node[align=center] (l17) at (a41) {\small{{\color{white} .8}}};
+\node[align=center] (l17) at (a41) {\small{{\color{white} 0.8}}};
-\node[align=center] (l17) at (a50) {\small{{\color{white} .7}}};
+\node[align=center,scale=0.9] (l17) at (a50) {\small{{\color{white} 0.7}}};
 % source
 \node[srcnode] (src1) at (-5.4*0.5*\hnode,-1.05*\hnode+7.5*0.5*\hnode) {\scriptsize{Have}};

--- a/Chapter10/Figures/figure-matrix-representation-of-attention-weights-between-chinese-english-sentence-pairs.tex
+++ b/Chapter10/Figures/figure-matrix-representation-of-attention-weights-between-chinese-english-sentence-pairs.tex
@@ -30,14 +30,14 @@
    \node[elementnode,minimum size=0.6*1.2cm*\c,inner sep=0.1pt,fill=blue] (a\i\j) at (0.5*1.2cm*\i-5.4*0.5*1.2cm,0.5*1.2cm*\j-1.05*1.2cm) {};
 %attention score labels
-\node[align=center] (l17) at (a17) {\scriptsize{{\color{white} .4}}};
+\node[align=center,scale=0.8] (l17) at (a17) {\scriptsize{{\color{white} 0.4}}};
-\node[align=center] (l26) at (a06) {\scriptsize{{\color{white} .3}}};
+\node[align=center,scale=0.8] (l26) at (a06) {\scriptsize{{\color{white} 0.3}}};
-\node[align=center] (l26) at (a16) {\scriptsize{{\color{white} .4}}};
+\node[align=center,scale=0.8] (l26) at (a16) {\scriptsize{{\color{white} 0.4}}};
-\node[align=center] (l17) at (a35) {\scriptsize{{\color{white} .3}}};
+\node[align=center,scale=0.8] (l17) at (a35) {\scriptsize{{\color{white} 0.3}}};
-\node[align=center] (l17) at (a34) {\tiny{{\color{white} .3}}};
+\node[align=center,scale=0.8] (l17) at (a34) {\scriptsize{{\color{white} 0.3}}};
-\node[align=center] (l17) at (a23) {\small{{\color{white} .8}}};
+\node[align=center] (l17) at (a23) {\small{{\color{white} 0.8}}};
-\node[align=center] (l17) at (a41) {\small{{\color{white} .8}}};
+\node[align=center] (l17) at (a41) {\small{{\color{white} 0.8}}};
-\node[align=center] (l17) at (a50) {\small{{\color{white} .7}}};
+\node[align=center] (l17) at (a50) {\small{{\color{white} 0.7}}};
 % source
 \node[srcnode] (src1) at (-5.4*0.5*1.2cm,-1.05*1.2cm+7.5*0.5*1.2cm) {\scriptsize{Have}};

--- a/Chapter10/chapter10.tex
+++ b/Chapter10/chapter10.tex
@@ -191,7 +191,7 @@ NMT                     & 21.7          & 18.7           & -13.7      \\
 \end{table}
 %----------------------------------------------
-\parinterval  在最近两年，神经机器翻译的发展更加迅速，新的模型及方法层出不穷。表\ref{tab:10-3}给出了到2020年为止一些主流的神经机器翻译模型的对比。可以看到，相比2017年，2018-2020年中机器翻译仍然有明显的进步。
+\parinterval  在最近两年，神经机器翻译的发展更加迅速，新的模型及方法层出不穷。表\ref{tab:10-3}给出了到2020年为止，一些主流的神经机器翻译模型在WMT14英德数据集上的表现。可以看到，相比2017年，2018-2020年中机器翻译仍然有明显的进步。
 \vspace{0.5em}%全局布局使用
 %----------------------------------------------
@@ -556,7 +556,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \end{itemize}
-\parinterval LSTM的完整结构如图\ref{fig:10-12}所示，模型的参数包括：参数矩阵$\mathbi{W}_f$、$\mathbi{W}_i$ 、$\mathbi{W}_c$、\\$\mathbi{W}_o$和偏置$\mathbi{b}_f$、$\mathbi{b}_i$、$\mathbi{b}_c$、$\mathbi{b}_o$。可以看出，$\mathbi{h}_t$是由$\mathbi{c}_{t-1}$、$\mathbi{h}_{t-1}$与$\mathbi{x}_t$共同决定的。此外，上述公式中激活函数的选择是根据函数各自的特点决定的。
+\parinterval LSTM的完整结构如图\ref{fig:10-12}所示，模型的参数包括：参数矩阵$\mathbi{W}_f$、$\mathbi{W}_i$ 、$\mathbi{W}_c$、\\$\mathbi{W}_o$和偏置$\mathbi{b}_f$、$\mathbi{b}_i$、$\mathbi{b}_c$、$\mathbi{b}_o$。可以看出，$\mathbi{h}_t$是由$\mathbi{c}_{t-1}$、$\mathbi{h}_{t-1}$与$\mathbi{x}_t$共同决定的。此外，本节公式中激活函数的选择是根据函数各自的特点决定的。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -614,7 +614,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \subsection{双向模型}
-\parinterval 前面提到的循环神经网络都是自左向右运行的，也就是说在处理一个单词的时候只能访问它前面的序列信息。但是，只根据句子的前文来生成一个序列的表示是不全面的，因为从最后一个词来看，第一个词的信息可能已经很微弱了。为了同时考虑前文和后文的信息，一种解决办法是使用双向循环网络，其结构如图\ref{fig:10-14}所示。这里，编码器可以看作有两个循环神经网络，第一个网络，即红色虚线框里的网络，从句子的右边进行处理，第二个网络从句子左边开始处理，最终将正向和反向得到的结果都融合后传递给解码器。
+\parinterval 前面提到的循环神经网络都是自左向右运行的，也就是说在处理一个单词的时候只能访问它前面的序列信息。但是，只根据句子的前文来生成一个序列的表示是不全面的，因为从最后一个词来看，第一个词的信息可能已经很微弱了。为了同时考虑前文和后文的信息，一种解决办法是使用双向循环网络，其结构如图\ref{fig:10-14}所示。这里，编码器可以看作由两个循环神经网络构成，第一个网络，即红色虚线框里的网络，从句子的右边进行处理，第二个网络从句子左边开始处理，最终将正向和反向得到的结果都融合后传递给解码器。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -898,7 +898,7 @@ a (\mathbi{s},\mathbi{h}) &=&  \left\{ \begin{array}{ll}
 \caption{GNMT与其他翻译模型对比\upcite{Wu2016GooglesNM}}
 \label{tab:10-8}
 \begin{tabular}{l l l}
-\multicolumn{1}{l|}{\multirow{3}{*}{\#}} & \multicolumn{2}{c}{BLEU[\%]} \\
+\multicolumn{1}{l|}{\multirow{3}{*}{}} & \multicolumn{2}{c}{BLEU[\%]} \\
 \multicolumn{1}{l|}{}                    & 英德  & 英法                                               \\
 \multicolumn{1}{l|}{}                    & EN-DE  & EN-FR                                               \\ \hline
 \multicolumn{1}{l|}{PBMT}                & 20.7            & 37.0            \\
@@ -953,7 +953,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) &=& \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j
 \parinterval 公式\eqref{eq:10-26}是一种非常通用的损失函数形式，除了交叉熵，也可以使用其他的损失函数，这时只需要替换$L_{\textrm{ce}} (\cdot)$即可。这里使用交叉熵损失函数的好处在于，它非常容易优化，特别是与Softmax组合，其反向传播的实现非常高效。此外，交叉熵损失（在一定条件下）也对应了极大似然的思想，这种方法在自然语言处理中已经被证明是非常有效的。
-\parinterval 除了交叉熵，很多系统也使用了面向评价的损失函数，比如，直接利用评价指标BLEU定义损失函数\upcite{DBLP:conf/acl/ShenCHHWSL16}。不过这类损失函数往往不可微分，因此无法直接获取梯度。这时可以引入强化学习技术，通过策略梯度等方法进行优化。不过这类方法需要采样等手段，这里不做重点讨论，相关内容会在后面技术部分进行介绍。
+\parinterval 除了交叉熵，很多系统也使用了面向评价的损失函数，比如，直接利用评价指标BLEU定义损失函数\upcite{DBLP:conf/acl/ShenCHHWSL16}。不过这类损失函数往往不可微分，因此无法直接获取梯度。这时可以引入强化学习技术，通过策略梯度等方法进行优化。不过这类方法需要采样等手段，这里不做重点讨论，相关内容会在{\chapterthirteen}进行介绍。
 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -988,9 +988,9 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) &=& \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j
 \subsubsection{3. 优化策略}
 %\vspace{0.5em}
-\parinterval 公式\eqref{eq:10-24}展示了最基本的优化策略，也被称为标准的SGD优化器。实际上，训练神经机器翻译模型时，还有非常多的优化器可以选择，在{\chapternine}也有详细介绍，这里考虑Adam优化器\upcite{kingma2014adam}。 Adam 通过对梯度的{\small\bfnew{一阶矩估计}}\index{一阶矩估计}（First Moment Estimation）\index{First Moment Estimation}和{\small\bfnew{二阶矩估计}}\index{二阶矩估计}（Second Moment Estimation）\index{Second Moment Estimation}进行综合考虑，计算出更新步长。
+\parinterval 公式\eqref{eq:10-24}展示了最基本的优化策略，也被称为标准的SGD优化器。实际上，训练神经机器翻译模型时，还有非常多的优化器可以选择，在{\chapternine}也有详细介绍，本章介绍的循环神经网络考虑使用Adam优化器\upcite{kingma2014adam}。 Adam 通过对梯度的{\small\bfnew{一阶矩估计}}\index{一阶矩估计}（First Moment Estimation）\index{First Moment Estimation}和{\small\bfnew{二阶矩估计}}\index{二阶矩估计}（Second Moment Estimation）\index{Second Moment Estimation}进行综合考虑，计算出更新步长。
-\parinterval 通常，Adam收敛地比较快，不同任务基本上可以使用一套配置进行优化，虽性能不算差，但很难达到最优效果。相反，SGD虽能通过在不同的数据集上进行调整，来达到最优的结果，但是收敛速度慢。因此需要根据不同的需求来选择合适的优化器。若需要快得到模型的初步结果，选择Adam较为合适，若是需要在一个任务上得到最优的结果，选择SGD更为合适。
+\parinterval 通常，Adam收敛地比较快，不同任务基本上可以使用一套配置进行优化，虽性能不算差，但很难达到最优效果。相反，SGD虽能通过在不同的数据集上进行调整，来达到最优的结果，但是收敛速度慢。因此需要根据不同的需求来选择合适的优化器。若需要快速得到模型的初步结果，选择Adam较为合适，若是需要在一个任务上得到最优的结果，选择SGD更为合适。
 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
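The paragraph edited in the hunk above says that Adam combines first and second moment estimates of the gradient to compute the update step. As a rough illustration, the sketch below performs one scalar Adam update following the commonly cited formulation of Kingma and Ba; the function name `adam_step` and the default hyperparameters are assumptions for illustration and are not taken from the book.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (illustrative sketch).

    m, v: running first and second moment estimates of the gradient
    t:    1-based update counter, used for bias correction
    """
    m = beta1 * m + (1 - beta1) * grad         # first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad  # second moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)               # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First update from zero-initialised moments with gradient 1.0.
theta, m, v = adam_step(theta=0.0, grad=1.0, m=0.0, v=0.0, t=1, lr=0.1)
```

Because of the bias correction, the very first update has magnitude close to `lr` regardless of the gradient scale, which is one reason a single configuration tends to transfer across tasks.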
@@ -1039,7 +1039,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) &=& \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j
 \end{eqnarray}
 %-------
-\noindent 另一方面，当模型训练逐渐接近收敛的时候，使用太大学习率会很容易让模型在局部最优解附近震荡，从而错过局部极小，因此需要通过减小学习率来调整更新的步长，以此来不断地逼近局部最优，这一阶段也称为学习率的衰减阶段。学习率衰减的方法有很多，比如指数衰减以及余弦衰减等，图\ref{fig:10-26}右侧展示的是{\small\bfnew{分段常数衰减}}\index{分段常数衰减}（Piecewise Constant Decay）\index{Piecewise Constant Decay}，即每经过$m$次更新，学习率衰减为原来的$\beta_m$（$\beta_m<1$）倍，其中$m$和$\beta_m$为经验设置的超参。
+\noindent 另一方面，当模型训练逐渐接近收敛的时候，使用太大学习率会很容易让模型在局部最优解附近震荡，从而错过局部极小，因此需要通过减小学习率来调整更新的步长，以此来不断地逼近局部最优，这一阶段也称为学习率的衰减阶段。学习率衰减的方法有很多，比如指数衰减以及余弦衰减等，图\ref{fig:10-26}右侧下降部分的曲线展示了{\small\bfnew{分段常数衰减}}\index{分段常数衰减}（Piecewise Constant Decay）\index{Piecewise Constant Decay}，即每经过$m$次更新，学习率衰减为原来的$\beta_m$（$\beta_m<1$）倍，其中$m$和$\beta_m$为经验设置的超参。
 %----------------------------------------------
 \begin{figure}[htp]
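The sentence changed in the hunk above describes piecewise constant decay: after every $m$ updates the learning rate is multiplied by $\beta_m$ ($\beta_m<1$). A minimal sketch of that schedule is given below; the function name `piecewise_constant_lr` and the example values are hypothetical, and any warmup phase that precedes the decay is deliberately left out.

```python
def piecewise_constant_lr(step, base_lr, m, beta_m):
    """Learning rate at update `step` (0-based) under piecewise constant decay.

    The rate stays constant within each block of m updates and is multiplied
    by beta_m (0 < beta_m < 1) each time a block of m updates is completed.
    """
    if m <= 0:
        raise ValueError("m must be a positive number of updates")
    if not 0 < beta_m < 1:
        raise ValueError("beta_m must lie strictly between 0 and 1")
    return base_lr * beta_m ** (step // m)

# Example: halve the learning rate every 100 updates.
schedule = [piecewise_constant_lr(s, base_lr=1.0, m=100, beta_m=0.5)
            for s in (0, 99, 100, 250)]
```

Both `m` and `beta_m` are empirically chosen hyperparameters, as the text notes, so the values above are placeholders only.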
@@ -1110,9 +1110,9 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) &=& \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j
 \begin{figure}[htp]
 \centering
 \begin{tabular}{l l}
-\subfigure[{\small }]{\input{./Chapter10/Figures/figure-process01}} &\subfigure[{\small }]{\input{./Chapter10/Figures/figure-process02}} \\
+\subfigure[{\small 第一步\qquad }]{\input{./Chapter10/Figures/figure-process01}} &\subfigure[{\small 第二步\qquad }]{\input{./Chapter10/Figures/figure-process02}} \\
-\subfigure[{\small }]{\input{./Chapter10/Figures/figure-process03}}  &\subfigure[{\small }]{\input{./Chapter10/Figures/figure-process04}} \\
+\subfigure[{\small 第三步}]{\input{./Chapter10/Figures/figure-process03}}  &\subfigure[{\small 第四步}]{\input{./Chapter10/Figures/figure-process04}} \\
-\subfigure[{\small }]{\input{./Chapter10/Figures/figure-process05}}  &\subfigure[{\small }]{\input{./Chapter10/Figures/figure-process06}}
+\subfigure[{\small 第五步\qquad }]{\input{./Chapter10/Figures/figure-process05}}  &\subfigure[{\small 第六步\qquad }]{\input{./Chapter10/Figures/figure-process06}}
 \end{tabular}
 \caption{一个三层循环神经网络的模型并行过程}
 \label{fig:10-28}