Merge branch 'shanweiqiao' into 'caorunzhe'

Detail fixes for chapters 1, 2, 13, and 15 (feedback from the publisher and junior students). See merge request !1118

Commit db8a6953 authored Aug 12, 2021 by 单韦乔
--- a/Chapter1/Figures/figure-example-rbmt.tex
+++ b/Chapter1/Figures/figure-example-rbmt.tex
@@ -17,7 +17,7 @@
 \node [anchor=north west] (rule4part2) at ([yshift=0.5em]rule4.south west) {\textbf{\hspace{0.95em} then} 调序[动词 + 对象]};
 \node [anchor=north west] (rule5) at ([yshift=0.1em]rule4part2.south west) {\textbf{5: If} 译文主语是\ I};
 \node [anchor=north west] (rule5part2) at ([yshift=0.5em]rule5.south west) {\textbf{\hspace{0.95em} then} be动词为\ am/was};
-\node [anchor=north west] (rule6) at ([yshift=0.1em]rule5part2.south west) {\textbf{6: If} 源语是主谓结构};
+\node [anchor=north west] (rule6) at ([yshift=0.1em]rule5part2.south west) {\textbf{6: If} 源语言是主谓结构};
 \node [anchor=north west] (rule6part2) at ([yshift=0.5em]rule6.south west) {\textbf{\hspace{0.95em} then} 译文为主谓结构};
 \node [anchor=south west] (rulebaselabel) at (rule1.north west) {{\color{ublue} 资源：规则库}};
 }

--- a/Chapter1/chapter1.tex
+++ b/Chapter1/chapter1.tex
@@ -267,7 +267,7 @@
 \end{figure}
 %-------------------------------------------
-\parinterval 图\ref{fig:1-8}展示了一个使用转换法进行翻译的实例。这里，利用一个简单的汉译英规则库完成对句子“我对你感到满意”的翻译。当翻译“我”时，从规则库中找到规则1，该规则表示遇到单词“我”就翻译为“I”；类似地，也可以从规则库中找到规则4，该规则表示翻译调序，即将单词“you”放到“be satisfied with”后面。这种通过规则表示单词之间对应关系的方式，也为统计机器翻译方法提供了思路。如统计机器翻译中，基于短语的翻译模型使用短语对对原文进行替换，详细描述可以参考{\chapterseven}。
+\parinterval 图\ref{fig:1-8}展示了一个使用转换法进行翻译的实例。这里，利用一个简单的汉译英规则库完成对句子“我对你感到满意”的翻译。当翻译“我”时，从规则库中找到规则1，该规则表示遇到单词“我”就翻译为“I”；类似地，也可以从规则库中找到规则4，该规则表示翻译调序，即将单词“you”放到“be satisfied with”后面。这种通过规则表示单词之间对应关系的方式，也为统计机器翻译方法提供了思路。如统计机器翻译中，基于短语的翻译模型使用短语对对源语言进行替换，详细描述可以参考{\chapterseven}。
 \parinterval 在上述例子中可以发现，规则不仅仅可以翻译句子之间单词的对应，如规则1，还可以表示句法甚至语法之间的对应，如规则6。因此基于规则的方法可以分成多个层次，如图\ref{fig:1-9}所示。图中不同的层次表示采用不同的知识来书写规则，进而完成机器翻译过程。对于翻译问题，可以构建不同层次的基于规则的机器翻译系统。这里包括四个层次，分别为：词汇转换、句法转换、语义转换和中间语言层。其中，上层可以继承下层的翻译知识，比如说句法转换层会利用词汇转换层知识。早期基于规则的方法属于词汇转换层。

--- a/Chapter13/Figures/figure-reinforcement-learning-method-based-on-actor-critic.tex
+++ b/Chapter13/Figures/figure-reinforcement-learning-method-based-on-actor-critic.tex
@@ -24,8 +24,8 @@
 %\draw [->,dotted,very thick] ([xshift=0em,yshift=0em]n1.east)  .. controls ([xshift=3em,yshift=-1em]n1.-90) and ([xshift=-3em,yshift=-1em]n2.-90) .. (n2.west);
-	\node[anchor=west,inner sep=0mm] (n3) at ([xshift=4.1em,yshift=1em]n1.east) {$Q_1,Q_2,\ldots,Q_J$};
+	\node[anchor=west,inner sep=0mm] (n3) at ([xshift=4.1em,yshift=1.2em]n1.east) {$Q_1,Q_2,\ldots,Q_J$};
-	\node[anchor=west,inner sep=0mm] (n4) at ([xshift=4.9em,yshift=-1em]n1.east) {$\tilde{{y}}_1,\tilde{{y}}_2,\ldots,\tilde{{y}}_J$};
+	\node[anchor=west,inner sep=0mm] (n4) at ([xshift=4.9em,yshift=-1.2em]n1.east) {$\tilde{{y}}_1,\tilde{{y}}_2,\ldots,\tilde{{y}}_J$};
 \draw [->,thick] ([xshift=-0.1em,yshift=0.6em]n2.west) -- ([xshift=0.1em,yshift=0.6em]n1.east);
 \draw [->,thick] ([xshift=0.1em,yshift=-0.6em]n1.east) -- ([xshift=-0.1em,yshift=-0.6em]n2.west);

--- a/Chapter15/Figures/figure-relative-position-coding-and-absolute-position-coding.tex
+++ b/Chapter15/Figures/figure-relative-position-coding-and-absolute-position-coding.tex
@@ -105,15 +105,15 @@
 	\node [rectangle,inner sep=0.3em,rounded corners=5pt,very thick,dotted,draw=ublue,minimum height=1.4em,minimum width=7em] [fit = (l2) (sa2) (res4) (l5) (set2)] (b3) {};	
 \end{pgfonlayer}
-\node [inputnode,anchor=north west] (input1) at ([yshift=-1.6em,xshift=-0.5em]sa1.south west) {\tiny{Embedding}};
+\node [inputnode,anchor=north west] (input1) at ([yshift=-1.6em,xshift=-0.5em]sa1.south west) {\tiny{$\textbf{Embedding}$}};
 \node [] (add) at ([yshift=-2.2em,xshift=3.5em]sa1.south west) {$+$};
-\node [posnode,anchor=north east] (pos1) at ([yshift=-1.6em,xshift=1.5em]sa1.south east) {\tiny{Absolute Position}};
+\node [posnode,anchor=north east] (pos1) at ([yshift=-1.6em,xshift=1.5em]sa1.south east) {\tiny{$\textbf{Absolute Position}$}};
 \node [anchor=north] (wi) at ([yshift=-0.5em]pos1.south) {\scriptsize{词序信息}};
-\node [posnode,anchor=west,font=\tiny,align=center] (pos2) at ([yshift=0em,xshift=1em]pos1.east) {Relative \\ Position 1};
+\node [posnode,anchor=west,font=\tiny,align=center] (pos2) at ([yshift=0em,xshift=1em]pos1.east) {$\textbf{Relative}$ \\ $\textbf{Position 1}$};
 \node [posnode,anchor=west,font=\tiny,align=center,minimum width=1em] (pos3) at ([yshift=0em,xshift=1em]pos2.east) {$\cdots$};
-\node [posnode,anchor=west,font=\tiny,align=center] (pos4) at ([yshift=0em,xshift=1em]pos3.east) {Relative \\ Position $n$};
+\node [posnode,anchor=west,font=\tiny,align=center] (pos4) at ([yshift=0em,xshift=1em]pos3.east) {$\textbf{Relative}$ \\ $\textbf{Position n}$};
 \draw [->] (wi.north) -- (pos1.south);
 \draw [->] (add.north) -- (sa1.south);

--- a/Chapter15/chapter15.tex
+++ b/Chapter15/chapter15.tex
@@ -182,7 +182,7 @@ A_{ij}^{\rm rel} &=& \underbrace{\mathbi{E}_{x_i}\mathbi{W}_Q\mathbi{W}_{K}^{\te
 \noindent 具体的形式如下：
 \begin{eqnarray}
-\mathbi{e}_{ij} &=& \frac{(\mathbi{x}_i \mathbi{W}_Q){(\mathbi{x}_j \mathbi{W}_K)}^{\textrm{T}}}{\sqrt{d_k}} + G_{ij}
+e_{ij} &=& \frac{(\mathbi{x}_i \mathbi{W}_Q){(\mathbi{x}_j \mathbi{W}_K)}^{\textrm{T}}}{\sqrt{d_k}} + G_{ij}
 \label{eq:15-15}
 \end{eqnarray}
@@ -221,7 +221,7 @@ v_i &=& \mathbi{I}_d^{\textrm{T}}\textrm{Tanh}(\mathbi{W}_d\mathbi{Q}_i)
 \noindent 于是，在计算第$i$个词对第$j$个词的相关系数时，通过超参数$\omega$控制实际的感受野为$j-\omega,\ldots,j+\omega$，注意力计算中$\mathbi{e}_{ij}$的计算方式与公式\eqref{eq:15-6}相同，权重$\alpha_{ij}$的具体计算公式为：
 \begin{eqnarray}
-\alpha_{ij} &=& \frac{\exp (\mathbi{e}_{ij})}{\sum_{k=j-\omega}^{j+\omega}\exp (\mathbi{e}_{ik})}
+\alpha_{ij} &=& \frac{\exp (e_{ij})}{\sum_{k=j-\omega}^{j+\omega}\exp (e_{ik})}
 \label{eq:15-20}
 \end{eqnarray}
@@ -687,9 +687,9 @@ v_i &=& \mathbi{I}_d^{\textrm{T}}\textrm{Tanh}(\mathbi{W}_d\mathbi{Q}_i)
 \vspace{0.5em}
 \item 类似于标准的Transformer初始化方式，使用Xavier初始化方式来初始化除了词嵌入以外的所有参数矩阵。词嵌入矩阵服从$\mathbb{N}(0,d^{-\frac{1}{2}})$的高斯分布，其中$d$代表词嵌入的维度。
 \vspace{0.5em}
-\item 对编码器中部分自注意力机制的参数矩阵以及前馈神经网络的参数矩阵进行缩放因子为$0.67 {L}^{-\frac{1}{4}}$的缩放，$L$为编码器层数。
+\item 对编码器中部分自注意力机制的参数矩阵以及前馈神经网络的参数矩阵进行缩放因子为$0.67 {L}^{-\frac{1}{4}}$的缩放，对编码器中词嵌入的参数矩阵进行缩放因子为$(9 {L})^{-\frac{1}{4}}$的缩放，其中$L$为编码器的层数。
 \vspace{0.5em}
-\item 对解码器中部分注意力机制的参数矩阵、前馈神经网络的参数矩阵以及前馈神经网络的嵌入式输入进行缩放因子为$(9 {M})^{-\frac{1}{4}}$的缩放，其中$M$为解码器层数。
+\item 对解码器中部分注意力机制的参数矩阵、前馈神经网络的参数矩阵以及解码器词嵌入的参数矩阵进行缩放因子为$(9 {M})^{-\frac{1}{4}}$的缩放，其中$M$为解码器的层数。
 \vspace{0.5em}
 \end{itemize}
@@ -703,7 +703,7 @@ v_i &=& \mathbi{I}_d^{\textrm{T}}\textrm{Tanh}(\mathbi{W}_d\mathbi{Q}_i)
 \parinterval 也有研究发现Post-Norm结构在训练过程中过度依赖残差支路，在训练初期很容易发生参数梯度方差过大的现象\upcite{DBLP:conf/emnlp/LiuLGCH20}。经过分析发现，虽然底层神经网络发生梯度消失是导致训练不稳定的重要因素，但并不是唯一因素。例如，标准Transformer模型中梯度消失的原因在于使用了Post-Norm结构的解码器。尽管通过调整模型结构解决了梯度消失问题，但是模型训练不稳定的问题仍然没有被很好地解决。研究人员观测到Post-Norm结构在训练过程中过于依赖残差支路，而Pre-Norm结构在训练过程中逐渐呈现出对残差支路的依赖性，这更易于网络的训练。进一步，从参数更新的角度出发，Pre-Norm由于参数的改变导致网络输出变化的方差经推导后可以表示为$O(\log L)$，而Post-Norm对应的方差为$O(L)$。因此，可以尝试减小Post-Norm中由于参数更新导致的输出的方差值，从而达到稳定训练的目的。针对该问题，可以采用两阶段的初始化方法。这里，可以重新定义子层之间的残差连接如下：
 \begin{eqnarray}
-\mathbi{x}_{l+1} &=& \mathbi{x}_l \cdot {\bm  \omega_{l+1}} + F_{l+1}(\mathbi{x}_l)
+\mathbi{x}_{l+1} &=& \mathbi{x}_l \odot {\bm  \omega_{l+1}} + F_{l+1}(\mathbi{x}_l)
 \label{eq:15-47}
 \end{eqnarray}

--- a/Chapter2/Figures/figure-word-frequency-distribution.tex
+++ b/Chapter2/Figures/figure-word-frequency-distribution.tex
@@ -4,7 +4,7 @@
  width=13cm,
  height=5.5cm,
  xlabel={WikiText-103上的词表},
-  ylabel={词汇出现总次数},
+  ylabel={单词出现总次数},
  xlabel style={xshift=4.2cm,yshift=0.4cm,font=\footnotesize},
  ylabel style={rotate=-90,yshift=2.8cm,xshift=1.2cm,font=\footnotesize},
  xticklabel style={opacity=0},