Chapter 7: updates after feedback (figures)

Commit 9975534e authored May 13, 2020 by 单韦乔
--- a/Book/Chapter7/Chapter7.tex
+++ b/Book/Chapter7/Chapter7.tex
@@ -1476,7 +1476,16 @@ x_{l+1}=\mathcal{F}(\textrm{LN}(x_l))+x_l
 x_{l+1}=M \cdot \mathcal{F}(\textrm{LN}(x_l))+x_l
 \label{eq:7-25}
 \end{eqnarray}
-$M=0$代表该子层被丢弃，而$M=1$代表正常进行当前子层的计算。图ref{fig:7-34}展示了这个方法与标准Transformer之间的区别。
+$M=0$代表该子层被丢弃，而$M=1$代表正常进行当前子层的计算。图\ref{fig:7-34}展示了这个方法与标准Pre-Norm结构之间的区别。
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter7/Figures/figure-sublayer-skip}
+\caption{标准的Pre-Norm结构与基于随机跳跃子层的Pre-Norm结构}
+\label{fig:7-34}
+\end{figure}
+%-------------------------------------------

 \parinterval 除此之外，有研究者已经发现残差网络中底层的子网络通过对输入进行抽象得到的表示对最终的输出有很大的影响，上层网络通过对底层网络得到的表示不断修正来拟合训练目标\cite{DBLP:journals/corr/GreffSS16}。该结论同样适用于Transformer模型，比如，在训练中，残差支路以及底层的梯度范数通常比较大，这也间接表明底层网络在整个优化的过程中需要更大的更新。考虑到这个因素，在设计每一个子层被丢弃的概率时可以采用自底向上线性增大的策略，保证底层的网络相比于顶层更容易保留下来。这里用$L$来代表编码端块的个数，$l$代表当前的子层的编号，那么$M$可以通过以下的方式得到：
 \begin{eqnarray}
@@ -1493,14 +1502,14 @@ p_l=\frac{l}{2L}\cdot \varphi
 \end{eqnarray}
 这里，$1 \leqslant l \leqslant 2L$ ，且$\varphi$是预先设定的超参数。

-\parinterval 在Layer Dropout中，一个由$2L$个子层构成的残差网络，其顶层的输出相当于是$2^{2L}$个子网络的聚合结果。通过随机丢弃$n$个子层，则会屏蔽掉$2^n$个子网络的输出，将子网络的总体数量降低至$2^{2L-n}$。如图\ref{fig:7-34}所示的残差网络展开图，当有3个子层时，从输入到输出共存在8条路径，当删除子层sublayer2后，从输入到输出路径的路径则会减少到4条。
+\parinterval 在Layer Dropout中，一个由$2L$个子层构成的残差网络，其顶层的输出相当于是$2^{2L}$个子网络的聚合结果。通过随机丢弃$n$个子层，则会屏蔽掉$2^n$个子网络的输出，将子网络的总体数量降低至$2^{2L-n}$。如图\ref{fig:7-35}所示的残差网络展开图，当有3个子层时，从输入到输出共存在8条路径，当删除子层sublayer2后，从输入到输出路径的路径则会减少到4条。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-expanded-residual-network}
 \caption{Layer Dropout中残差网络的展开图}
-\label{fig:7-34}
+\label{fig:7-35}
 \end{figure}
 %-------------------------------------------

@@ -1538,27 +1547,27 @@ p_l=\frac{l}{2L}\cdot \varphi
 \vspace{0.5em}
 \end{itemize}

-\parinterval 使用单语数据构建（双语）伪数据属于后者，它也是一种典型的{\small\bfnew{数据增强}}\index{数据增强}（Data Augmentation）\index{Data Augmentation}方法。一种常用做法是{\small\bfnew{回译}}\index{回译}（Back Translation）\index{Back Translation} \cite{DBLP:conf/acl/SennrichHB16,DBLP:conf/emnlp/EdunovOAG18}：训练一个从目标语翻译到源语的系统，也就是一个反向翻译系统；之后，用这个系统翻译目标语言单语数据；最后将单语数据（目标语言）和翻译的结果（源语言）作为训练数据，送入源语言到目标语言的翻译系统。这种做法不需要更改任何模型结构，就能很好的利用单语数据，因此也被广泛采用。图\ref{fig:7-35}给出了回译方法的一个简要流程。
+\parinterval 使用单语数据构建（双语）伪数据属于后者，它也是一种典型的{\small\bfnew{数据增强}}\index{数据增强}（Data Augmentation）\index{Data Augmentation}方法。一种常用做法是{\small\bfnew{回译}}\index{回译}（Back Translation）\index{Back Translation} \cite{DBLP:conf/acl/SennrichHB16,DBLP:conf/emnlp/EdunovOAG18}：训练一个从目标语翻译到源语的系统，也就是一个反向翻译系统；之后，用这个系统翻译目标语言单语数据；最后将单语数据（目标语言）和翻译的结果（源语言）作为训练数据，送入源语言到目标语言的翻译系统。这种做法不需要更改任何模型结构，就能很好的利用单语数据，因此也被广泛采用。图\ref{fig:7-36}给出了回译方法的一个简要流程。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-application-process-of-back-translation}
 \caption{回译方法的流程}
-\label{fig:7-35}
+\label{fig:7-36}
 \end{figure}
 %-------------------------------------------

 \parinterval 在理想情况下，生成的伪数据和真实数据分布越接近越好。不过，在实践中发现，即使一些简单的策略也能带来性能的增长。比如，在一些低资源的语种，仅仅通过将目标语句子复制到源语端构造的伪数据都能为模型带来增益\cite{DBLP:conf/wmt/CurreyBH17}。相比这些简单的构造策略，利用目标语言单语数据进行回译可以获得更高质量的伪数据。因为目标语是正确的句子，这种方法可以保证译文的流畅度。这也间接达到了对目标语言进行语言建模的目的。在富资源的语种中，通常对回译产生的源语句子添加一些噪音，比如随机删除、替换一些词，或者交换两个词的位置，这样可以为模型提供一些训练噪声。而在低资源的语种上，由于双语数据稀缺，模型需要更多的高质量双语数据，不加噪音反而具有更好的效果。

-\parinterval 回译方法的一个问题是：反向翻译模型的训练只依赖于有限的双语数据，生成的源语言端伪数据的质量难以保证。为此，可以采用{\small\bfnew{迭代式回译}}\index{迭代式回译}（Iterative Back Translation）\index{Iterative Back Translation}的方法，同时利用源语端和目标语端的单语数据，不断通过回译的方式来提升前向和反向翻译模型的性能。图\ref{fig:7-36}展示了迭代式回译的框架。首先使用双语数据训练一个前向翻译模型，然后利用源语言单语数据通过回译的方式来提升反向翻译模型的性能，最后由反向翻译模型和目标端单语数据生成的伪数据来提升前向翻译模型的性能。可以看出，这个往复的过程是闭环的，因此可以一直进行下去，直到两个翻译模型的性能不再提升。
+\parinterval 回译方法的一个问题是：反向翻译模型的训练只依赖于有限的双语数据，生成的源语言端伪数据的质量难以保证。为此，可以采用{\small\bfnew{迭代式回译}}\index{迭代式回译}（Iterative Back Translation）\index{Iterative Back Translation}的方法，同时利用源语端和目标语端的单语数据，不断通过回译的方式来提升前向和反向翻译模型的性能。图\ref{fig:7-37}展示了迭代式回译的框架。首先使用双语数据训练一个前向翻译模型，然后利用源语言单语数据通过回译的方式来提升反向翻译模型的性能，最后由反向翻译模型和目标端单语数据生成的伪数据来提升前向翻译模型的性能。可以看出，这个往复的过程是闭环的，因此可以一直进行下去，直到两个翻译模型的性能不再提升。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-example-of-iterative-back-translation}
 \caption{迭代式回译方法的流程}
-\label{fig:7-36}
+\label{fig:7-37}
 \end{figure}
 %-------------------------------------------

@@ -1572,14 +1581,14 @@ p_l=\frac{l}{2L}\cdot \varphi

 \parinterval 编码器-解码器框架天然就包含了对输入（源语言）和输出（目标语言）进行表示学习的过程。比如，在编码端需要学习一种分布式表示（Distributed Representation）来表示源语言句子的信息，这种分布式表示既包含单词的表示也包括整个序列的表示。因此，可以使用更大规模的源语言单语数据完成编码器的训练。

-\parinterval 实现上述想法的一种手段是{\small\bfnew{预训练}}\index{预训练}（Pre-training）\index{Pre-training}。常用的方法是将机器翻译模型中的一部分（比如，编码器）单独提抽取出来，之后用语言建模等方式在大规模单语数据上进行训练。得到优化后的参数后，将其重新放入神经机器翻译模型中，作为模型的初始值。最后，神经机器翻译模型在双语数据上进行{\small\bfnew{微调}}\index{微调}（Fine-tuning）\index{Fine-tuning}，以得到最终的翻译模型。图\ref{fig:7-37}给出了机器翻译编码器预训练流程的示意图。
+\parinterval 实现上述想法的一种手段是{\small\bfnew{预训练}}\index{预训练}（Pre-training）\index{Pre-training}。常用的方法是将机器翻译模型中的一部分（比如，编码器）单独提抽取出来，之后用语言建模等方式在大规模单语数据上进行训练。得到优化后的参数后，将其重新放入神经机器翻译模型中，作为模型的初始值。最后，神经机器翻译模型在双语数据上进行{\small\bfnew{微调}}\index{微调}（Fine-tuning）\index{Fine-tuning}，以得到最终的翻译模型。图\ref{fig:7-38}给出了机器翻译编码器预训练流程的示意图。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-encoder-fin}
 \caption{机器翻译编码器预训练流程}
-\label{fig:7-37}
+\label{fig:7-38}
 \end{figure}
 %-------------------------------------------

@@ -1604,11 +1613,11 @@ p_l=\frac{l}{2L}\cdot \varphi
 \centering
 \input{./Chapter7/Figures/figure-MASS}
 \caption{MASS 预训练方法}
-\label{fig:7-38}
+\label{fig:7-39}
 \end{figure}
 %-------------------------------------------

-\parinterval 以MASS方法为例\cite{song2019mass}，可以直接对整个编码器-解码器的结构进行预训练。训练中采用掩码的方式，将源语词序列中的片段替换成特殊词<mask>，然后在解码器端预测这个未知片段，如图\ref{fig:7-38}所示，\#号表示特殊词<mask>。这种做法可以使得编码器端捕捉上下文信息，同时迫使解码器依赖于编码器，学习编码器和解码器之间的注意力进行预训练。而解码器端片段的预测也使得解码器能够学习到向前依赖的上下文表示。
+\parinterval 以MASS方法为例\cite{song2019mass}，可以直接对整个编码器-解码器的结构进行预训练。训练中采用掩码的方式，将源语词序列中的片段替换成特殊词<mask>，然后在解码器端预测这个未知片段，如图\ref{fig:7-39}所示，\#号表示特殊词<mask>。这种做法可以使得编码器端捕捉上下文信息，同时迫使解码器依赖于编码器，学习编码器和解码器之间的注意力进行预训练。而解码器端片段的预测也使得解码器能够学习到向前依赖的上下文表示。

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -1618,14 +1627,14 @@ p_l=\frac{l}{2L}\cdot \varphi

 \parinterval {\small\bfnew{多任务学习}}\index{多任务学习}（Multitask Learning）\index{Multitask Learning}是机器学习的一个子领域，是指同时学习多个独立但是相关的任务\cite{DBLP:journals/corr/Ruder17a}。多任务学习通过模型共享的方式，对多个模型进行学习，而这些模型都对应不同的任务，这样不同模型可以互相``促进''。在神经机器翻译中，为了使用单语数据，可以将翻译任务作为主任务，同时设置一些仅使用单语数据的子任务，通过这些子任务来捕捉单语数据中的语言知识\cite{DBLP:conf/emnlp/DomhanH17}。

-\parinterval 语言模型是使用目标端单语数据最直接的方式，但是翻译模型作为一个受限的语言模型，还需要依赖于源语，并不能直接进行多任务学习。针对这个问题，对原有翻译模型结构进行了修改，在解码器中增加了一个语言模型子层，将这个子层用于语言模型任务（图\ref{fig:7-39}）。在训练过程中，分别将双语数据和单语数据送入翻译模型和语言模型进行计算，得到的损失相加用于整体模型参数的梯度计算和参数更新，其中语言模型的参数是翻译模型的一部分。
+\parinterval 语言模型是使用目标端单语数据最直接的方式，但是翻译模型作为一个受限的语言模型，还需要依赖于源语，并不能直接进行多任务学习。针对这个问题，对原有翻译模型结构进行了修改，在解码器中增加了一个语言模型子层，将这个子层用于语言模型任务（图\ref{fig:7-40}）。在训练过程中，分别将双语数据和单语数据送入翻译模型和语言模型进行计算，得到的损失相加用于整体模型参数的梯度计算和参数更新，其中语言模型的参数是翻译模型的一部分。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-target-side-multi-task-learning}
 \caption{机器翻译中的单任务学习和多任务学习}
-\label{fig:7-39}
+\label{fig:7-40}
 \end{figure}
 %-------------------------------------------

@@ -1698,7 +1707,7 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
 \label{eq:7-30}
 \end{eqnarray}

-这样的损失函数带来最直接的好处是，知识精炼的流程会非常简单。因为只需要利用教师模型将训练数据（源语言）翻译一遍，之后把它的输出替换为训练数据的目标语言部分。之后，利用得到的新的双语数据训练学生模型即可，图\ref{fig:7-40}展示了简化后词级和序列级的不同，其中词级知识精炼的解码端输入为真实双语数据的目标语言，并以teacher模型输出的概率分布作为学习目标，而序列级则直接将teacher推断后得到的结果作为解码端的输入，并将解码结果的One-hot向量作为学习目标。
+这样的损失函数带来最直接的好处是，知识精炼的流程会非常简单。因为只需要利用教师模型将训练数据（源语言）翻译一遍，之后把它的输出替换为训练数据的目标语言部分。之后，利用得到的新的双语数据训练学生模型即可，图\ref{fig:7-41}展示了简化后词级和序列级的不同，其中词级知识精炼的解码端输入为真实双语数据的目标语言，并以teacher模型输出的概率分布作为学习目标，而序列级则直接将teacher推断后得到的结果作为解码端的输入，并将解码结果的One-hot向量作为学习目标。
 \vspace{0.5em}
 \end{itemize}

@@ -1707,7 +1716,7 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
 \centering
 \input{./Chapter7/Figures/figure-difference-between-word-level-and-sequence-level-in-knowledge-distillation}
 \caption{词级和序列级知识精炼的差异}
-\label{fig:7-40}
+\label{fig:7-41}
 \end{figure}
 %-------------------------------------------

@@ -1734,14 +1743,14 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
 \vspace{0.5em}
 \end{itemize}

-\parinterval 此外还可以采用迭代的知识精炼的方式。首先，通过模型集成得到较强的教师模型，再将知识迁移到不同的学生模型上，随后继续使用这些学生模型集成新的教师模型。不断的重复上述过程可以逐步提升集成模型的性能，如图\ref{fig:7-41}所示。值得注意的是，随着迭代次数的增加，集成所带来的收益也会随着子模型之间差异性的减小而减少。
+\parinterval 此外还可以采用迭代的知识精炼的方式。首先，通过模型集成得到较强的教师模型，再将知识迁移到不同的学生模型上，随后继续使用这些学生模型集成新的教师模型。不断的重复上述过程可以逐步提升集成模型的性能，如图\ref{fig:7-42}所示。值得注意的是，随着迭代次数的增加，集成所带来的收益也会随着子模型之间差异性的减小而减少。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-ensemble-knowledge-distillation}
 \caption{迭代式知识精炼}
-\label{fig:7-41}
+\label{fig:7-42}
 \end{figure}
 %-------------------------------------------

@@ -1799,14 +1808,14 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
 \label{eq:7-34}
 \end{eqnarray}

-\noindent  公式\ref{eq:7-34}假设$\textrm{P}(\mathbf s|\mathbf t)=\textrm{P}(\mathbf s|\mathbf s,\mathbf t)$。这个假设显然是成立的，因为当知道一个句子的译文时，并不需要知道它的源文就可以把它翻译回去。如果直接优化（最大化）公式\ref{eq:7-34}右侧，相当于对这个等式$\textrm{P}(\mathbf s|\mathbf t)$和$\textrm{P}(\mathbf t|\mathbf s)$施加了{\small\bfnew{循环一致性}}\index{循环一致性}（Circle Consistency）\index{Circle Consistency}的约束\cite{DBLP:conf/iccv/ZhuPIE17}，也就是对于一个句子$\mathbf s$，通过$\textrm{P}(\mathbf t|\mathbf s)$把它翻译成$\mathbf t$后，根据$\textrm{P}(\mathbf s|\mathbf t)$应该能重新翻译出$\mathbf s$，如图\ref{fig:7-42}所示。公式\ref{eq:7-34}给出了同时优化$\textrm{P}(\mathbf s|\mathbf t)$和$\textrm{P}(\mathbf t|\mathbf s)$的一个目标函数形式。这个目标函数的一个额外的好处是它本质上是在学习一个由$\textrm{P}(\mathbf s|\mathbf t)$和$\textrm{P}(\mathbf t|\mathbf s)$组成的语言模型$\textrm{P}(\mathbf s)$，而$\textrm{P}(\mathbf s)$的学习依赖于单语数据，这意味着这个目标函数可以很自然地直接使用大量单语数据来同时训练两个翻译模型。相同的结论可以推广到$\textrm{P}(\mathbf t)$上\cite{DBLP:conf/nips/HeXQWYLM16}。
+\noindent  公式\ref{eq:7-34}假设$\textrm{P}(\mathbf s|\mathbf t)=\textrm{P}(\mathbf s|\mathbf s,\mathbf t)$。这个假设显然是成立的，因为当知道一个句子的译文时，并不需要知道它的源文就可以把它翻译回去。如果直接优化（最大化）公式\ref{eq:7-34}右侧，相当于对这个等式$\textrm{P}(\mathbf s|\mathbf t)$和$\textrm{P}(\mathbf t|\mathbf s)$施加了{\small\bfnew{循环一致性}}\index{循环一致性}（Circle Consistency）\index{Circle Consistency}的约束\cite{DBLP:conf/iccv/ZhuPIE17}，也就是对于一个句子$\mathbf s$，通过$\textrm{P}(\mathbf t|\mathbf s)$把它翻译成$\mathbf t$后，根据$\textrm{P}(\mathbf s|\mathbf t)$应该能重新翻译出$\mathbf s$，如图\ref{fig:7-43}所示。公式\ref{eq:7-34}给出了同时优化$\textrm{P}(\mathbf s|\mathbf t)$和$\textrm{P}(\mathbf t|\mathbf s)$的一个目标函数形式。这个目标函数的一个额外的好处是它本质上是在学习一个由$\textrm{P}(\mathbf s|\mathbf t)$和$\textrm{P}(\mathbf t|\mathbf s)$组成的语言模型$\textrm{P}(\mathbf s)$，而$\textrm{P}(\mathbf s)$的学习依赖于单语数据，这意味着这个目标函数可以很自然地直接使用大量单语数据来同时训练两个翻译模型。相同的结论可以推广到$\textrm{P}(\mathbf t)$上\cite{DBLP:conf/nips/HeXQWYLM16}。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-cycle-consistency}
 \caption{循环一致性}
-\label{fig:7-42}
+\label{fig:7-43}
 \end{figure}
 %----------------------------------------------


--- a/Book/Chapter7/Figures/figure-post-norm-vs-pre-norm.tex
+++ b/Book/Chapter7/Figures/figure-post-norm-vs-pre-norm.tex
@@ -6,16 +6,16 @@
 \begin{scope}[minimum height = 20pt]

 \node [anchor=east] (x1) at (-0.5em, 0) {$x_l$};
-\node [anchor=west,draw=green,fill=green!20,inner xsep=5pt] (F1) at ([xshift=2em]x1.east){$\mathcal{F}$};
+\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (F1) at ([xshift=2em]x1.east){\small{$\mathcal{F}$}};
 \node [anchor=west,circle,draw,minimum size=1em] (n1) at ([xshift=2em]F1.east) {};
-\node [anchor=west,draw=green,fill=green!20,inner xsep=5pt] (ln1) at ([xshift=2em]n1.east){\textrm{LN}};
-\node [anchor=west] (x2) at ([xshift=2em]ln1.east) {$x_{l+l}$};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (ln1) at ([xshift=2em]n1.east){\small{\textrm{LN}}};
+\node [anchor=west] (x2) at ([xshift=2em]ln1.east) {$x_{l+1}$};

 \node [anchor=north] (x3) at ([yshift=-5em]x1.south) {$x_l$};
-\node [anchor=west,draw=green,fill=green!20,inner xsep=5pt] (F2) at ([xshift=2em]x3.east){$\mathcal{F}$};
-\node [anchor=west,draw=green,fill=green!20,inner xsep=5pt] (ln2) at ([xshift=2em]F2.east){\textrm{LN}};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (F2) at ([xshift=2em]x3.east){\small{\textrm{LN}}};
+\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln2) at ([xshift=2em]F2.east){\small{$\mathcal{F}$}};
 \node [anchor=west,circle,draw,,minimum size=1em] (n2) at ([xshift=2em]ln2.east){};
-\node [anchor=west] (x4) at ([xshift=2em]n2.east) {$x_{l+l}$};
+\node [anchor=west] (x4) at ([xshift=2em]n2.east) {$x_{l+1}$};

 \draw[->, line width=1pt] ([xshift=-0.1em]x1.east)--(F1.west);
 \draw[->, line width=1pt] ([xshift=-0.1em]F1.east)--(n1.west);
@@ -25,8 +25,8 @@
 \draw[->, line width=1pt] ([xshift=-0.1em]F2.east)--(ln2.west);
 \draw[->, line width=1pt] ([xshift=0.1em]ln2.east)--node[above]{$y_l$}(n2.west);
 \draw[->, line width=1pt] (n2.east)--(x4.west);
-\draw[->, line width=1pt] (x1.north) -- ([yshift=1em]x1.north) -- ([yshift=1.4em]n1.north) -- (n1.north);
-\draw[->, line width=1pt] (x3.north) -- ([yshift=1em]x3.north) -- ([yshift=1.4em]n2.north) -- (n2.north);
+\draw[->,rounded corners,line width=1pt] ([yshift=-0.2em]x1.north) -- ([yshift=1em]x1.north) -- ([yshift=1.4em]n1.north) -- (n1.north);
+\draw[->,rounded corners,line width=1pt] ([yshift=-0.2em]x3.north) -- ([yshift=1em]x3.north) -- ([yshift=1.4em]n2.north) -- (n2.north);
 \draw[-] (n1.west)--(n1.east);
 \draw[-] (n1.north)--(n1.south);
 \draw[-] (n2.west)--(n2.east);
@@ -39,8 +39,8 @@
 \node [rectangle,inner sep=0.3em,fill=blue!10] [fit = (x3) (F2) (n2) (ln2) (x4) (k2)] (box1) {};
 \end{pgfonlayer}

-\node [anchor=north] (c1) at (box0.south){\small (a)后作方式的残差连接};
-\node [anchor=north] (c2) at (box1.south){\small (b)前作方式的残差连接};
+\node [anchor=north] (c1) at (box0.south){\footnotesize {(a)后作方式的残差连接}};
+\node [anchor=north] (c2) at (box1.south){\footnotesize {(b)前作方式的残差连接}};
 \end{scope}
 \end{tikzpicture}
 \end{center}
\ No newline at end of file
--- a/Book/Chapter7/Figures/figure-sublayer-skip.tex
+++ b/Book/Chapter7/Figures/figure-sublayer-skip.tex
+%%%------------------------------------------------------------------------------------------------------------
+%%% 标准的Pre-Norm结构与基于随机跳跃子层的Pre-Norm结构
+\begin{center}
+\begin{tikzpicture}
+
+\begin{scope}[minimum height = 20pt]
+
+\node [anchor=east] (x1) at (-0.5em, 0) {$x_l$};
+\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln1) at ([xshift=1em]x1.east){\small{\textrm{LN}}};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f1) at ([xshift=0.6em]ln1.east){\small{$\mathcal{F}$}};
+\node [anchor=west,circle,draw,,minimum size=1em] (n1) at ([xshift=3em]f1.east){};
+\node [anchor=west] (x2) at ([xshift=1em]n1.east) {$x_{l+1}$};
+\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln12) at ([xshift=1em]x2.east){\small{\textrm{LN}}};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f12) at ([xshift=0.6em]ln12.east){\small{$\mathcal{F}$}};
+\node [anchor=west,circle,draw,,minimum size=1em] (n12) at ([xshift=3em]f12.east){};
+\node [anchor=west] (x22) at ([xshift=1em]n12.east) {$x_{l+2}$};
+
+\node [anchor=north] (x3) at ([yshift=-5em]x1.south) {$x_l$};
+\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln2) at ([xshift=1em]x3.east){\small{\textrm{LN}}};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f2) at ([xshift=0.6em]ln2.east){\small{$\mathcal{F}$}};
+\node [anchor=west,minimum size=1em] (p1) at ([xshift=1em]f2.east){};
+\node [anchor=north] (m1) at ([yshift=0.6em]p1.south){\tiny{\red{$M=1$}}};
+\node [anchor=west,circle,draw,,minimum size=1em] (n2) at ([xshift=3em]f2.east){};
+\node [anchor=west] (x4) at ([xshift=1em]n2.east) {$x_{l+1}$};
+\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln22) at ([xshift=1em]x4.east){\small{\textrm{LN}}};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f22) at ([xshift=0.6em]ln22.east){\small{$\mathcal{F}$}};
+\node [anchor=west,minimum size=1em] (p2) at ([xshift=1em]f22.east){};
+\node [anchor=north] (m2) at ([yshift=0.6em]p2.south){\tiny{\red{$M=0$}}};
+\node [anchor=west,circle,draw,,minimum size=1em] (n22) at ([xshift=3em]f22.east){};
+\node [anchor=west] (x42) at ([xshift=1em]n22.east) {$x_{l+2}$};
+
+\draw[->, line width=1pt] ([xshift=-0.1em]x1.east)--(ln1.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]ln1.east)--(f1.west);
+\draw[->, line width=1pt] ([xshift=0.1em]f1.east)--(n1.west);
+\draw[->, line width=1pt] (n1.east)--(x2.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]x3.east)--(ln2.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]ln2.east)--(f2.west);
+\draw[-, line width=1pt] ([xshift=0.1em]f2.east)--(p1.west);
+\draw[*-,red,line width=0.6pt] (p1.west) -- (p1.east);
+\draw[->, line width=1pt] (p1.east)--(n2.west);
+\draw[->, line width=1pt] (n2.east)--(x4.west);
+\draw[->,rounded corners,line width=1pt] ([yshift=-0.2em]x1.north) -- ([yshift=1em]x1.north) -- ([yshift=1.4em]n1.north) -- (n1.north);
+\draw[->,rounded corners,line width=1pt] ([yshift=-0.2em]x3.north) -- ([yshift=1em]x3.north) -- ([yshift=1.4em]n2.north) -- (n2.north);
+\draw[-] (n1.west)--(n1.east);
+\draw[-] (n1.north)--(n1.south);
+\draw[-] (n2.west)--(n2.east);
+\draw[-] (n2.north)--(n2.south);
+
+\draw[->, line width=1pt] ([xshift=-0.1em]x2.east)--(ln12.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]ln12.east)--(f12.west);
+\draw[->, line width=1pt] ([xshift=0.1em]f12.east)--(n12.west);
+\draw[->, line width=1pt] (n12.east)--(x22.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]x4.east)--(ln22.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]ln22.east)--(f22.west);
+\draw[-, line width=1pt] ([xshift=0.1em]f22.east)--(p2.west);
+\draw[*-,red,line width=0.6pt] ([yshift=-0.1em]p2.west) -- (p2.north east);
+\draw[->, line width=1pt] (p2.east)--(n22.west);
+\draw[->, line width=1pt] (n22.east)--(x42.west);
+\draw[->,rounded corners,line width=1pt] ([yshift=-0.2em]x2.north) -- ([yshift=1em]x2.north) -- ([yshift=1.4em]n12.north) -- (n12.north);
+\draw[->,rounded corners,line width=1pt] ([yshift=-0.2em]x4.north) -- ([yshift=1em]x4.north) -- ([yshift=1.4em]n22.north) -- (n22.north);
+\draw[-] (n12.west)--(n12.east);
+\draw[-] (n12.north)--(n12.south);
+\draw[-] (n22.west)--(n22.east);
+\draw[-] (n22.north)--(n22.south);
+
+\node [anchor=south] (k1) at ([yshift=-0.1em]x1.north){};
+\node [anchor=south] (k2) at ([yshift=-0.1em]x3.north){};
+\begin{pgfonlayer}{background}
+\node [rectangle,inner sep=0.3em,fill=orange!10] [fit = (x1) (f1) (n1) (ln1) (x2) (k1) (f12) (n12) (ln12) (x22)] (box0) {};
+\node [rectangle,inner sep=0.3em,fill=blue!10] [fit = (x3) (f2) (n2) (ln2) (x4) (k2) (f22) (n22) (ln22) (x42)] (box1) {};
+\end{pgfonlayer}
+
+\node [anchor=north] (c1) at (box0.south){\footnotesize {(a)标准的Pre-Norm}};
+\node [anchor=north] (c2) at (box1.south){\footnotesize {(b)基于随机子层跳跃的Pre-Norm}};
+\end{scope}
+\end{tikzpicture}
+\end{center}
\ No newline at end of file
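
Note on the sublayer-skipping text touched by this commit: the update rule `x_{l+1} = M * F(LN(x_l)) + x_l` (eq:7-25), the drop probability `p_l = l/(2L) * phi` for `1 <= l <= 2L`, and the path count `2^{2L-n}` can be checked with a small sketch. This is only an illustration, not code from the book. The function names are hypothetical, and the sampling rule (`M = 0` with probability `p_l`, otherwise `M = 1`) is an assumption, since the equation defining `M` falls outside the hunks shown here. `F` and `LN` are passed in as plain callables on lists of floats.

```python
import random


def drop_probability(l, L, phi):
    """p_l = l / (2L) * phi for sublayer index 1 <= l <= 2L (grows bottom-up)."""
    if not 1 <= l <= 2 * L:
        raise ValueError("sublayer index l must satisfy 1 <= l <= 2L")
    return l / (2 * L) * phi


def sample_mask(l, L, phi, rng):
    """Assumed sampling rule: M = 0 (skip) with probability p_l, else M = 1."""
    return 0 if rng.random() < drop_probability(l, L, phi) else 1


def pre_norm_sublayer(x, F, LN, M):
    """One Pre-Norm step with a skip mask: x_{l+1} = M * F(LN(x_l)) + x_l."""
    return [M * f + xi for f, xi in zip(F(LN(x)), x)]


def num_paths(num_sublayers, num_dropped):
    """Paths through the unrolled residual network: 2^(num_sublayers - num_dropped)."""
    return 2 ** (num_sublayers - num_dropped)


if __name__ == "__main__":
    rng = random.Random(0)
    L, phi = 6, 0.4                      # hypothetical settings
    identity = lambda v: list(v)         # stand-in for LN
    double = lambda v: [2.0 * a for a in v]  # stand-in for the sublayer F

    x = [1.0, 2.0]
    for l in range(1, 2 * L + 1):
        M = sample_mask(l, L, phi, rng)
        x = pre_norm_sublayer(x, double, identity, M)
    print(x)
    print(num_paths(3, 0), num_paths(3, 1))
```

With `M = 0` the sublayer reduces to the identity on the residual branch, which matches panel (b) of the new figure. The last line reproduces the 8-path and 4-path counts quoted for the three-sublayer example.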