Merge branch 'master' into 'mengxia'

Master: see merge request !917

Commit 6ccd1667 authored Jan 15, 2021 by 孟霞
--- a/Chapter1/Figures/figure-example-smt.tex
+++ b/Chapter1/Figures/figure-example-smt.tex
@@ -52,7 +52,7 @@
 }

 {
-\begin{scope}[xshift=1.7in]
+\begin{scope}[xshift=2in]
 {\scriptsize
 \node [anchor=north west] (phrase1) at (0,0) {$\textrm{Pr}(\textrm{我} \to \textrm{I}) = 0.7$};
 \node [anchor=north west] (phrase2) at ([yshift=0.1em]phrase1.south west) {$\textrm{Pr}(\textrm{我} \to \textrm{me}) = 0.3$};
@@ -76,7 +76,7 @@
 }

 {
-\begin{scope}[xshift=1.7in,yshift=-1.55in]
+\begin{scope}[xshift=2in,yshift=-1.55in]
 {\scriptsize
 \node [anchor=north west] (ngram1) at (0,0) {$\textrm{Pr}(\textrm{I}) = 0.0001$};
 \node [anchor=north west] (ngram2) at ([yshift=0.0em]ngram1.south west) {$\textrm{Pr}(\textrm{I}\ \textrm{am}) = 0.623$};
@@ -96,14 +96,14 @@
 }

 {
-\draw[->,thick,ublue] (bidata.east)--([xshift=2.2em]bidata.east) node[pos=0.5,above] (simexample) {\color{red}{\scriptsize{\scriptsize{学习}}}};
+\draw[->,thick,ublue] (bidata.east)--([xshift=4.2em]bidata.east) node[pos=0.5,above] (simexample) {\color{red}{\scriptsize{\scriptsize{学习}}}};
 }

 {
-\draw[->,thick,ublue] (monodata.east)--([xshift=2.2em]monodata.east) node[pos=0.5,above] (simexample) {\color{red}{\scriptsize{\scriptsize{学习}}}};
+\draw[->,thick,ublue] (monodata.east)--([xshift=4.2em]monodata.east) node[pos=0.5,above] (simexample) {\color{red}{\scriptsize{\scriptsize{学习}}}};
 }

-\begin{scope}[xshift=3.6in]
+\begin{scope}[xshift=4in]
 {\footnotesize
 {
 \node[anchor=center] (srcsentence) at (0,0) {我 对 你 感到 满意};

--- a/Chapter1/chapter1.tex
+++ b/Chapter1/chapter1.tex
@@ -54,14 +54,6 @@

 \parinterval 从机器翻译系统的组成上来看，通常可以抽象为两个部分，如图\ref{fig:1-2}所示：

-\begin{itemize}
-\vspace{0.5em}
-\item {\small\bfnew{资源}}：如果把机器翻译系统比作一辆汽车，资源就好比是可以使汽车运行的“汽油”，它包括很多内容，如翻译规则、双（单）语数据、知识库等翻译知识，且这些“知识”都是计算机可读的。值得一提的是,如果没有翻译资源的支持，任何机器翻译系统都无法运行起来。
-\vspace{0.5em}
-\item {\small\bfnew{系统}}：机器翻译算法的程序实现被称作系统，也就是机器翻译研究人员开发的软件。无论是翻译规则、翻译模板还是统计模型中的参数都需要通过机器翻译系统进行读取和使用。
-\vspace{0.5em}
-\end{itemize}
-
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
@@ -71,6 +63,14 @@
 \end{figure}
 %-------------------------------------------

+\begin{itemize}
+\vspace{0.5em}
+\item {\small\bfnew{资源}}：如果把机器翻译系统比作一辆汽车，资源就好比是可以使汽车运行的“汽油”，它包括很多内容，如翻译规则、双（单）语数据、知识库等翻译知识，且这些“知识”都是计算机可读的。值得一提的是,如果没有翻译资源的支持，任何机器翻译系统都无法运行起来。
+\vspace{0.5em}
+\item {\small\bfnew{系统}}：机器翻译算法的程序实现被称作系统，也就是机器翻译研究人员开发的软件。无论是翻译规则、翻译模板还是统计模型中的参数都需要通过机器翻译系统进行读取和使用。
+\vspace{0.5em}
+\end{itemize}
+
 \parinterval 构建一个强大的机器翻译系统需要“资源”和“系统”两方面共同作用。在资源方面，随着语料库语言学的发展，已经有大量的高质量的双语和单语数据（称为语料）被整理并且被电子化存储，因此可以说具备了研发机器翻译系统所需要的语料基础。特别是像英语、汉语等世界主流语种，相关语料资源已经非常丰富，这也大大加速了相关研究的进展。当然，对于一些稀缺资源语种或者特殊的领域，语料库中的语料仍然匮乏，但是这些并不影响机器翻译领域整体的发展速度。因此在现有语料库的基础上，很多研究者把精力集中在“系统”研发上。

 %----------------------------------------------------------------------------------------
@@ -129,7 +129,7 @@

 \parinterval 1957年，Noam Chomsky在\emph{Syntactic Structures}中描述了转换生成语法\upcite{chomsky1957syntactic}，并使用数学方法来研究自然语言，建立了包括上下文有关语法、上下文无关语法等4种类型的语法。这些工作最终为今天计算机中广泛使用的“形式语言”奠定了基础。而他的思想也深深地影响了同时期的语言学和自然语言处理领域的学者。特别的是，早期基于规则的机器翻译中也大量使用了这些思想。

-\parinterval 虽然在这段时间，使用机器进行翻译的议题越加火热，但是事情并不总是一帆风顺，怀疑论者对机器翻译一直存有质疑，并很容易找出一些机器翻译无法解决的问题。自然地，人们也期望能够客观地评估一下机器翻译的可行性。当时美国基金资助组织委任自动语言处理咨询会承担了这项任务。经过近两年的调查与分析，该委员会于1966年11月公布了一个题为\emph{LANGUAGE AND MACHINES}的报告（图\ref{fig:1-5}），即ALPAC报告。该报告全面否定了机器翻译的可行性，为机器翻译的研究泼了一盆冷水。
+\parinterval 虽然在这段时间，使用机器进行翻译的议题越加火热，但是事情并不总是一帆风顺，怀疑论者对机器翻译一直存有质疑，并很容易找出一些机器翻译无法解决的问题。自然地，人们也期望能够客观地评估一下机器翻译的可行性。当时美国基金资助组织委任自动语言处理咨询会承担了这项任务。经过近两年的调查与分析，该委员会于1966年11月公布了一个题为\emph{LANGUAGE AND MACHINES}的报告（图\ref{fig:1-4}），即ALPAC报告。该报告全面否定了机器翻译的可行性，为机器翻译的研究泼了一盆冷水。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -258,8 +258,6 @@

 \parinterval 规则就像语言中的“If-then”语句，如果满足条件，则执行相应的语义动作。比如，可以将待翻译句子中的某个词，使用目标语言单词进行替换，但是这种替换并非随意的，而是在语言学知识的指导下进行的。

-\parinterval 图\ref{fig:1-8}展示了一个使用转换法进行翻译的实例。这里，利用一个简单的汉译英规则库完成对句子“我对你感到满意”的翻译。当翻译“我”时，从规则库中找到规则1，该规则表示遇到单词“我”就翻译为“I”；类似地，也可以从规则库中找到规则4，该规则表示翻译调序，即将单词“you”放到“be satisfied with”后面。这种通过规则表示单词之间的对应关系也为统计机器翻译方法提供了思路。如统计机器翻译中，基于短语的翻译模型使用短语对对原文进行替换，详细描述可以参考{\chapterseven}。
-
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
@@ -269,6 +267,8 @@
 \end{figure}
 %-------------------------------------------

+\parinterval 图\ref{fig:1-8}展示了一个使用转换法进行翻译的实例。这里，利用一个简单的汉译英规则库完成对句子“我对你感到满意”的翻译。当翻译“我”时，从规则库中找到规则1，该规则表示遇到单词“我”就翻译为“I”；类似地，也可以从规则库中找到规则4，该规则表示翻译调序，即将单词“you”放到“be satisfied with”后面。这种通过规则表示单词之间的对应关系也为统计机器翻译方法提供了思路。如统计机器翻译中，基于短语的翻译模型使用短语对对原文进行替换，详细描述可以参考{\chapterseven}。
+
 \parinterval 在上述例子中可以发现，规则不仅仅可以翻译句子之间单词的对应，如规则1，还可以表示句法甚至语法之间的对应，如规则6。因此基于规则的方法可以分成多个层次，如图\ref{fig:1-9}所示。图中不同的层次表示采用不同的知识来书写规则，进而完成机器翻译过程。对于翻译问题，可以构建不同层次的基于规则的机器翻译系统。这里包括四个层次，分别为：词汇转换、句法转换、语义转换和中间语言层。其中，上层可以继承下层的翻译知识，比如说句法转换层会利用词汇转换层知识。早期基于规则的方法属于词汇转换层。

 %----------------------------------------------

--- a/Chapter10/Figures/figure-3-base-problom-of-p.tex
+++ b/Chapter10/Figures/figure-3-base-problom-of-p.tex
@@ -12,7 +12,7 @@
 				% RNN Encoder
 				\coordinate (eemb0) at (0,0);
 				\foreach \x [count=\y from 0] in {1,2,...,3}
-					\node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=west] (eemb\x) at ([xshift=0.4\base]eemb\y.east) {\tiny{$\textrm{e}_x()$}};
+					\node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=west] (eemb\x) at ([xshift=0.6\base]eemb\y.east) {\tiny{$\textrm{e}_x()$}};
 				\foreach \x in {1,2,...,3}
 					\node[rnnnode,fill=blue!30!white,anchor=south] (enc\x) at ([yshift=0.3\base]eemb\x.north) {};
 			        \node[] (enclabel1) at (enc1) {\tiny{$\mathbi{h}_{m-2}$}};
@@ -73,7 +73,7 @@
 				\draw[-latex'] (enc3.north) .. controls +(north:0.3\base) and +(east:\base) .. (bridge) .. controls +(west:2.7\base) and +(west:0.3\base) .. (dec1.west);
 				
 				{
-				\node [anchor=east] (line1) at ([xshift=-3em,yshift=0.5em]softmax1.west) {\scriptsize{基于RNN的隐层状态$\mathbi{s}_i$}};
+				\node [anchor=east] (line1) at ([xshift=-4em,yshift=0.5em]softmax1.west) {\scriptsize{基于RNN的隐层状态$\mathbi{s}_i$}};
 				\node [anchor=north west] (line2) at ([yshift=0.3em]line1.south west) {\scriptsize{预测目标词的概率}};
 				\node [anchor=north west] (line3) at ([yshift=0.3em]line2.south west) {\scriptsize{通常，用Softmax函数}};
 				\node [anchor=north west] (line4) at ([yshift=0.3em]line3.south west) {\scriptsize{实现 $\funp{P}(y_i|...)$}};
@@ -87,7 +87,7 @@
 				}
 				
 				{
-				\node [anchor=west] (line21) at ([xshift=1.3em,yshift=1.5em]enc3.east)  {\scriptsize{源语编码器最后一个}};
+				\node [anchor=west] (line21) at ([xshift=3.3em,yshift=1.5em]enc3.east)  {\scriptsize{源语编码器最后一个}};
 				\node [anchor=north west] (line22) at ([yshift=0.3em]line21.south west) {\scriptsize{循环单元的输出被}};
 				\node [anchor=north west] (line23) at ([yshift=0.3em]line22.south west) {\scriptsize{看作是句子的表示,}};
 				\node [anchor=north west] (line24) at ([yshift=0.3em]line23.south west) {\scriptsize{记为$\mathbi{C}$}};
@@ -97,21 +97,21 @@
 				{
 				\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=red!10,drop shadow,draw=red,minimum width =9em] [fit = (line1) (line2) (line3) (line4)] (box1) {};
 				\node [rectangle,inner sep=0.2em,rounded corners=1pt,very thick,dotted,draw=red] [fit = (softmax1) (softmax2) (softmax3)] (box4) {};
-				\draw [->,dotted,very thick,red] ([yshift=1em,xshift=2.5em]box1.east) -- ([yshift=1em,xshift=0.1em]box1.east);
+				\draw [->,dotted,very thick,red] ([yshift=1em,xshift=3.5em]box1.east) -- ([yshift=1em,xshift=0.1em]box1.east);
 				}
 				
 				{
 				\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=green!10,drop shadow,draw=ugreen,minimum width =9em] [fit = (line11) (line12) (line13) (line14)] (box2) {};
 				\node [rectangle,inner sep=0.2em,rounded corners=1pt,very thick,dotted,draw=ugreen] [fit = (eemb1) (eemb2) (eemb3)] (box5) {};
 				\node [rectangle,inner sep=0.2em,rounded corners=1pt,very thick,dotted,draw=ugreen] [fit = (demb1) (demb2) (demb3)] (box6) {};
-				\draw [->,dotted,very thick,ugreen] ([yshift=-1.3em,xshift=2.5em]box2.east) -- ([yshift=-1.3em,xshift=0.1em]box2.east);
+				\draw [->,dotted,very thick,ugreen] ([yshift=-1.7em,xshift=3.5em]box2.east) -- ([yshift=-1.7em,xshift=0.1em]box2.east);
 				\draw [->,dotted,very thick,ugreen] ([xshift=0.1em]box6.west) .. controls +(west:1) and +(east:1) .. ([yshift=1.0em]box2.east) ;
 				}
 				
 				{
 				\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=purple!10,drop shadow,draw=purple] [fit = (line21) (line22) (line23) (line24)] (box3) {};
 				\node [rectangle,inner sep=0.2em,rounded corners=1pt,very thick,dotted,draw=purple] [fit = (enc3)] (box7) {};
-				\draw [->,dotted,very thick,purple] ([xshift=0.1em]box7.east) -- ([xshift=0.8em]box7.east) ;
+				\draw [->,dotted,very thick,purple] ([xshift=0.1em]box7.east) -- ([xshift=2.8em]box7.east) ;
 				}
 							
 				\end{pgfonlayer}

--- a/Chapter10/Figures/figure-a-working-example-of-neural-machine-translation.tex
+++ b/Chapter10/Figures/figure-a-working-example-of-neural-machine-translation.tex
@@ -10,11 +10,11 @@
            % RNN Encoder
            \coordinate (eemb0) at (0,0);
            \foreach \x [count=\y from 0] in {1,2,...,4}
-                \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=west] (eemb\x) at ([xshift=0.4\base]eemb\y.east) {};
+                \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=west] (eemb\x) at ([xshift=0.7\base]eemb\y.east) {};
            \foreach \x in {1,2,...,4}
                \node[rnnnode,fill=blue!30!white,anchor=south] (enc\x) at ([yshift=0.5\base]eemb\x.north) {};
            \node[wordnode,left=0.4\base of enc1,font=\footnotesize] (init) {0};
- 			\node[wordnode,anchor=east] (init2) at ([xshift=-3.0em]init.west){};
+ 			\node[wordnode,anchor=east] (init2) at ([xshift=-3em]init.west){};
               {
                \node[rnnnode,fill=purple] (repr) at (enc4) {};
                \node[wordnode] (label) at ([yshift=2.5em]enc4.north) {
@@ -35,7 +35,7 @@

            % RNN Decoder
            \foreach \x in {1,2,...,4}
-                 \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=south] (demb\x) at ([xshift=9.5em,yshift=-3.9em]enc\x.north) {};
+                 \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=south] (demb\x) at ([xshift=12.5em,yshift=-3.9em]enc\x.north) {};
            \foreach \x in {1,2,...,4}
                \node[rnnnode,fill=blue!30!white,anchor=south] (dec\x) at ([yshift=0.5\base]demb\x.north) {};
            \foreach \x in {1,2,...,4}

--- a/Chapter10/Figures/figure-attention-of-source-and-target-words.tex
+++ b/Chapter10/Figures/figure-attention-of-source-and-target-words.tex
@@ -11,7 +11,7 @@
 %\setlength{\mystep}{1.6em}

 \foreach \x in {1,2,...,6}
-    \node[] (s\x) at (\x * 1.6em,0) {};
+    \node[] (s\x) at (\x * 2em,0) {};

 \node [] (ws1) at (s1) {\scriptsize{这}};
 \node [] (ws2) at (s2) {\scriptsize{是}};
@@ -21,7 +21,7 @@
 \node [] (ws6) at (s6) {\scriptsize{句子}};

 \foreach \x in {1,2,...,6}
-    \node[] (t\x) at (\x * 1.6em + 2.4in,0) {};
+    \node[] (t\x) at (\x * 2em + 2.8in,0) {};

 \node [] (wt1) at (t1) {\scriptsize{This}};
 \node [] (wt2) at (t2) {\scriptsize{is}};
@@ -30,9 +30,9 @@
 \node [] (wt5) at (t5) {\scriptsize{long}};
 \node [] (wt6) at ([xshift=1em]t6) {\scriptsize{sentence}};

-\node [anchor=south west,fill=red!30,minimum width=1.6in,minimum height=1.5em] (encoder) at ([yshift=1.0em]ws1.north west) {\footnotesize{Encoder}};
-\node [anchor=west,fill=blue!30,minimum width=1.9in,minimum height=1.5em] (decoder) at ([xshift=4.5em]encoder.east) {\footnotesize{Decoder}};
-\node [anchor=west,fill=green!30,minimum height=1.5em] (representation) at ([xshift=1em]encoder.east) {\footnotesize{表示}};
+\node [anchor=south west,fill=red!30,minimum width=1.9in,minimum height=1.5em] (encoder) at ([yshift=1.0em,xshift=-0.4em]ws1.north west) {\footnotesize{Encoder}};
+\node [anchor=west,fill=green!30,minimum height=1.5em] (representation) at ([xshift=2em]encoder.east) {\footnotesize{表示}};
+\node [anchor=west,fill=blue!30,minimum width=1.9in,minimum height=1.5em] (decoder) at ([xshift=2em]representation.east) {\footnotesize{Decoder}};
 \draw [->,thick] ([xshift=1pt]encoder.east)--([xshift=-1pt]representation.west);
 \draw [->,thick] ([xshift=1pt]representation.east)--([xshift=-1pt]decoder.west);


--- a/Chapter10/Figures/figure-beam-search-process.tex
+++ b/Chapter10/Figures/figure-beam-search-process.tex
@@ -5,7 +5,7 @@

 \begin{tikzpicture}
 \begin{scope}
-\tikzstyle{rnnnode} = [minimum height=1.1em,minimum width=3.5em,inner sep=2pt,rounded corners=1pt,draw,fill=red!20];
+\tikzstyle{rnnnode} = [minimum height=1.1em,minimum width=4em,inner sep=2pt,rounded corners=1pt,draw,fill=red!20];
 \tikzstyle{wnode} = [minimum height=1.0em,minimum width=3.5em,inner sep=2pt,rounded corners=1pt,draw,fill=white];


@@ -137,13 +137,13 @@
 }

 {
-\node [anchor=east] (vocab) at ([xshift=-5em]s1.west) {\tiny{$\begin{bmatrix} \textrm{Have} & 0.50 \\ \textrm{I} & 0.02 \\ \textrm{it} & 0.03 \\ \textrm{has} & 0.30 \\ \textrm{you} & 0.01 \\ \textrm{the} & 0.01 \\ \textrm{a} & 0.01 \\ \textrm{an} & 0.02 \\ \textrm{he} & 0.03 \\ \textrm{she} & 0.01 \\ \textrm{are} & 0.00 \\ \textrm{am} & 0.01 \\ ... & ... \end{bmatrix}$}};
+\node [anchor=east] (vocab) at ([xshift=-8em]s1.west) {\tiny{$\begin{bmatrix} \textrm{Have} & 0.50 \\ \textrm{I} & 0.02 \\ \textrm{it} & 0.03 \\ \textrm{has} & 0.30 \\ \textrm{you} & 0.01 \\ \textrm{the} & 0.01 \\ \textrm{a} & 0.01 \\ \textrm{an} & 0.02 \\ \textrm{he} & 0.03 \\ \textrm{she} & 0.01 \\ \textrm{are} & 0.00 \\ \textrm{am} & 0.01 \\ ... & ... \end{bmatrix}$}};
 \node [anchor=south] (vocablabel) at (vocab.north) {\scriptsize{单词的概率分布}};
 \draw [->,red,very thick,dotted] (o1.west) .. controls +(west:1) and +(east:2) .. ([yshift=1em]vocab.south east);
 }

 {
-\node [anchor=east,inner sep=1pt] (vocabtopn) at ([xshift=-0.5em,yshift=-0.5em]wo1.west) {\scriptsize{$\begin{bmatrix} \textrm{Have} \\ \textrm{has} \\ \textrm{it} \end{bmatrix}$}};
+\node [anchor=east,inner sep=1pt] (vocabtopn) at ([xshift=-1.5em,yshift=-0.5em]wo1.west) {\scriptsize{$\begin{bmatrix} \textrm{Have} \\ \textrm{has} \\ \textrm{it} \end{bmatrix}$}};
 \draw [->] ([yshift=-1.6em,xshift=-0.4em]vocab.north east) .. controls +(east:1) and +(west:1) ..  ([xshift=0.1em,yshift=0.4em]vocabtopn.west) node [pos=0.3,below] (topnlabel) {\scriptsize{top-3}};

 {

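Review note: the figure touched above shows a word probability distribution at the first decoding position and the "top-3" candidates taken from it. A minimal sketch of that selection step follows, using the probabilities printed in the figure. The function name `top_k` and the dictionary layout are assumptions made for illustration and are not part of the repository.

```python
def top_k(distribution, k):
    """Return the k most probable words, highest probability first.

    `distribution` maps each word to its probability. Python's sort is
    stable, so words with equal probability keep their original order.
    """
    ranked = sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Probabilities copied from the figure (the elided "..." rows are omitted).
first_position = {
    "Have": 0.50, "I": 0.02, "it": 0.03, "has": 0.30,
    "you": 0.01, "the": 0.01, "a": 0.01, "an": 0.02,
    "he": 0.03, "she": 0.01, "are": 0.00, "am": 0.01,
}

candidates = top_k(first_position, 3)
```

With these values the three candidates are `Have`, `has`, and `it`, which matches the matrix drawn in the figure. `it` and `he` tie at 0.03, and the stable sort keeps `it` because it appears first.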
--- a/Chapter10/Figures/figure-bi-rnn.tex
+++ b/Chapter10/Figures/figure-bi-rnn.tex
@@ -13,7 +13,7 @@
            % RNN Encoder
            \coordinate (eemb0) at (0,0);
            \foreach \x [count=\y from 0] in {1,2,...,10}
-                \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=west] (eemb\x) at ([xshift=0.4\base]eemb\y.east) {};
+                \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=west] (eemb\x) at ([xshift=0.7\base]eemb\y.east) {};
            \foreach \x in {1,2,...,10}
                \node[rnnnode,fill=blue!30!white,anchor=south] (backenc\x) at ([yshift=0.5\base]eemb\x.north) {};
            \foreach \x in {1,2,...,10}

--- a/Chapter10/Figures/figure-calculation-process-of-context-vector-c.tex
+++ b/Chapter10/Figures/figure-calculation-process-of-context-vector-c.tex
@@ -8,25 +8,25 @@

 \begin{scope}

-\node [anchor=west,draw,fill=red!20!white,inner sep=3pt,minimum width=2em,minimum height=1.2em] (h1) at (0,0) {\scriptsize{$\mathbi{h}_1$}};
-\node [anchor=west,draw,fill=red!20!white,inner sep=3pt,minimum width=2em,minimum height=1.2em] (h2) at ([xshift=1em]h1.east) {\scriptsize{$\mathbi{h}_2$}};
-\node [anchor=west,inner sep=0pt,minimum width=3em] (h3) at ([xshift=0.5em]h2.east) {\scriptsize{...}};
-\node [anchor=west,draw,fill=red!20!white,inner sep=3pt,minimum width=2em,minimum height=1.2em] (h4) at ([xshift=0.5em]h3.east) {\scriptsize{$\mathbi{h}_m$}};
+\node [anchor=west,draw,fill=red!30!white,inner sep=3pt,minimum width=2em,minimum height=1.2em] (h1) at (0,0) {\scriptsize{$\mathbi{h}_1$}};
+\node [anchor=west,draw,fill=red!30!white,inner sep=3pt,minimum width=2em,minimum height=1.2em] (h2) at ([xshift=2em]h1.east) {\scriptsize{$\mathbi{h}_2$}};
+\node [anchor=west,inner sep=0pt,minimum width=3em] (h3) at ([xshift=1em]h2.east) {\scriptsize{...}};
+\node [anchor=west,draw,fill=red!30!white,inner sep=3pt,minimum width=2em,minimum height=1.2em] (h4) at ([xshift=1em]h3.east) {\scriptsize{$\mathbi{h}_m$}};

 \node [anchor=south,circle,minimum size=1.0em,draw,ublue,thick] (sum) at ([yshift=2em]h2.north east) {};
 \draw [thick,-,ublue] (sum.north) -- (sum.south);
 \draw [thick,-,ublue] (sum.west) -- (sum.east);

-\node [anchor=south,draw,fill=green!20!white,inner sep=3pt,minimum width=2em,minimum height=1.2em] (th1) at ([yshift=2em,xshift=-1em]sum.north west) {\scriptsize{$\mathbi{s}_{j-1}$}};
-\node [anchor=west,draw,fill=green!20!white,inner sep=3pt,minimum width=2em,minimum height=1.2em] (th2) at ([xshift=2em]th1.east) {\scriptsize{$\mathbi{s}_{j}$}};
+\node [anchor=south,draw,fill=green!30!white,inner sep=3pt,minimum width=2em,minimum height=1.2em] (th1) at ([yshift=2em,xshift=-1em]sum.north west) {\scriptsize{$\mathbi{s}_{j-1}$}};
+\node [anchor=west,draw,fill=green!30!white,inner sep=3pt,minimum width=2em,minimum height=1.2em] (th2) at ([xshift=2.5em]th1.east) {\scriptsize{$\mathbi{s}_{j}$}};

 \draw [->] (h1.north) .. controls +(north:0.8) and +(west:1) ..  (sum.190) node [pos=0.2,left] {\scriptsize{$\alpha_{1,j}$}};
 \draw [->] (h2.north) .. controls +(north:0.6) and +(220:0.2) ..  (sum.220) node [pos=0.2,right] {\scriptsize{$\alpha_{2,j}$}};
 \draw [->] (h4.north) .. controls +(north:0.8) and +(east:1) ..  (sum.-10) node [pos=0.1,left] (alphan) {\scriptsize{$\alpha_{m,j}$}};

-\draw [->] ([xshift=-1.5em]th1.west) -- ([xshift=-0.1em]th1.west);
+\draw [->] ([xshift=-2em]th1.west) -- ([xshift=-0.1em]th1.west);
 \draw [->] ([xshift=0.1em]th1.east) -- ([xshift=-0.1em]th2.west);
-\draw [->] ([xshift=0.1em]th2.east) -- ([xshift=1.5em]th2.east);
+\draw [->] ([xshift=0.1em]th2.east) -- ([xshift=2em]th2.east);
 \draw [->] (sum.north) .. controls +(north:0.8) and +(west:0.2) ..  ([yshift=-0.4em,xshift=-0.1em]th2.west) node [pos=0.2,right] (ci) {\scriptsize{$\mathbi{C}_{j}$}};

 \node [anchor=south,inner sep=1pt] (output) at ([yshift=0.8em]th2.north) {\scriptsize{输出层}};
@@ -39,7 +39,7 @@
 \node [anchor=north] (enc42) at ([yshift=0.5em]enc4.south) {\scriptsize{(位置$m$)}};

 {
-\node [anchor=west] (math1) at ([xshift=5em,yshift=1em]th2.east) {$\mathbi{C}_j = \sum_{i} \alpha_{i,j} \mathbi{h}_i $ \ \ \ \ };
+\node [anchor=west] (math1) at ([xshift=7em,yshift=1em]th2.east) {$\mathbi{C}_j = \sum_{i} \alpha_{i,j} \mathbi{h}_i $ \ \ \ \ };
 }
 {
 \node [anchor=north west] (math2) at ([yshift=-2em,xshift=-0.2em]math1.south west) {$\alpha_{i,j} = \frac{\exp(\beta_{i,j})}{\sum_{i'} \exp(\beta_{i',j})}$};
@@ -48,10 +48,10 @@

 \begin{pgfonlayer}{background}
 {
-\node [rectangle,inner sep=0.4em,rounded corners=1pt,fill=blue!10,drop shadow,minimum width=9.5em] [fit = (math1)] (box1) {};
+\node [rectangle,inner sep=0.4em,rounded corners=1pt,fill=blue!20,drop shadow,minimum width=9.5em] [fit = (math1)] (box1) {};
 }
 {
-\node [rectangle,inner sep=0.4em,rounded corners=1pt,fill=orange!10,drop shadow,minimum width=9.5em] [fit = (math2) (math3)] (box2) {};
+\node [rectangle,inner sep=0.4em,rounded corners=1pt,fill=orange!20,drop shadow,minimum width=9.5em] [fit = (math2) (math3)] (box2) {};
 }
 \end{pgfonlayer}


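Review note: the figure touched above states $\mathbi{C}_j = \sum_{i} \alpha_{i,j} \mathbi{h}_i$ with $\alpha_{i,j}$ obtained by a softmax over scores $\beta_{i,j}$. A minimal sketch of that computation follows, assuming the scores are already given (how $\beta_{i,j}$ is computed is not visible in this hunk). The names `softmax` and `context_vector` are illustrative and not taken from the repository.

```python
import math


def softmax(scores):
    """Normalize scores beta_{i,j} into attention weights alpha_{i,j}."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(hidden_states, scores):
    """Compute C_j as the weighted sum of encoder states h_1..h_m.

    hidden_states: list of m vectors (lists of floats) of equal length.
    scores: list of m attention scores beta_{i,j} for target position j.
    """
    weights = softmax(scores)
    dim = len(hidden_states[0])
    return [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]


# Two encoder states with equal scores receive equal weight.
c_j = context_vector([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Equal scores give each state a weight of 0.5, so `c_j` is the average of the two vectors.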
--- a/Chapter10/Figures/figure-data-parallel-process.tex
+++ b/Chapter10/Figures/figure-data-parallel-process.tex
@@ -17,7 +17,7 @@
                        \node [samplenode,anchor=south west,font=\scriptsize] (batch\i) at ([shift={(-1em,-0.5em)}]batch\j.south west) {句子 \k};
                    \draw [decorate,decoration={brace}] (batch1.south east) to node [auto,rotate=30,anchor=north,font=\scriptsize] {batch大小} (batch3.south east);

-                    \node [samplenode,anchor=west,font=\scriptsize] (sample2) at ([xshift=4em]batch2.east) {句子2};
+                    \node [samplenode,anchor=west,font=\scriptsize] (sample2) at ([xshift=5em]batch2.east) {句子2};
                    \node [samplenode,anchor=south,font=\scriptsize] (sample3) at ([yshift=3em]sample2.north) {句子3};
                    \node [samplenode,anchor=north,font=\scriptsize] (sample1) at ([yshift=-3em]sample2.south) {句子1};

@@ -26,18 +26,18 @@

                    \foreach \i in {1,2,3}
                    {
-                        \coordinate (start) at ([xshift=2em]sample\i.east);
+                        \coordinate (start) at ([xshift=3em]sample\i.east);
                        \node [wordnode,anchor=west] (rnn0) at (start) {$0$};
                        \foreach \j [count=\k from 0] in {1,2,3}
                        {
-                            \node [rnnnode,anchor=west] (rnn\j) at ([xshift=1em]rnn\k.east) {};
+                            \node [rnnnode,anchor=west] (rnn\j) at ([xshift=1.5em]rnn\k.east) {};
                            \draw [-latex'] (rnn\k) to (rnn\j);
                            \coordinate (in\j) at ([yshift=-1em]rnn\j.south);
                            \draw [-latex'] (in\j) to (rnn\j.south);
                            \coordinate (out\j) at ([yshift=1em]rnn\j.north);
                            \draw [-latex'] (rnn\j.north) to (out\j);
                        }
-                        \node [wordnode,anchor=west] (rnn4) at ([xshift=1em]rnn3.east) {$\cdots$};
+                        \node [wordnode,anchor=west] (rnn4) at ([xshift=1.5em]rnn3.east) {$\cdots$};
                        \draw [-latex'] (rnn3) to (rnn4);
                        \node [draw,densely dashed,thick,rounded corners=0.3em,fit=(start) (in3) (out3) (rnn4),label={[font=\footnotesize,rotate=90,anchor=north]0:设备\i}] (rnn) {};
                        \draw [->,double] ([xshift=3pt]sample\i.east) -- ([xshift=-3pt]rnn.west);

--- a/Chapter10/Figures/figure-decode-the-word-probability-distribution-at-the-first-position.tex
+++ b/Chapter10/Figures/figure-decode-the-word-probability-distribution-at-the-first-position.tex
@@ -42,7 +42,7 @@
 }

 {
-\node [anchor=east] (vocab) at ([xshift=-5em]s1.west) {\tiny{$\begin{bmatrix} \textrm{Have} & 0.50 \\ \textrm{I} & 0.02 \\ \textrm{it} & 0.03 \\ \textrm{has} & 0.30 \\ \textrm{you} & 0.01 \\ \textrm{the} & 0.01 \\ \textrm{a} & 0.01 \\ \textrm{an} & 0.02 \\ \textrm{he} & 0.03 \\ \textrm{she} & 0.01 \\ \textrm{are} & 0.00 \\ \textrm{am} & 0.01 \\ ... & ... \end{bmatrix}$}};
+\node [anchor=east] (vocab) at ([xshift=-7em]s1.west) {\tiny{$\begin{bmatrix} \textrm{Have} & 0.50 \\ \textrm{I} & 0.02 \\ \textrm{it} & 0.03 \\ \textrm{has} & 0.30 \\ \textrm{you} & 0.01 \\ \textrm{the} & 0.01 \\ \textrm{a} & 0.01 \\ \textrm{an} & 0.02 \\ \textrm{he} & 0.03 \\ \textrm{she} & 0.01 \\ \textrm{are} & 0.00 \\ \textrm{am} & 0.01 \\ ... & ... \end{bmatrix}$}};
 \node [anchor=south] (vocablabel) at (vocab.north) {\scriptsize{单词的概率分布}};
 \draw [->,red,very thick,dotted] (o1.west) .. controls +(west:1) and +(east:2) .. ([yshift=1em]vocab.south east);
 }

--- a/Chapter10/Figures/figure-decoding-process-based-on-greedy-method.tex
+++ b/Chapter10/Figures/figure-decoding-process-based-on-greedy-method.tex
 \begin{tikzpicture}
 \begin{scope}
-\tikzstyle{rnnnode} = [minimum height=1.1em,minimum width=2.1em,inner sep=2pt,rounded corners=1pt,draw,fill=red!20];
+\tikzstyle{rnnnode} = [minimum height=1.1em,minimum width=2.7em,inner sep=2pt,rounded corners=1pt,draw,fill=red!20];

 \node [rnnnode,anchor=west] (h1) at (0,0) {\tiny{$\mathbi{h}_1$}};
 \node [anchor=west] (h2) at ([xshift=1em]h1.east) {\tiny{...}};
@@ -139,7 +139,7 @@
 \draw [->] ([yshift=-0.3em]s1.west) .. controls +(west:2) and +(-50:0.3) .. (c2.-40);
 }
 {
-\draw [->] (c2.0) -- ([xshift=1.2in]c2.0) -- ([yshift=0.3em,xshift=-1.2em]s2.west) -- ([yshift=0.3em,xshift=-0.1em]s2.west);
+\draw [->] (c2.0) -- ([xshift=1.4in]c2.0) -- ([yshift=0.3em,xshift=-1em]s2.west) -- ([yshift=0.3em,xshift=-0.1em]s2.west);
 }

 {

--- a/Chapter10/Figures/figure-double-layer-rnn.tex
+++ b/Chapter10/Figures/figure-double-layer-rnn.tex
@@ -10,7 +10,7 @@
            % RNN Encoder
            \coordinate (eemb0) at (0,0);
            \foreach \x [count=\y from 0] in {1,2,...,10}
-                \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=west] (eemb\x) at ([xshift=0.4\base]eemb\y.east) {};
+                \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=west] (eemb\x) at ([xshift=0.7\base]eemb\y.east) {};
            \foreach \x in {1,2,...,10}
                \node[rnnnode,fill=blue!30!white,anchor=south] (enc1\x) at ([yshift=0.3\base]eemb\x.north) {};
            \foreach \x in {1,2,...,10}
@@ -121,10 +121,10 @@
            }

            \coordinate (bridge) at ([yshift=1.4\base]enc16.north west);
-            \draw[-latex'] (enc210.north) .. controls +(north:0.4\base) and +(east:1.5\base) .. (bridge) .. controls +(west:8.0\base) and +(south west:0.8\base)  .. (dec21.west);
+            \draw[-latex'] (enc210.north) .. controls +(north:0.6\base) and +(east:2\base) .. (bridge) .. controls +(west:8.0\base) and +(south west:2.8\base)  .. (dec21.west);

            \coordinate (bridge) at ([yshift=1.6\base]enc16.north west);
-            \draw[-latex'] (enc110.east) .. controls +(east:0.5\base) and +(east:8\base) .. (bridge) .. controls +(west:7.5\base) and +(south west:0.1\base) .. (dec11.west);
+            \draw[-latex'] ([yshift=-0.5em]enc110.east) .. controls +(north:1.5\base) and +(east:8\base) .. (bridge) .. controls +(west:7.5\base) and +(south west:1.2\base) .. (dec11.west);

            % stack RNN
            \begin{pgfonlayer}{background}

--- a/Chapter10/Figures/figure-encoder-decoder-with-attention.tex
+++ b/Chapter10/Figures/figure-encoder-decoder-with-attention.tex
@@ -11,7 +11,7 @@
 %%% a simple encoder-decoder model
 \begin{scope}
 \foreach \x in {1,2,...,6}
-    \node[] (s\x) at (\x * 1.6em,0) {};
+    \node[] (s\x) at (\x * 2em,0) {};

 \node [] (ws1) at (s1) {\scriptsize{这}};
 \node [] (ws2) at (s2) {\scriptsize{是}};
@@ -21,7 +21,7 @@
 \node [] (ws6) at (s6) {\scriptsize{句子}};

 \foreach \x in {1,2,...,6}
-    \node[] (t\x) at (\x * 1.6em + 2.4in,0) {};
+    \node[] (t\x) at (\x * 2em + 2.8in,0) {};

 \node [] (wt1) at (t1) {\scriptsize{This}};
 \node [] (wt2) at (t2) {\scriptsize{is}};
@@ -30,9 +30,9 @@
 \node [] (wt5) at (t5) {\scriptsize{long}};
 \node [] (wt6) at ([xshift=1em]t6) {\scriptsize{sentence}};

-\node [anchor=south west,fill=red!30,minimum width=1.6in,minimum height=1.5em] (encoder) at ([yshift=1.0em]ws1.north west) {\footnotesize{Encoder}};
-\node [anchor=west,fill=blue!30,minimum width=1.9in,minimum height=1.5em] (decoder) at ([xshift=4.5em]encoder.east) {\footnotesize{Decoder}};
-\node [anchor=west,fill=green!30,minimum height=1.5em] (representation) at ([xshift=1em]encoder.east) {\footnotesize{表示}};
+\node [anchor=south west,fill=red!30,minimum width=1.9in,minimum height=1.5em] (encoder) at ([yshift=1.0em,xshift=-0.4em]ws1.north west) {\footnotesize{Encoder}};
+\node [anchor=west,fill=green!30,minimum height=1.5em] (representation) at ([xshift=2em]encoder.east) {\footnotesize{表示}};
+\node [anchor=west,fill=blue!30,minimum width=1.9in,minimum height=1.5em] (decoder) at ([xshift=2em]representation.east) {\footnotesize{Decoder}};
 \draw [->,thick] ([xshift=1pt]encoder.east)--([xshift=-1pt]representation.west);
 \draw [->,thick] ([xshift=1pt]representation.east)--([xshift=-1pt]decoder.west);

@@ -42,7 +42,7 @@
 \foreach \x in {1,2,...,5}
    \draw[<-] ([yshift=0.1em]t\x.north) -- ([yshift=1.2em]t\x.north);

-\draw[<-] ([yshift=0.1em,xshift=1em]t6.north) -- ([yshift=1.2em,xshift=1em]t6.north);
+\draw[<-,thick] ([yshift=0.1em,xshift=1em]t6.north) -- ([yshift=1.2em,xshift=1em]t6.north);
 \node [anchor=north] (cap) at ([xshift=2em,yshift=-2.5em]encoder.south east) {\small{(a) 简单的编码器-解码器框架}};

 \end{scope}
@@ -50,7 +50,7 @@
 %%% a encoder-decoder model with attention
 \begin{scope}[yshift=-1.7in]
 \foreach \x in {1,2,...,6}
-    \node[] (s\x) at (\x * 1.6em,0) {};
+    \node[] (s\x) at (\x * 2em,0) {};

 \node [] (ws1) at (s1) {\scriptsize{这}};
 \node [] (ws2) at (s2) {\scriptsize{是}};
@@ -60,7 +60,7 @@
 \node [] (ws6) at (s6) {\scriptsize{句子}};

 \foreach \x in {1,2,...,6}
-    \node[] (t\x) at (\x * 1.6em + 2.4in,0) {};
+    \node[] (t\x) at (\x * 2em + 2.8in,0) {};

 \node [] (wt1) at (t1) {\scriptsize{This}};
 \node [] (wt2) at (t2) {\scriptsize{is}};
@@ -69,8 +69,8 @@
 \node [] (wt5) at (t5) {\scriptsize{long}};
 \node [] (wt6) at ([xshift=1em]t6) {\scriptsize{sentence}};

-\node [anchor=south west,fill=red!30,minimum width=1.6in,minimum height=1.5em] (encoder) at ([yshift=1.0em]ws1.north west) {\footnotesize{Encoder}};
-\node [anchor=west,fill=blue!30,minimum width=1.9in,minimum height=1.5em] (decoder) at ([xshift=4.5em]encoder.east) {\footnotesize{Decoder}};
+\node [anchor=south west,fill=red!30,minimum width=1.9in,minimum height=1.5em] (encoder) at ([yshift=1.0em,xshift=-0.4em]ws1.north west) {\footnotesize{Encoder}};
+\node [anchor=west,fill=blue!30,minimum width=1.9in,minimum height=1.5em] (decoder) at ([xshift=6.4em]encoder.east) {\footnotesize{Decoder}};

 \foreach \x in {1,2,...,6}
    \draw[->] ([yshift=0.1em]s\x.north) -- ([yshift=1.2em]s\x.north);
@@ -78,11 +78,11 @@
 \foreach \x in {1,2,...,5}
    \draw[<-] ([yshift=0.1em]t\x.north) -- ([yshift=1.2em]t\x.north);

-\draw[<-] ([yshift=0.1em,xshift=1em]t6.north) -- ([yshift=1.2em,xshift=1em]t6.north);
+\draw[<-,thick] ([yshift=0.1em,xshift=1em]t6.north) -- ([yshift=1.2em,xshift=1em]t6.north);

-\draw [->] ([yshift=3em]s6.north) -- ([yshift=4em]s6.north) -- ([yshift=4em]t1.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c1) {\scriptsize{表示$\mathbi{C}_1$}} -- ([yshift=3em]t1.north) ;
-\draw [->] ([yshift=3em]s5.north) -- ([yshift=5.3em]s5.north) -- ([yshift=5.3em]t2.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c2) {\scriptsize{表示$\mathbi{C}_2$}} -- ([yshift=3em]t2.north) ;
-\draw [->] ([yshift=3.5em]s3.north) -- ([yshift=6.6em]s3.north) -- ([yshift=6.6em]t4.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c3) {\scriptsize{表示$\mathbi{C}_i$}} -- ([yshift=3.5em]t4.north) ;
+\draw [->,thick] ([yshift=3em]s6.north) -- ([yshift=4em]s6.north) -- ([yshift=4em]t1.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c1) {\scriptsize{表示$\mathbi{C}_1$}} -- ([yshift=3em]t1.north) ;
+\draw [->,thick] ([yshift=3em]s5.north) -- ([yshift=5.3em]s5.north) -- ([yshift=5.3em]t2.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c2) {\scriptsize{表示$\mathbi{C}_2$}} -- ([yshift=3em]t2.north) ;
+\draw [->,thick] ([yshift=3.5em]s3.north) -- ([yshift=6.6em]s3.north) -- ([yshift=6.6em]t4.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c3) {\scriptsize{表示$\mathbi{C}_i$}} -- ([yshift=3.5em]t4.north) ;
 \node [anchor=north] (smore) at ([yshift=3.5em]s3.north) {...};
 \node [anchor=north] (tmore) at ([yshift=3.5em]t4.north) {...};


--- a/Chapter10/Figures/figure-model-structure-based-on-recurrent-neural-network-translation.tex
+++ b/Chapter10/Figures/figure-model-structure-based-on-recurrent-neural-network-translation.tex
@@ -11,7 +11,7 @@
            % RNN Encoder
            \coordinate (eemb0) at (0,0);
            \foreach \x [count=\y from 0] in {1,2,...,10}
-                \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=west] (eemb\x) at ([xshift=0.4\base]eemb\y.east) {};
+                \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=west] (eemb\x) at ([xshift=0.7\base]eemb\y.east) {};
            \foreach \x in {1,2,...,10}
                \node[rnnnode,fill=blue!30!white,anchor=south] (enc\x) at ([yshift=0.5\base]eemb\x.north) {};
            \node[wordnode,left=0.4\base of enc1] (init) {$0$};
@@ -114,7 +114,7 @@

        % legend
        \begin{scope}[shift={(10\base,2.5\base)}]
-            \node[rnnnode,minimum height=0.5\base,fill=green!30!white,label={[label distance=3pt,font=\scriptsize]0:词嵌入层}] (emb) at (0,0) {};
+            \node[rnnnode,minimum height=0.5\base,fill=green!30!white,label={[label distance=3pt,font=\scriptsize]0:词嵌入层}] (emb) at (3,0) {};
            \node[rnnnode,fill=blue!30!white,anchor=north west,label={[label distance=3pt,font=\scriptsize]0:循环单元}] (rnn) at ([yshift=2\base]emb.south west) {};
            \node[rnnnode,minimum height=0.5\base,fill=red!30!white,anchor=north west,label={[label distance=3pt,font=\scriptsize]0:输出层}] (softmax) at ([yshift=2\base]rnn.south west) {};
            \node [anchor=north west] (softmax2) at ([xshift=0.6\base]softmax.south west) {\scriptsize{Softmax}};

--- a/Chapter10/Figures/figure-numbers-of-wmt-systems.tex
+++ b/Chapter10/Figures/figure-numbers-of-wmt-systems.tex
@@ -5,7 +5,7 @@
  
    \begin{tikzpicture}
        \begin{scope}[local bounding box=WMT]
-            \draw[->,thick] (0.4,0) to (9.5,0);
+            \draw[->,thick] (0.4,0) to (11.5,0);
            \draw[->,thick] (0.4,-0) to (0.4,4.3);
            \draw[thick] (0.4,2) to (0.6,2);
            \draw[thick] (0.4,4) to (0.6,4);
@@ -13,38 +13,38 @@
            \node[font=\scriptsize] at (0,4) {20};

            % 2015
-            \node[minimum width=0.5cm,thick,minimum height=7*0.2cm,fill=blue!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (smt2015) at (1.5*0.7,0.5pt) {};
-            \node[minimum width=0.5cm,thick,minimum height=2*0.2cm,fill=red!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (nmt2015) at (smt2015.south east) {};
+            \node[minimum width=0.7cm,thick,minimum height=7*0.2cm,fill=blue!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (smt2015) at (1.5*0.7,0.5pt) {};
+            \node[minimum width=0.7cm,thick,minimum height=2*0.2cm,fill=red!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (nmt2015) at (smt2015.south east) {};
            \node[font=\scriptsize,anchor=north] () at ([yshift=-0.2em]smt2015.south east) {2015};
            % 2016
-            \node[minimum width=0.5cm,thick,minimum height=3*0.2cm,fill=blue!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (smt2016) at ($(nmt2015.south east)+(0.7,0)$) {};
-            \node[minimum width=0.5cm,thick,minimum height=8*0.2cm,fill=red!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (nmt2016) at (smt2016.south east) {};
+            \node[minimum width=0.7cm,thick,minimum height=3*0.2cm,fill=blue!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (smt2016) at ($(nmt2015.south east)+(0.7,0)$) {};
+            \node[minimum width=0.7cm,thick,minimum height=8*0.2cm,fill=red!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (nmt2016) at (smt2016.south east) {};
            \node[font=\scriptsize,anchor=north] () at ([yshift=-0.2em]smt2016.south east) {2016};
            % 2017
-            \node[minimum width=0.5cm,thick,minimum height=3*0.2cm,fill=blue!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (smt2017) at ($(nmt2016.south east)+(0.7,0)$) {};
-            \node[minimum width=0.5cm,thick,minimum height=13*0.2cm,fill=red!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (nmt2017) at (smt2017.south east) {};
+            \node[minimum width=0.7cm,thick,minimum height=3*0.2cm,fill=blue!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (smt2017) at ($(nmt2016.south east)+(0.7,0)$) {};
+            \node[minimum width=0.7cm,thick,minimum height=13*0.2cm,fill=red!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (nmt2017) at (smt2017.south east) {};
            \node[font=\scriptsize,anchor=north] () at ([yshift=-0.2em]smt2017.south east) {2017};
            % 2018
-            \node[minimum width=0.5cm,thick,minimum height=0cm,draw,fill=blue!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (smt2018) at ($(nmt2017.south east)+(0.7,0)$) {};
-            \node[minimum width=0.5cm,thick,minimum height=14*0.2cm,fill=red!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (nmt2018) at (smt2018.south east) {};
+            \node[minimum width=0.7cm,thick,minimum height=0cm,draw,fill=blue!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (smt2018) at ($(nmt2017.south east)+(0.7,0)$) {};
+            \node[minimum width=0.7cm,thick,minimum height=14*0.2cm,fill=red!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (nmt2018) at (smt2018.south east) {};
            \node[font=\scriptsize,anchor=north] () at ([yshift=-0.2em]smt2018.south east) {2018};
             % 2019
-            \node[minimum width=0.5cm,thick,minimum height=0cm,draw,fill=blue!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (smt2019) at ($(nmt2018.south east)+(0.7,0)$) {};
-            \node[minimum width=0.5cm,thick,minimum height=21*0.2cm,fill=red!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (nmt2019) at (smt2019.south east) {};
+            \node[minimum width=0.7cm,thick,minimum height=0cm,draw,fill=blue!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (smt2019) at ($(nmt2018.south east)+(0.7,0)$) {};
+            \node[minimum width=0.7cm,thick,minimum height=21*0.2cm,fill=red!30!white,inner sep=0pt,outer sep=0pt,anchor=south west] (nmt2019) at (smt2019.south east) {};
            \node[font=\scriptsize,anchor=north] () at ([yshift=-0.2em]smt2019.south east) {2019};
        \end{scope}

        % legend
        \ExtractX{$(nmt2015.west)$}
        \ExtractY{$(WMT.north)$}
-        \node[minimum width=0.5cm,rectangle,fill=blue!30!white,anchor=north west,label={[label distance=1pt,font=\scriptsize]0:统计机器翻译}] () at (\XCoord,\YCoord) {};
+        \node[minimum width=0.7cm,rectangle,fill=blue!30!white,anchor=north west,label={[label distance=1pt,font=\scriptsize]0:统计机器翻译}] () at (\XCoord,\YCoord) {};
        \ExtractX{$(nmt2017.west)$}
-        \node[minimum width=0.5cm,rectangle,fill=red!30!white,anchor=north west,label={[label distance=1pt,font=\scriptsize]0:神经机器翻译}] () at (\XCoord,\YCoord) {};
+        \node[minimum width=0.7cm,rectangle,fill=red!30!white,anchor=north west,label={[label distance=1pt,font=\scriptsize]0:神经机器翻译}] () at (\XCoord,\YCoord) {};

  
       % \node[font=\normalsize,rotate=90] () at ([xshift=-1em]WMT.west) {数量};
       \node[font=\normalsize] () at (0.4,4.5) {数量};
-        \node[font=\normalsize] () at (9.5,-0.3) {年份};
+        \node[font=\normalsize] () at (11.5,-0.3) {年份};
        
        
    \end{tikzpicture}
\ No newline at end of file
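The bar chart in `figure-numbers-of-wmt-systems.tex` encodes system counts only through `minimum height` values, so the NMT share the chapter text refers to has to be derived by hand. The sketch below does that arithmetic. It assumes each `0.2cm` of bar height stands for one system (the y-axis label `20` sits at `4` cm), and the names `WMT_SYSTEMS`, `nmt_share` and `shares_by_year` are illustrative, not part of the repository.

```python
# Counts read off the bar heights in figure-numbers-of-wmt-systems.tex.
# Assumption: a height of k*0.2cm means k systems (axis label 20 sits at 4cm).
# Each entry maps a year to (SMT systems, NMT systems).
WMT_SYSTEMS = {
    2015: (7, 2),
    2016: (3, 8),
    2017: (3, 13),
    2018: (0, 14),
    2019: (0, 21),
}


def nmt_share(smt: int, nmt: int) -> float:
    """Return the fraction of one year's systems that are NMT."""
    total = smt + nmt
    if total == 0:
        raise ValueError("no systems recorded for this year")
    return nmt / total


def shares_by_year(counts):
    """Map each year to its NMT share, in chronological order."""
    return {year: nmt_share(smt, nmt) for year, (smt, nmt) in sorted(counts.items())}


if __name__ == "__main__":
    for year, share in shares_by_year(WMT_SYSTEMS).items():
        print(f"{year}: NMT share {share:.1%}")
```

Under that reading of the figure, the share rises from 2 of 9 systems in 2015 to all systems in 2018 and 2019, which matches the trend the chapter describes.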
--- a/Chapter10/Figures/figure-presentation-space.tex
+++ b/Chapter10/Figures/figure-presentation-space.tex
@@ -19,13 +19,13 @@

 \end{scope}

-\begin{scope}[xshift=1.3in]
+\begin{scope}[xshift=1.8in]
 \node [anchor=south west,draw,thick,red,minimum width=0.9in,minimum height=0.7in] (space1) at (0,0) {};
 \node [anchor=south west,fill=blue,minimum width=0.1in,minimum height=0.1in] (unit1) at (0.2,0.8) {};
 \node [anchor=south west,fill=ugreen,minimum width=0.1in,minimum height=0.1in] (unit2) at (0.7,0.3) {};

-\node [anchor=south west,draw,thick,red,minimum width=0.9in,minimum height=0.7in] (space2) at (1.1in,0) {};
-\node [anchor=south west,circle,fill=orange,minimum width=0.1in,minimum height=0.1in] (unit3) at (1.5in,1.3) {};
+\node [anchor=south west,draw,thick,red,minimum width=0.9in,minimum height=0.7in] (space2) at (1.3in,0) {};
+\node [anchor=south west,circle,fill=orange,minimum width=0.1in,minimum height=0.1in] (unit3) at (1.7in,1.3) {};

 \draw [->] ([yshift=1pt]unit1.north) .. controls +(north:0.4) and +(west:2) .. ([yshift=0.0em,xshift=-1pt]unit3.west);
 \draw [->] ([xshift=1pt]unit2.east) .. controls +(east:1.5) and +(south:1) .. ([xshift=0.0em,yshift=-1pt]unit3.south);

--- a/Chapter10/Figures/figure-query-model-corresponding-to-attention-mechanism.tex
+++ b/Chapter10/Figures/figure-query-model-corresponding-to-attention-mechanism.tex
@@ -5,26 +5,26 @@
 \tikzstyle{rnode} = [draw,minimum width=3.5em,minimum height=1.2em]

 \node [rnode,anchor=south west,fill=red!20!white] (value1) at (0,0) {\scriptsize{${h}(\textrm{你})$}};
-\node [rnode,anchor=south west,fill=red!20!white] (value2) at ([xshift=1em]value1.south east) {\scriptsize{${h}(\textrm{什么})$}};
-\node [rnode,anchor=south west,fill=red!20!white] (value3) at ([xshift=1em]value2.south east) {\scriptsize{${h}(\textrm{也})$}};
-\node [rnode,anchor=south west,fill=red!20!white] (value4) at ([xshift=1em]value3.south east) {\scriptsize{${h}(\textrm{没})$}};
+\node [rnode,anchor=south west,fill=red!20!white] (value2) at ([xshift=2em]value1.south east) {\scriptsize{${h}(\textrm{什么})$}};
+\node [rnode,anchor=south west,fill=red!20!white] (value3) at ([xshift=2em]value2.south east) {\scriptsize{${h}(\textrm{也})$}};
+\node [rnode,anchor=south west,fill=red!20!white] (value4) at ([xshift=2em]value3.south east) {\scriptsize{${h}(\textrm{没})$}};

 \node [rnode,anchor=south west,fill=green!20!white] (key1) at ([yshift=0.2em]value1.north west) {\scriptsize{${h}(\textrm{你})$}};
 \node [rnode,anchor=south west,fill=green!20!white] (key2) at ([yshift=0.2em]value2.north west) {\scriptsize{${h}(\textrm{什么})$}};
 \node [rnode,anchor=south west,fill=green!20!white] (key3) at ([yshift=0.2em]value3.north west) {\scriptsize{${h}(\textrm{也})$}};
 \node [rnode,anchor=south west,fill=green!20!white] (key4) at ([yshift=0.2em]value4.north west) {\scriptsize{${h}(\textrm{没})$}};

-\node [rnode,anchor=east] (query) at ([xshift=-2em]key1.west) {\scriptsize{${s}(\textrm{you})$}};
+\node [rnode,anchor=east] (query) at ([xshift=-4em]key1.west) {\scriptsize{${s}(\textrm{you})$}};
 \node [anchor=east] (querylabel) at ([xshift=-0.2em]query.west) {\scriptsize{query}};

 \draw [->] ([yshift=1pt,xshift=6pt]query.north) .. controls +(90:1em) and +(90:1em) .. ([yshift=1pt]key1.north);
 \draw [->] ([yshift=1pt,xshift=3pt]query.north) .. controls +(90:1.5em) and +(90:1.5em) .. ([yshift=1pt]key2.north);
 \draw [->] ([yshift=1pt]query.north) .. controls +(90:2em) and +(90:2em) .. ([yshift=1pt]key3.north);
 \draw [->] ([yshift=1pt,xshift=-3pt]query.north) .. controls +(90:2.5em) and +(90:2.5em) .. ([yshift=1pt]key4.north);
-\node [anchor=south east] (alpha1) at ([xshift=1em]key1.north east) {\scriptsize{$\alpha_1=.4$}};
-\node [anchor=south east] (alpha2) at ([xshift=1em]key2.north east) {\scriptsize{$\alpha_2=.4$}};
-\node [anchor=south east] (alpha3) at ([xshift=1em]key3.north east) {\scriptsize{$\alpha_3=.1$}};
-\node [anchor=south east] (alpha4) at ([xshift=1em]key4.north east) {\scriptsize{$\alpha_4=.1$}};
+\node [anchor=south east] (alpha1) at ([xshift=1.5em]key1.north east) {\scriptsize{$\alpha_1=0.4$}};
+\node [anchor=south east] (alpha2) at ([xshift=1.5em]key2.north east) {\scriptsize{$\alpha_2=0.4$}};
+\node [anchor=south east] (alpha3) at ([xshift=1.5em]key3.north east) {\scriptsize{$\alpha_3=0.1$}};
+\node [anchor=south east] (alpha4) at ([xshift=1.5em]key4.north east) {\scriptsize{$\alpha_4=0.1$}};

 \end{scope}
 \end{tikzpicture}
\ No newline at end of file
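The figure edited above labels four source positions with attention weights 0.4, 0.4, 0.1 and 0.1 for the query `s(you)`. As a rough illustration of how such weights turn into a context vector, the sketch below computes a weighted sum of the value vectors. The dot-product scoring function and the toy 2-dimensional vectors are assumptions made for this example only; the scoring function the chapter actually uses is not visible in this diff.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_weights(query, keys):
    """Score the query against every key (assumed dot-product scoring)."""
    return softmax([dot(query, k) for k in keys])


def context_vector(weights, values):
    """Weighted sum of the value vectors: C = sum_i alpha_i * h_i."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


if __name__ == "__main__":
    # Hypothetical encoder states standing in for the figure's four h(...) boxes.
    values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    figure_weights = [0.4, 0.4, 0.1, 0.1]  # the alpha values drawn in the figure
    print(context_vector(figure_weights, values))
```

Passing the figure's weights directly keeps the example tied to the drawing, while `attention_weights` shows one common way such weights could be produced from a query and keys.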
--- a/Chapter10/Figures/figure-query-model-corresponding-to-traditional-query-model-vs-attention-mechanism.tex
+++ b/Chapter10/Figures/figure-query-model-corresponding-to-traditional-query-model-vs-attention-mechanism.tex
@@ -5,12 +5,12 @@
 \begin{tikzpicture}
 \begin{scope}

-\tikzstyle{rnode} = [draw,minimum width=3em,minimum height=1.2em]
+\tikzstyle{rnode} = [draw,minimum width=3.5em,minimum height=1.2em]

 \node [rnode,anchor=south west,fill=blue!20!white] (value1) at (0,0) {\scriptsize{value$_1$}};
-\node [rnode,anchor=south west,fill=blue!20!white] (value2) at ([xshift=1em]value1.south east) {\scriptsize{value$_2$}};
-\node [rnode,anchor=south west,fill=red!20!white] (value3) at ([xshift=1em]value2.south east) {\scriptsize{value$_3$}};
-\node [rnode,anchor=south west,fill=blue!20!white] (value4) at ([xshift=1em]value3.south east) {\scriptsize{value$_4$}};
+\node [rnode,anchor=south west,fill=blue!20!white] (value2) at ([xshift=2em]value1.south east) {\scriptsize{value$_2$}};
+\node [rnode,anchor=south west,fill=red!20!white] (value3) at ([xshift=2em]value2.south east) {\scriptsize{value$_3$}};
+\node [rnode,anchor=south west,fill=blue!20!white] (value4) at ([xshift=2em]value3.south east) {\scriptsize{value$_4$}};

 \node [rnode,anchor=south west,pattern=north east lines] (key1) at ([yshift=0.2em]value1.north west) {};
 \node [rnode,anchor=south west,pattern=dots] (key2) at ([yshift=0.2em]value2.north west) {};
@@ -21,7 +21,7 @@
 \node [fill=white,inner sep=1pt] (key1label) at (key3) {\scriptsize{key$_3$}};
 \node [fill=white,inner sep=1pt] (key1label) at (key4) {\scriptsize{key$_4$}};

-\node [rnode,anchor=east,pattern=horizontal lines] (query) at ([xshift=-3em]key1.west) {};
+\node [rnode,anchor=east,pattern=horizontal lines] (query) at ([xshift=-4em]key1.west) {};
 \node [anchor=east] (querylabel) at ([xshift=-0.2em]query.west) {\scriptsize{query}};

 \draw [->] ([yshift=1pt]query.north) .. controls +(90:2em) and +(90:2em) .. ([yshift=1pt]key3.north) node [pos=0.5,below,yshift=0.2em] {\scriptsize{匹配}};

--- a/Chapter10/Figures/figure-relationship-between-learning-rate-and-number-of-updates.tex
+++ b/Chapter10/Figures/figure-relationship-between-learning-rate-and-number-of-updates.tex
@@ -4,7 +4,7 @@
            \begin{tikzpicture}
            \footnotesize{
                \begin{axis}[
-                    width=.60\textwidth,
+                    width=.8\textwidth,
                    height=.40\textwidth,
                    legend style={at={(0.60,0.08)}, anchor=south west},
                    xlabel={\scriptsize{更新次数}},

--- a/Chapter10/Figures/figure-score-of-mter.tex
+++ b/Chapter10/Figures/figure-score-of-mter.tex
@@ -2,7 +2,7 @@
 %\definecolor{ublue}{rgb}{0.152,0.250,0.545}
 \begin{tikzpicture}
 \begin{axis}[ 
-width=10cm, height=5cm, 
+width=12cm, height=5cm, 
 symbolic x coords={1-15,16-25,26-35,>35},
 xtick=data,
 ytick={6,12,...,28},

--- a/Chapter10/Figures/figure-structure-of-a-recurrent-network-model.tex
+++ b/Chapter10/Figures/figure-structure-of-a-recurrent-network-model.tex
@@ -6,11 +6,11 @@

 \begin{scope}[scale=0.9]
 {\Large
-\tikzstyle{rnnnode} = [draw,inner sep=5pt,minimum width=3em,minimum height=0.8em,fill=green!30!white,blur shadow={shadow xshift=1pt,shadow yshift=-1pt}]
+\tikzstyle{rnnnode} = [draw,inner sep=5pt,minimum width=3.6em,minimum height=0.8em,fill=green!30!white,blur shadow={shadow xshift=1pt,shadow yshift=-1pt}]
 \node [anchor=west,rnnnode] (node11) at (0,0) {\scriptsize{RNN Cell}};
-\node [anchor=west,rnnnode] (node12) at ([xshift=1em]node11.east) {\scriptsize{RNN Cell}};
-\node [anchor=west,rnnnode] (node13) at ([xshift=1em]node12.east) {\scriptsize{RNN Cell}};
-\node [anchor=west,rnnnode] (node14) at ([xshift=1em]node13.east) {\scriptsize{RNN Cell}};
+\node [anchor=west,rnnnode] (node12) at ([xshift=2em]node11.east) {\scriptsize{RNN Cell}};
+\node [anchor=west,rnnnode] (node13) at ([xshift=2em]node12.east) {\scriptsize{RNN Cell}};
+\node [anchor=west,rnnnode] (node14) at ([xshift=2em]node13.east) {\scriptsize{RNN Cell}};

 \node [anchor=north,rnnnode,fill=blue!30!white] (e1) at ([yshift=-1em]node11.south) {\scriptsize{}};
 \node [anchor=north,rnnnode,fill=blue!30!white] (e2) at ([yshift=-1em]node12.south) {\scriptsize{}};

--- a/Chapter10/Figures/figure-structure-of-gnmt.tex
+++ b/Chapter10/Figures/figure-structure-of-gnmt.tex
@@ -12,9 +12,9 @@
        % Encoder
        \begin{scope}
            \node[rnnnode,fill=green!20] (encemb1) at (0,0) {};
-            \node[rnnnode,fill=green!20,right=\base of encemb1] (encemb2) {};
-            \node[rnnnode,draw=white,fill=white,right=\base of encemb2] (encemb3) {$\cdots$};
-            \node[rnnnode,fill=green!20,right=\base of encemb3] (encemb4) {};
+            \node[rnnnode,fill=green!20,right=1.8\base of encemb1] (encemb2) {};
+            \node[rnnnode,draw=white,fill=white,right=1.8\base of encemb2] (encemb3) {$\cdots$};
+            \node[rnnnode,fill=green!20,right=1.8\base of encemb3] (encemb4) {};

            \node[rnnnode,above=\base of encemb1] (enc11) {};
            \node[rnnnode,above=\base of encemb2] (enc12) {};
@@ -103,10 +103,10 @@
        \node[rnnnode,fill=orange!20,minimum width=3.5cm,anchor=south west] (attention) at ([yshift=\base]enc61.north west) {注意力机制};

        \begin{scope}
-            \node[rnnnode,fill=green!20,right=2.5cm of encemb4] (decemb1) {};
-            \node[rnnnode,fill=green!20,right=\base of decemb1] (decemb2) {};
-            \node[rnnnode,draw=white,fill=white,right=\base of decemb2] (decemb3) {$\cdots$};
-            \node[rnnnode,fill=green!20,right=\base of decemb3] (decemb4) {};
+            \node[rnnnode,fill=green!20,right=3cm of encemb4] (decemb1) {};
+            \node[rnnnode,fill=green!20,right=1.8\base of decemb1] (decemb2) {};
+            \node[rnnnode,draw=white,fill=white,right=1.8\base of decemb2] (decemb3) {$\cdots$};
+            \node[rnnnode,fill=green!20,right=1.8\base of decemb3] (decemb4) {};

            \node[rnnnode,above=\base of decemb1] (dec11) {};
            \node[rnnnode,above=\base of decemb2] (dec12) {};
@@ -189,7 +189,7 @@
        \end{scope}

        % attention connections
-        \draw[-latex',rounded corners=2pt] (dec11) -| ([xshift=-0.4cm]attention.south east);
+        \draw[-latex',rounded corners=2pt] (dec11) -| ([xshift=-0.2cm]attention.south east);

        \ExtractX{$([xshift=9pt]attention.east)$}
        \ExtractY{$([yshift=2pt]dec11.north)$}
@@ -224,11 +224,11 @@
        \draw[decorate,decoration={brace,mirror}] ([xshift=5pt]dec14.south east) to node [auto,swap,font=\scriptsize,name=label2] {8层} ([xshift=5pt]dec54.north east);
        \begin{pgfonlayer}{background}
            \coordinate (tmp) at ([xshift=-4pt]label1.west);
-            \node[draw,densely dashed,rounded corners=2pt,inner sep=2pt,fit=(label1) (encword1) (attention) (tmp)] (encoder) {};
+            \node[thick,draw,densely dashed,rounded corners=2pt,inner sep=2pt,fit=(label1) (encword1) (attention) (tmp)] (encoder) {};
            \ExtractX{$([xshift=4pt]label2.east)$}
            \ExtractY{$([yshift=6pt]decoutword4.north)$}
            \coordinate (tmp) at (\XCoord,\YCoord);
-            \node[draw,densely dashed,rounded corners=2pt,inner sep=2pt,fit=(label2) (decinword1) (decoutword4) (tmp)] (decoder) {};
+            \node[thick,draw,densely dashed,rounded corners=2pt,inner sep=2pt,fit=(label2) (decinword1) (decoutword4) (tmp)] (decoder) {};
        \end{pgfonlayer}
        \node[wnode,anchor=north west] () at (encoder.north west) {编码器};
        \node[wnode,anchor=north east] () at (decoder.north east) {解码器};

--- a/Chapter10/chapter10.tex
+++ b/Chapter10/chapter10.tex
@@ -23,7 +23,7 @@

 \chapter{基于循环神经网络的模型}

-\parinterval {\small\sffamily\bfseries{神经机器翻译}} \index{神经机器翻译}（Neural Machine Translation）\index{Neural Machine Translation}是机器翻译的前沿方法。近几年，随着深度学习技术的发展和在各领域中的深入应用，基于端到端表示学习的方法正在改变着我们处理自然语言的方式，神经机器翻译在这种趋势下应运而生。一方面，神经机器翻译仍然延续着统计建模和基于数据驱动的思想，因此在基本问题的定义上与前人的研究是一致的；另一方面，神经机器翻译脱离了统计机器翻译中对隐含翻译结构的假设，同时使用分布式表示来对文字序列进行建模，这使得它可以从一个全新的视角看待翻译问题。现在，神经机器翻译已经成为了机器翻译研究及应用的热点，译文质量得到了巨大的提升。
+\parinterval {\small\bfnew{神经机器翻译}} \index{神经机器翻译}（Neural Machine Translation）\index{Neural Machine Translation}是机器翻译的前沿方法。近几年，随着深度学习技术的发展和在各领域中的深入应用，基于端到端表示学习的方法正在改变着我们处理自然语言的方式，神经机器翻译在这种趋势下应运而生。一方面，神经机器翻译仍然延续着统计建模和基于数据驱动的思想，因此在基本问题的定义上与前人的研究是一致的；另一方面，神经机器翻译脱离了统计机器翻译中对隐含翻译结构的假设，同时使用分布式表示来对文字序列进行建模，这使得它可以从一个全新的视角看待翻译问题。现在，神经机器翻译已经成为了机器翻译研究及应用的热点，译文质量得到了巨大的提升。

 \parinterval 本章将介绍神经机器翻译中的一种基础模型\ \dash \ 基于循环神经网络的模型。该模型是神经机器翻译中最早被成功应用的模型之一。基于这个模型框架，研究人员进行了大量的探索和改进工作，包括使用LSTM等循环单元结构、引入注意力机制等。这些内容都会在本章进行讨论。

@@ -121,6 +121,8 @@

 \parinterval  在很多量化的评价中也可以看到神经机器翻译的优势。回忆一下，在{\chapterfour}提到的机器翻译质量自动评估指标中，使用最广泛的一种是BLEU。2010年前，在由美国国家标准与技术研究院（NIST）举办的汉英机器翻译评测中（比如汉英MT08数据集），30\%以上的BLEU值对于基于统计方法的翻译系统来说就已经是当时最顶尖的结果了。而现在的神经机器翻译系统，则可以轻松地将BLEU提高至45\%以上。

+\parinterval  同样，在机器翻译领域的著名评测比赛WMT（Workshop on Machine Translation）中，使用统计机器翻译方法的参赛系统也在逐年减少。而现在获得比赛冠军的系统中几乎没有只使用纯统计机器翻译模型的系统\footnote{但是，仍然有大量的统计机器翻译和神经机器翻译融合的方法。比如，在无指导机器翻译中，统计机器翻译仍然被作为初始模型。} 。图\ref{fig:10-3}展示了近年来WMT比赛冠军系统中神经机器翻译系统的占比，可见神经机器翻译系统的占比在逐年提高。
+
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
@@ -130,17 +132,6 @@
 \end{figure}
 %----------------------------------------------

-\parinterval  同样，在机器翻译领域的著名评测比赛WMT（Workshop on Machine Translation）中，使用统计机器翻译方法的参赛系统也在逐年减少。而现在获得比赛冠军的系统中几乎没有只使用纯统计机器翻译模型的系统\footnote{但是，仍然有大量的统计机器翻译和神经机器翻译融合的方法。比如，在无指导机器翻译中，统计机器翻译仍然被作为初始模型。} 。图\ref{fig:10-3}展示了近年来WMT比赛冠军系统中神经机器翻译系统的占比，可见神经机器翻译系统的占比在逐年提高。
-
-%----------------------------------------------
-\begin{figure}[htp]
-\centering
-\input{./Chapter10/Figures/figure-score-of-mter}
-\caption{不同系统在不同长度句子上的mTER[\%]分值（得分越低越好）\upcite{Bentivogli2016NeuralVP}}
-\label{fig:10-4}
-\end{figure}
-%----------------------------------------------
-
 \parinterval  神经机器翻译在其他评价指标上的表现也全面超越统计机器翻译。比如，在IWSLT 2015英语-德语任务中，研究人员搭建了四个较为先进的机器翻译系统\upcite{Bentivogli2016NeuralVP}：

 \begin{itemize}
@@ -156,6 +147,15 @@

 \parinterval  与这些系统相比，神经机器翻译系统的mTER得分在不同长度句子上都有明显的下降，如图\ref{fig:10-4}所示\footnote{mTER、HTER等都是错误率度量，值越低表明译文越好。}。其次，神经机器翻译的单词形态错误率和单词词义错误率（用HTER度量）都远低于统计机器翻译系统（表\ref{tab:10-1} ）。

+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter10/Figures/figure-score-of-mter}
+\caption{不同系统在不同长度句子上的mTER[\%]分值（得分越低越好）\upcite{Bentivogli2016NeuralVP}}
+\label{fig:10-4}
+\end{figure}
+%----------------------------------------------
+
 \vspace{0.5em}%全局布局使用
 %----------------------------------------------
 \begin{table}[htp]
@@ -325,9 +325,6 @@ NMT                     & 21.7          & 18.7           & -13.7      \\

 \noindent 这里令<eos>（End of Sequence）表示序列的终止，<sos>（Start of Sequence）表示序列的开始。

-
-\parinterval 翻译过程的神经网络结构如图\ref{fig:10-7}所示，其中左边是编码器，右边是解码器。编码器会顺序处理源语言单词，将每个单词都表示成一个实数向量，也就是每个单词的词嵌入结果（绿色方框）。在词嵌入的基础上运行循环神经网络（蓝色方框）。在编码下一个时间步状态的时候，上一个时间步的隐藏状态会作为历史信息传入循环神经网络。这样，句子中每个位置的信息都被向后传递，最后一个时间步的隐藏状态（红色方框）就包含了整个源语言句子的信息，也就得到了编码器的编码结果$\ \dash\ $源语言句子的分布式表示。
-
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
@@ -337,6 +334,8 @@ NMT                     & 21.7          & 18.7           & -13.7      \\
 \end{figure}
 %----------------------------------------------

+\parinterval 翻译过程的神经网络结构如图\ref{fig:10-7}所示，其中左边是编码器，右边是解码器。编码器会顺序处理源语言单词，将每个单词都表示成一个实数向量，也就是每个单词的词嵌入结果（绿色方框）。在词嵌入的基础上运行循环神经网络（蓝色方框）。在编码下一个时间步状态的时候，上一个时间步的隐藏状态会作为历史信息传入循环神经网络。这样，句子中每个位置的信息都被向后传递，最后一个时间步的隐藏状态（红色方框）就包含了整个源语言句子的信息，也就得到了编码器的编码结果$\ \dash\ $源语言句子的分布式表示。
+
 \parinterval 解码器直接把源语言句子的分布式表示作为输入的隐层状态，之后像编码器一样依次读入目标语言单词，这是一个标准的循环神经网络的执行过程。与编码器不同的是，解码器会有一个输出层，用于根据当前时间步的隐层状态生成目标语言单词及其概率分布。可以看到，解码器当前时刻的输出单词与下一个时刻的输入单词是一样的。从这个角度说，解码器也是一种神经语言模型，只不过它会从另外一种语言（源语言）获得一些信息，而不是仅仅做单语句子的生成。具体来说，当生成第一个单词“I”时，解码器利用了源语言句子表示（红色方框）和目标语言的起始词“<sos>”。在生成第二个单词“am”时，解码器利用了上一个时间步的隐藏状态和已经生成的“I”的信息。这个过程会循环执行，直到生成完整的目标语言句子。

 \parinterval 从这个例子可以看出，神经机器翻译的流程其实并不复杂：首先通过编码器神经网络将源语言句子编码成实数向量，然后解码器神经网络利用这个向量逐词生成译文。现在几乎所有的神经机器翻译系统都采用类似的架构。
@@ -376,6 +375,7 @@ NMT                     & 21.7          & 18.7           & -13.7      \\
 %    NEW SECTION   10.3
 %----------------------------------------------------------------------------------------
 \sectionnewpage
+\vspace{-2em}
 \section{基于循环神经网络的翻译建模}

 \parinterval 早期神经机器翻译的进展主要来自两个方面：1）使用循环神经网络对单词序列进行建模；2）注意力机制的使用。表\ref{tab:10-6}列出了2013-2015年间有代表性的部分研究工作。从这些工作的内容上看，当时的研究重点还是如何有效地使用循环神经网络进行翻译建模以及使用注意力机制捕捉双语单词序列间的对应关系。
@@ -440,6 +440,7 @@ NMT                     & 21.7          & 18.7           & -13.7      \\
 %----------------------------------------------

 \parinterval 从数学模型上看，神经机器翻译模型与统计机器翻译的目标是一样的：在给定源语言句子$\seq{x}$的情况下，找出翻译概率最大的目标语言译文$\hat{\seq{y}}$，其计算如下式:
+\vspace{-1em}
 \begin{eqnarray}
 \hat{\seq{{y}}} &=& \argmax_{\seq{{y}}} \funp{P} (\seq{{y}} | \seq{{x}})
 \label{eq:10-1}
@@ -463,7 +464,7 @@ NMT                     & 21.7          & 18.7           & -13.7      \\
 \vspace{0.5em}
 \item	如何在词嵌入的基础上获取整个序列的表示，即句子的表示学习。可以把词嵌入的序列作为循环神经网络的输入，循环神经网络最后一个时刻的输出向量便是整个句子的表示结果。如图\ref{fig:10-10}中，编码器最后一个循环单元的输出$\mathbi{h}_m$被看作是一种包含了源语言句子信息的表示结果，记为$\mathbi{C}$。
 \vspace{0.5em}
-\item	如何得到每个目标语言单词的概率，即译文单词的{\small\sffamily\bfseries{生成}}\index{生成}（Generation）\index{Generation}。与神经语言模型一样，可以用一个Softmax输出层来获取当前时刻所有单词的分布，即利用Softmax 函数计算目标语言词表中每个单词的概率。令目标语言序列$j$时刻的循环神经网络的输出向量（或状态）为$\mathbi{s}_j$。根据循环神经网络的性质，$ y_j$ 的生成只依赖前一个状态$\mathbi{s}_{j-1}$和当前时刻的输入（即词嵌入$\textrm{e}_y (y_{j-1})$）。同时考虑源语言信息$\mathbi{C}$，$\funp{P}(y_j  | \seq{{y}}_{<j},\seq{{x}})$可以被重新定义为：
+\item	如何得到每个目标语言单词的概率，即译文单词的{\small\bfnew{生成}}\index{生成}（Generation）\index{Generation}。与神经语言模型一样，可以用一个Softmax输出层来获取当前时刻所有单词的分布，即利用Softmax 函数计算目标语言词表中每个单词的概率。令目标语言序列$j$时刻的循环神经网络的输出向量（或状态）为$\mathbi{s}_j$。根据循环神经网络的性质，$ y_j$ 的生成只依赖前一个状态$\mathbi{s}_{j-1}$和当前时刻的输入（即词嵌入$\textrm{e}_y (y_{j-1})$）。同时考虑源语言信息$\mathbi{C}$，$\funp{P}(y_j  | \seq{{y}}_{<j},\seq{{x}})$可以被重新定义为：
 \begin{eqnarray}
 \funp{P} (y_j | \seq{{y}}_{<j},\seq{{x}}) &=& \funp{P} ( {y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}} )
 \label{eq:10-3}
@@ -479,7 +480,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \end{eqnarray}
 \vspace{0.5em}
 \end{itemize}
-
+\vspace{-2em}
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
@@ -503,7 +504,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------
-
+\vspace{-1em}
 \subsection{长短时记忆网络}
 \label{sec:lstm-cell}

@@ -527,7 +528,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm

 \begin{itemize}
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{遗忘}}\index{遗忘}。顾名思义，遗忘的目的是忘记一些历史，在LSTM中通过遗忘门实现，其结构如图\ref{fig:10-11}(a)所示。$\mathbi{x}_{t}$表示时刻$t$的输入向量，$\mathbi{h}_{t-1}$是时刻$t-1$的循环单元的输出，$\mathbi{x}_{t}$和$\mathbi{h}_{t-1}$都作为$t$时刻循环单元的输入。$\sigma$将对$\mathbi{x}_{t}$和$\mathbi{h}_{t-1}$进行筛选，以决定遗忘的信息，其计算如下：
+\item {\small\bfnew{遗忘}}\index{遗忘}。顾名思义，遗忘的目的是忘记一些历史，在LSTM中通过遗忘门实现，其结构如图\ref{fig:10-11}(a)所示。$\mathbi{x}_{t}$表示时刻$t$的输入向量，$\mathbi{h}_{t-1}$是时刻$t-1$的循环单元的输出，$\mathbi{x}_{t}$和$\mathbi{h}_{t-1}$都作为$t$时刻循环单元的输入。$\sigma$将对$\mathbi{x}_{t}$和$\mathbi{h}_{t-1}$进行筛选，以决定遗忘的信息，其计算如下：
 \begin{eqnarray}
 \mathbi{f}_t &=& \sigma(\mathbi{W}_f [\mathbi{h}_{t-1},\mathbi{x}_{t}] + \mathbi{b}_f )
 \label{eq:10-6}
@@ -535,7 +536,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm

 这里，$\mathbi{W}_f$是权值，$\mathbi{b}_f$是偏置，$[\mathbi{h}_{t-1},\mathbi{x}_{t}]$表示两个向量的拼接。该公式可以解释为，对$[\mathbi{h}_{t-1},\mathbi{x}_{t}]$进行变换，并得到一个实数向量$\mathbi{f}_t$。$\mathbi{f}_t$的每一维都可以被理解为一个“门”，它决定可以有多少信息被留下（或遗忘）。
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{记忆更新}}\index{记忆更新}。首先，要生成当前时刻需要新增加的信息，该部分由输入门完成，其结构如图\ref{fig:10-11}(b)红色线部分，图中“$\bigotimes$”表示进行点乘操作。输入门的计算分为两部分，首先利用$\sigma$决定门控参数$\mathbi{i}_t$，如公式\eqref{eq:10-7}，然后通过Tanh函数得到新的信息$\hat{\mathbi{c}}_t$，如公式\eqref{eq:10-8}：
+\item {\small\bfnew{记忆更新}}\index{记忆更新}。首先，要生成当前时刻需要新增加的信息，该部分由输入门完成，其结构如图\ref{fig:10-11}(b)红色线部分，图中“$\bigotimes$”表示进行点乘操作。输入门的计算分为两部分，首先利用$\sigma$决定门控参数$\mathbi{i}_t$，如公式\eqref{eq:10-7}，然后通过Tanh函数得到新的信息$\hat{\mathbi{c}}_t$，如公式\eqref{eq:10-8}：
 \begin{eqnarray}
 \mathbi{i}_t & = & \sigma (\mathbi{W}_i [\mathbi{h}_{t-1},\mathbi{x}_{t}] + \mathbi{b}_i ) \label{eq:10-7} \\
 \hat{\mathbi{c}}_t & = & \textrm{Tanh} (\mathbi{W}_c [\mathbi{h}_{t-1},\mathbi{x}_{t}] + \mathbi{b}_c ) \label{eq:10-8}
@@ -547,7 +548,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \label{eq:10-9}
 \end{eqnarray}
 \vspace{-1.0em}
-\item {\small\sffamily\bfseries{输出}}\index{输出}。该部分使用输出门计算最终的输出信息$\mathbi{h}_t$，其结构如图\ref{fig:10-11}(d)红色线部分所示。在输出门中，首先将$\mathbi{x}_t$和$\mathbi{h}_{t-1}$通过$\sigma$函数变换得到$\mathbi{o}_t$，如公式\eqref{eq:10-10}。其次，将上一步得到的新记忆信息$\mathbi{c}_t$通过Tanh函数进行变换，得到值在[-1，1]范围的向量。最后将这两部分进行点乘，具体如公式\eqref{eq:10-11}：
+\item {\small\bfnew{输出}}\index{输出}。该部分使用输出门计算最终的输出信息$\mathbi{h}_t$，其结构如图\ref{fig:10-11}(d)红色线部分所示。在输出门中，首先将$\mathbi{x}_t$和$\mathbi{h}_{t-1}$通过$\sigma$函数变换得到$\mathbi{o}_t$，如公式\eqref{eq:10-10}。其次，将上一步得到的新记忆信息$\mathbi{c}_t$通过Tanh函数进行变换，得到值在[-1，1]范围的向量。最后将这两部分进行点乘，具体如公式\eqref{eq:10-11}：
 \begin{eqnarray}
 \mathbi{o}_t & = & \sigma (\mathbi{W}_o [\mathbi{h}_{t-1},\mathbi{x}_{t}] + \mathbi{b}_o ) \label{eq:10-10} \\
 \mathbi{h}_t & = & \mathbi{o}_t \cdot \textrm{Tanh} (\mathbi{c}_t) \label{eq:10-11}
@@ -725,7 +726,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \label{eq:10-16}
 \end{eqnarray}

-\noindent 其中，$\alpha_{i,j}$是{\small\sffamily\bfseries{注意力权重}}\index{注意力权重}（Attention Weight）\index{Attention Weight}，它表示目标语言第$j$个位置与源语言第$i$个位置之间的相关性大小。这里，将每个时间步编码器的输出$\mathbi{h}_i$ 看作源语言位置$i$的表示结果。进行翻译时，解码器可以根据当前的位置$j$，通过控制不同$\mathbi{h}_i$的权重得到$\mathbi{C}_j$，使得对目标语言位置$j$贡献大的$\mathbi{h}_i$对$\mathbi{C}_j$的影响增大。也就是说，$\mathbi{C}_j$实际上就是\{${\mathbi{h}_1,...,\mathbi{h}_m}$\}的一种组合，只不过不同的$\mathbi{h}_i$会根据对目标端的贡献给予不同的权重。图\ref{fig:10-19}展示了上下文向量$\mathbi{C}_j$的计算过程。
+\noindent 其中，$\alpha_{i,j}$是{\small\bfnew{注意力权重}}\index{注意力权重}（Attention Weight）\index{Attention Weight}，它表示目标语言第$j$个位置与源语言第$i$个位置之间的相关性大小。这里，将每个时间步编码器的输出$\mathbi{h}_i$ 看作源语言位置$i$的表示结果。进行翻译时，解码器可以根据当前的位置$j$，通过控制不同$\mathbi{h}_i$的权重得到$\mathbi{C}_j$，使得对目标语言位置$j$贡献大的$\mathbi{h}_i$对$\mathbi{C}_j$的影响增大。也就是说，$\mathbi{C}_j$实际上就是\{${\mathbi{h}_1,...,\mathbi{h}_m}$\}的一种组合，只不过不同的$\mathbi{h}_i$会根据对目标端的贡献给予不同的权重。图\ref{fig:10-19}展示了上下文向量$\mathbi{C}_j$的计算过程。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -1093,18 +1094,17 @@ L(\widehat{\mathbi{Y}},\mathbi{Y}) &=& \sum_{j=1}^n L_{\textrm{ce}}(\hat{\mathbi
 \item {\small\bfnew{模型并行}}\index{模型并行}。另一种思路是，把较大的模型分成若干小模型，之后在不同设备上训练小模型。对于循环神经网络，不同层的网络天然就是一个相对独立的模型，因此非常适合使用这种方法。比如，对于$l$层的循环神经网络，把每层都看做一个小模型，然后分发到$l$个设备上并行计算。在序列较长的时候，该方法可使运算时间约为原来的${1}/{l}$。图\ref{fig:10-28}以三层循环网络为例展示了对句子“你\ 很\ 不错\ 。”进行模型并行的过程。其中，每一层网络都被放到了一个设备上。当模型根据已经生成的第一个词“你”预测下一个词时（图\ref{fig:10-28}(a)），同层的下一个时刻的计算和对“你”的第二层的计算就可以同时开展（图\ref{fig:10-28}(b)）。以此类推，就完成了模型的并行计算。
 \vspace{0.5em}
 \end{itemize}
-
 %-------------------------------------------
-\begin{figure}[htp]
-\centering
-\begin{tabular}{l l}
+%\begin{figure}[htp]
+%\centering
+%\begin{tabular}{l l}
 %\subfigure[]{\input{./Chapter10/Figures/figure-process01}} &\subfigure[]{\input{./Chapter10/Figures/figure-process02}} \\
 %\subfigure[]{\input{./Chapter10/Figures/figure-process03}}  &\subfigure[]{\input{./Chapter10/Figures/figure-process04}} \\
 %\subfigure[]{\input{./Chapter10/Figures/figure-process05}}  &\subfigure[]{\input{./Chapter10/Figures/figure-process06}}\\
-\end{tabular}
+%\end{tabular}
 %\caption{一个三层循环神经网络的模型并行过程}
 %\label{fig:10-28}
-\end{figure}
+%\end{figure}
 %----------------------------------------------
 %-------------------------------------------
 \begin{figure}[htp]
@@ -1169,8 +1169,6 @@ L(\widehat{\mathbi{Y}},\mathbi{Y}) &=& \sum_{j=1}^n L_{\textrm{ce}}(\hat{\mathbi
 \vspace{0.2em}
 \parinterval 解码器的每一步Softmax层会输出所有单词的概率，由于是基于贪心的方法，这里会选择概率最大（top-1）的单词作为输出。这个过程可以参考图\ref{fig:10-30}的内容。选择分布中概率最大的单词“Have”作为得到的第一个单词，并再次送入解码器，作为第二步的输入同时预测下一个单词。以此类推，直到生成句子的终止符为止，就得到了完整的译文。

-\parinterval 贪婪搜索的优点在于速度快。在对翻译速度有较高要求的场景中，贪婪搜索是一种十分有效的系统加速方法。而且贪婪搜索的原理非常简单，易于快速实现。不过，由于每一步只保留一个最好的局部结果，贪婪搜索往往会带来翻译品质上的损失。
-
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
@@ -1180,6 +1178,8 @@ L(\widehat{\mathbi{Y}},\mathbi{Y}) &=& \sum_{j=1}^n L_{\textrm{ce}}(\hat{\mathbi
 \end{figure}
 %----------------------------------------------

+\parinterval 贪婪搜索的优点在于速度快。在对翻译速度有较高要求的场景中，贪婪搜索是一种十分有效的系统加速方法。而且贪婪搜索的原理非常简单，易于快速实现。不过，由于每一步只保留一个最好的局部结果，贪婪搜索往往会带来翻译品质上的损失。
+
 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
 %----------------------------------------------------------------------------------------
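The LSTM hunks in `Chapter10/chapter10.tex` above show the forget gate (eq. 10-6), the input gate and candidate memory (eqs. 10-7, 10-8), and the output gate (eqs. 10-10, 10-11). The following is a minimal pure-Python sketch of one step built from those formulas. The memory update `c_t = f_t * c_{t-1} + i_t * c_hat_t` is the standard LSTM form and is assumed here, since eq. 10-9 appears in the diff only by its label. The `params` dictionary layout and all function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def affine(W, v, b):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; returns (h_t, c_t)."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["W_f"], z, params["b_f"])]      # eq. 10-6
    i_t = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]      # eq. 10-7
    c_hat = [math.tanh(a) for a in affine(params["W_c"], z, params["b_c"])]  # eq. 10-8
    # Assumed standard memory update (the diff shows only the label eq. 10-9).
    c_t = [f * c + i * ch for f, c, i, ch in zip(f_t, c_prev, i_t, c_hat)]
    o_t = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]      # eq. 10-10
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                       # eq. 10-11
    return h_t, c_t


def zero_params(hidden, inputs):
    """All-zero weights and biases, handy for checking the gate arithmetic."""
    def mat():
        return [[0.0] * (hidden + inputs) for _ in range(hidden)]
    params = {}
    for gate in ("f", "i", "c", "o"):
        params["W_" + gate] = mat()
        params["b_" + gate] = [0.0] * hidden
    return params


if __name__ == "__main__":
    h_t, c_t = lstm_step([1.0, -1.0], [0.0], [2.0], zero_params(hidden=1, inputs=2))
    print(h_t, c_t)
```

With all-zero parameters every gate evaluates to 0.5 and the candidate memory to 0, so the step simply halves the previous memory, which makes the gating behaviour easy to check by hand.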

--- a/Chapter11/Figures/figure-average-pooling.tex
+++ b/Chapter11/Figures/figure-average-pooling.tex
@@ -2,32 +2,32 @@


 \begin{tikzpicture}[node distance = 0cm]
-\node(num1)[num,fill=red!10]{1};
-\node(num2)[num,below of = num1,yshift= -0.6cm,fill=red!10]{5};
-\node(num3)[num,right of = num1,xshift= 0.6cm,fill=red!10]{0};
-\node(num4)[num,below of = num3,yshift= -0.6cm,fill=red!10]{6};
-
-\node(num5)[num,right of = num3,xshift= 0.6cm,fill=green!10]{4};
-\node(num6)[num,below of = num5,yshift= -0.6cm,fill=green!10]{7};
-\node(num7)[num,right of = num5,xshift= 0.6cm,fill=green!10]{5};
-\node(num8)[num,below of = num7,yshift= -0.6cm,fill=green!10]{8};
-
-\node(num9)[num,below of = num2,yshift= -0.6cm,fill=yellow!10]{3};
-\node(num10)[num,below of = num9,yshift= -0.6cm,fill=yellow!10]{1};
-\node(num11)[num,right of = num9,xshift= 0.6cm,fill=yellow!10]{2};
-\node(num12)[num,below of = num11,yshift= -0.6cm,fill=yellow!10]{2};
-
-\node(num13)[num,right of = num11,xshift= 0.6cm,fill=blue!10]{1};
-\node(num14)[num,below of = num13,yshift= -0.6cm,fill=blue!10]{3};
-\node(num10)[num,right of = num13,xshift= 0.6cm,fill=blue!10]{0};
-\node(num16)[num,below of = num10,yshift= -0.6cm,fill=blue!10]{4};
+\node(num1)[num,fill=red!20]{1};
+\node(num2)[num,below of = num1,yshift= -0.6cm,fill=red!20]{5};
+\node(num3)[num,right of = num1,xshift= 0.6cm,fill=red!20]{0};
+\node(num4)[num,below of = num3,yshift= -0.6cm,fill=red!20]{6};
+
+\node(num5)[num,right of = num3,xshift= 0.6cm,fill=green!20]{4};
+\node(num6)[num,below of = num5,yshift= -0.6cm,fill=green!20]{7};
+\node(num7)[num,right of = num5,xshift= 0.6cm,fill=green!20]{5};
+\node(num8)[num,below of = num7,yshift= -0.6cm,fill=green!20]{8};
+
+\node(num9)[num,below of = num2,yshift= -0.6cm,fill=yellow!20]{3};
+\node(num20)[num,below of = num9,yshift= -0.6cm,fill=yellow!20]{1};
+\node(num11)[num,right of = num9,xshift= 0.6cm,fill=yellow!20]{2};
+\node(num12)[num,below of = num11,yshift= -0.6cm,fill=yellow!20]{2};
+
+\node(num13)[num,right of = num11,xshift= 0.6cm,fill=blue!20]{1};
+\node(num14)[num,below of = num13,yshift= -0.6cm,fill=blue!20]{3};
+\node(num20)[num,right of = num13,xshift= 0.6cm,fill=blue!20]{0};
+\node(num16)[num,below of = num20,yshift= -0.6cm,fill=blue!20]{4};

 \draw[->,thick]([xshift=0.4cm,yshift=-0.4cm]num8.east)--([xshift=1.5cm,yshift=-0.4cm]num8.east);

-\node(num17)[num,right of = num8,xshift= 2.5cm,fill=red!10]{3};
-\node(num18)[num,right of = num17,xshift= 0.6cm,fill=green!10]{6};
-\node(num19)[num,below of = num17,yshift=-0.6cm,fill=yellow!10]{2};
-\node(num20)[num,below of = num18,yshift= -0.6cm,fill=blue!10]{2};
+\node(num17)[num,right of = num8,xshift= 2.5cm,fill=red!20]{3};
+\node(num18)[num,right of = num17,xshift= 0.6cm,fill=green!20]{6};
+\node(num19)[num,below of = num17,yshift=-0.6cm,fill=yellow!20]{2};
+\node(num20)[num,below of = num18,yshift= -0.6cm,fill=blue!20]{2};

 \node [right of = num2,xshift= -0.7cm]{};


--- a/Chapter11/Figures/figure-convolution-kernel.tex
+++ b/Chapter11/Figures/figure-convolution-kernel.tex
 \usetikzlibrary{decorations.pathreplacing}
-\tikzstyle{num} = [rectangle, minimum width = 0.8cm, minimum height = 0.8cm, text centered,align=center,thick,draw = black,fill=green!5]
+\tikzstyle{num} = [rectangle, minimum width = 0.8cm, minimum height = 0.8cm, text centered,align=center,thick,draw = black,fill=green!15]
 \tikzstyle{att} = [rectangle, minimum width = 0.8cm, minimum height = 0.8cm, text centered,align=center]



--- a/Chapter11/Figures/figure-deep-vs-light.tex
+++ b/Chapter11/Figures/figure-deep-vs-light.tex
@@ -30,7 +30,7 @@
 	\node [anchor=north,font=\tiny,scale=1.1] at ([yshift=-0.2em]t2.south) {深度卷积};
 	\end{scope}

-	\begin{scope}[xshift=4cm]
+	\begin{scope}[xshift=5cm]
 	\foreach \point in {0,1,2,3}{
 	\node[unit] (l1_\point) at (0, 0+\point*1.5em){};
 	\node[unit] (l2_\point) at (0, 8em+\point*1.5em){};

--- a/Chapter11/Figures/figure-dimension-transformation.tex
+++ b/Chapter11/Figures/figure-dimension-transformation.tex
-\tikzstyle{num} = [minimum width = 0.6cm,minimum height = 0.6cm,draw,fill=green!10]
+\tikzstyle{num} = [minimum width = 0.6cm,minimum height = 0.6cm,draw,fill=green!20]
 \begin{tikzpicture}[node distance = 0]

 \node[num] at (0,0){1};
@@ -43,7 +43,7 @@
 \node[num] at (2.4,3){2};
 \node[num] at (3,3){1};

-\node[minimum width = 1.8cm,minimum height = 1.8cm,draw=purple!40,line width=0.08cm,fill=purple!40,fill opacity=0.4] at (0.6,2.4) {};
+\node[minimum width = 1.8cm,minimum height = 1.8cm,draw=purple!40,line width=0.08cm,fill=purple!50,fill opacity=0.4] at (0.6,2.4) {};

 %\fill (4,1.5) circle (2pt);
 \node [] at (4,1.5) {*};

--- a/Chapter11/Figures/figure-full-connection-vs-cnn-a.tex
+++ b/Chapter11/Figures/figure-full-connection-vs-cnn-a.tex

 \begin{tikzpicture}[node distance = 0]
 \foreach \x in {0,1,...,6}{
-\draw[color=ublue,fill=blue!15,thick]( \x,0 )circle( 0.3);
+\draw[color=ublue,fill=blue!25,thick]( \x,0 )circle( 0.3);
 }
 \foreach \x in {1,2,...,5}{
 \draw[color=ublue,thick]( \x,2)circle( 0.3);

--- a/Chapter11/Figures/figure-full-connection-vs-cnn-b.tex
+++ b/Chapter11/Figures/figure-full-connection-vs-cnn-b.tex

 \begin{tikzpicture}[node distance = 0]
 \foreach \x in {0,1,...,6}{
-\draw[color=ublue,fill=blue!15,thick]( \x,0 )circle( 0.3);
+\draw[color=ublue,fill=blue!25,thick]( \x,0 )circle( 0.3);
 }
 \foreach \x in {1,2,...,5}{
 \draw[color=ublue,thick]( \x,2)circle( 0.3);

--- a/Chapter11/Figures/figure-max-pooling.tex
+++ b/Chapter11/Figures/figure-max-pooling.tex
@@ -2,32 +2,32 @@


 \begin{tikzpicture}[node distance = 0cm]
-\node(num1)[num,fill=red!10]{1};
-\node(num2)[num,below of = num1,yshift= -0.6cm,fill=red!10]{5};
-\node(num3)[num,right of = num1,xshift= 0.6cm,fill=red!10]{0};
-\node(num4)[num,below of = num3,yshift= -0.6cm,fill=red!10]{6};
-
-\node(num5)[num,right of = num3,xshift= 0.6cm,fill=green!10]{4};
-\node(num6)[num,below of = num5,yshift= -0.6cm,fill=green!10]{7};
-\node(num7)[num,right of = num5,xshift= 0.6cm,fill=green!10]{5};
-\node(num8)[num,below of = num7,yshift= -0.6cm,fill=green!10]{8};
-
-\node(num9)[num,below of = num2,yshift= -0.6cm,fill=yellow!10]{3};
-\node(num10)[num,below of = num9,yshift= -0.6cm,fill=yellow!10]{1};
-\node(num11)[num,right of = num9,xshift= 0.6cm,fill=yellow!10]{2};
-\node(num12)[num,below of = num11,yshift= -0.6cm,fill=yellow!10]{2};
-
-\node(num13)[num,right of = num11,xshift= 0.6cm,fill=blue!10]{1};
-\node(num14)[num,below of = num13,yshift= -0.6cm,fill=blue!10]{3};
-\node(num10)[num,right of = num13,xshift= 0.6cm,fill=blue!10]{0};
-\node(num16)[num,below of = num10,yshift= -0.6cm,fill=blue!10]{4};
+\node(num1)[num,fill=red!20]{1};
+\node(num2)[num,below of = num1,yshift= -0.6cm,fill=red!20]{5};
+\node(num3)[num,right of = num1,xshift= 0.6cm,fill=red!20]{0};
+\node(num4)[num,below of = num3,yshift= -0.6cm,fill=red!20]{6};
+
+\node(num5)[num,right of = num3,xshift= 0.6cm,fill=green!20]{4};
+\node(num6)[num,below of = num5,yshift= -0.6cm,fill=green!20]{7};
+\node(num7)[num,right of = num5,xshift= 0.6cm,fill=green!20]{5};
+\node(num8)[num,below of = num7,yshift= -0.6cm,fill=green!20]{8};
+
+\node(num9)[num,below of = num2,yshift= -0.6cm,fill=yellow!20]{3};
+\node(num10)[num,below of = num9,yshift= -0.6cm,fill=yellow!20]{1};
+\node(num11)[num,right of = num9,xshift= 0.6cm,fill=yellow!20]{2};
+\node(num12)[num,below of = num11,yshift= -0.6cm,fill=yellow!20]{2};
+
+\node(num13)[num,right of = num11,xshift= 0.6cm,fill=blue!20]{1};
+\node(num14)[num,below of = num13,yshift= -0.6cm,fill=blue!20]{3};
+\node(num10)[num,right of = num13,xshift= 0.6cm,fill=blue!20]{0};
+\node(num16)[num,below of = num10,yshift= -0.6cm,fill=blue!20]{4};

 \draw[->,thick]([xshift=0.4cm,yshift=-0.4cm]num8.east)--([xshift=1.5cm,yshift=-0.4cm]num8.east);

-\node(num17)[num,right of = num8,xshift= 2.5cm,fill=red!10]{6};
-\node(num18)[num,right of = num17,xshift= 0.6cm,fill=green!10]{8};
-\node(num19)[num,below of = num17,yshift=-0.6cm,fill=yellow!10]{3};
-\node(num20)[num,below of = num18,yshift= -0.6cm,fill=blue!10]{4};
+\node(num17)[num,right of = num8,xshift= 2.5cm,fill=red!20]{6};
+\node(num18)[num,right of = num17,xshift= 0.6cm,fill=green!20]{8};
+\node(num19)[num,below of = num17,yshift=-0.6cm,fill=yellow!20]{3};
+\node(num20)[num,below of = num18,yshift= -0.6cm,fill=blue!20]{4};

 \node [right of = num20,xshift= 0.7cm]{};


--- a/Chapter11/Figures/figure-padding-and-conv.tex
+++ b/Chapter11/Figures/figure-padding-and-conv.tex
-\tikzstyle{num} = [minimum width = 0.6cm,minimum height = 0.6cm,draw,fill=green!10]
-\tikzstyle{pad} = [minimum width = 0.6cm,minimum height = 0.6cm,draw,fill=blue!10]
+\tikzstyle{num} = [minimum width = 0.6cm,minimum height = 0.6cm,draw,fill=green!20]
+\tikzstyle{pad} = [minimum width = 0.6cm,minimum height = 0.6cm,draw,fill=blue!20]
 \begin{tikzpicture}[node distance = 0]

 \node[pad] at (-0.6,-0.6){0};
@@ -74,8 +74,8 @@
 \node[pad] at (3,3.6){0};
 \node[pad] at (3.6,3.6){0};

-\node[minimum width = 1.8cm,minimum height = 1.8cm,draw=purple!40,line width=0.08cm,fill=purple!40,fill opacity=0.4] at (0,3) {};
-\node[minimum width = 1.8cm,minimum height = 1.8cm,draw=orange!40,line width=0.08cm,fill=orange!40,fill opacity=0.4] at (0.6,2.4) {};
+\node[minimum width = 1.8cm,minimum height = 1.8cm,draw=purple!40,line width=0.08cm,fill=purple!50,fill opacity=0.4] at (0,3) {};
+\node[minimum width = 1.8cm,minimum height = 1.8cm,draw=orange!40,line width=0.08cm,fill=orange!50,fill opacity=0.4] at (0.6,2.4) {};

 %\fill (4.55,1.5) circle (2pt);
 \node [] at (4.55,1.5) {*};

--- a/Chapter11/Figures/figure-single-glu.tex
+++ b/Chapter11/Figures/figure-single-glu.tex
@@ -6,16 +6,16 @@
 	\tikzstyle{word} = [inner sep=0pt, font=\scriptsize,minimum height=\bd]
 	
 	%\draw[fill=blue!8,xshift=0.3cm,yshift=0.5cm,line width=0.2pt] (0cm,0cm) rectangle (0cm+6*\bd,0cm+9*\bd);
-	\draw[fill=cyan!10] (0cm,0cm) -- (4*\bd,0cm) -- (4*\bd,8*\bd) -- (0cm, 8*\bd) -- (0cm,0cm);
+	\draw[fill=cyan!20] (0cm,0cm) -- (4*\bd,0cm) -- (4*\bd,8*\bd) -- (0cm, 8*\bd) -- (0cm,0cm);
 	\draw[cyan,step=\bd,line width=0.8pt] (0cm,0cm) grid (0cm+4*\bd,0cm+8*\bd);
-	\draw[fill=red!5] (0cm,1*\bd) -- (4*\bd,1*\bd) -- (4*\bd,7*\bd) -- (0cm, 7*\bd) -- (0cm,1*\bd);
-	\draw[red!50,step=\bd,xshift=0cm,yshift=\bd, line width=0.8pt] (0cm, 0cm) grid (0cm+4*\bd,0cm+6*\bd);
-	\draw[fill=ugreen!5] (7*\bd,\bd) -- (11*\bd,\bd) -- (11*\bd,7*\bd) -- (7*\bd, 7*\bd) -- (7*\bd,\bd);
-	\draw[ugreen!60,step=\bd,xshift=7*\bd,yshift=\bd, line width=0.8pt] (0cm, 0cm) grid (0cm+4*\bd,0cm+6*\bd);
-	\draw[fill=ugreen!5] (11.5*\bd,\bd) -- (15.5*\bd,\bd) -- (15.5*\bd,7*\bd) -- (11.5*\bd, 7*\bd) -- (11.5*\bd,\bd);
-	\draw[ugreen!60,step=\bd,xshift=11.5*\bd,yshift=\bd, line width=0.8pt] (0cm, 0cm) grid (0cm+4*\bd,0cm+6*\bd);
-	\draw[fill=blue!5] (18.5*\bd,\bd) -- (22.5*\bd,\bd) -- (22.5*\bd,7*\bd) -- (18.5*\bd, 7*\bd) -- (18.5*\bd,\bd);
-	\draw[blue!50,step=\bd,xshift=18.5*\bd,yshift=\bd, line width=0.8pt] (0cm, 0cm) grid (0cm+4*\bd,0cm+6*\bd);
+	\draw[fill=red!10] (0cm,1*\bd) -- (4*\bd,1*\bd) -- (4*\bd,7*\bd) -- (0cm, 7*\bd) -- (0cm,1*\bd);
+	\draw[red!60,step=\bd,xshift=0cm,yshift=\bd, line width=0.8pt] (0cm, 0cm) grid (0cm+4*\bd,0cm+6*\bd);
+	\draw[fill=ugreen!10] (7*\bd,\bd) -- (11*\bd,\bd) -- (11*\bd,7*\bd) -- (7*\bd, 7*\bd) -- (7*\bd,\bd);
+	\draw[ugreen!70,step=\bd,xshift=7*\bd,yshift=\bd, line width=0.8pt] (0cm, 0cm) grid (0cm+4*\bd,0cm+6*\bd);
+	\draw[fill=ugreen!10] (11.5*\bd,\bd) -- (15.5*\bd,\bd) -- (15.5*\bd,7*\bd) -- (11.5*\bd, 7*\bd) -- (11.5*\bd,\bd);
+	\draw[ugreen!70,step=\bd,xshift=11.5*\bd,yshift=\bd, line width=0.8pt] (0cm, 0cm) grid (0cm+4*\bd,0cm+6*\bd);
+	\draw[fill=blue!10] (18.5*\bd,\bd) -- (22.5*\bd,\bd) -- (22.5*\bd,7*\bd) -- (18.5*\bd, 7*\bd) -- (18.5*\bd,\bd);
+	\draw[blue!60,step=\bd,xshift=18.5*\bd,yshift=\bd, line width=0.8pt] (0cm, 0cm) grid (0cm+4*\bd,0cm+6*\bd);
 	
 	
 	\draw[-,line width=0.6pt] (4*\bd, 0.5*\bd) -- (5.5*\bd, 1.5*\bd) -- (4*\bd, 2.5*\bd);
@@ -44,7 +44,7 @@
 	\node[word] (w7) at ([yshift=-\bd]w6) {。};
 	\node[word] (w8) at ([yshift=-\bd]w7) {$<$p$>$};

-	\node[inner sep=2pt,word,fill=green!5,draw=ugreen,dotted,very thick,minimum width=6em,align=center,minimum height=4em] (sub) at ([xshift=11cm,yshift=-0.1cm]w1.east) {
+	\node[inner sep=2pt,word,fill=green!10,draw=ugreen,dotted,very thick,minimum width=6em,align=center,minimum height=4em] (sub) at ([xshift=11cm,yshift=-0.1cm]w1.east) {
 \begin{tabular}{r l}
 $<$p$>$：& 填充 \\
 $\sigma$：& sigmoid函数 \\

--- a/Chapter11/Figures/figure-standard-convolution-neural-network-module.tex
+++ b/Chapter11/Figures/figure-standard-convolution-neural-network-module.tex
-\tikzstyle{input} = [rectangle, minimum width = 1cm, minimum height = 3cm, text centered]
-\tikzstyle{output} = [rectangle, minimum width = 1cm, minimum height = 3cm, text centered]
-\tikzstyle{convolution} = [rectangle, minimum width = 0.7cm, minimum height = 2cm, text centered, fill = red!10, draw = black, thick]
-\tikzstyle{activation} = [rectangle, minimum width = 0.7cm, minimum height = 2cm, text centered, fill = blue!10, draw = black, thick]
-\tikzstyle{pooling} = [rectangle, thick, minimum width = 0.7cm, minimum height = 2cm, text centered, draw = black, fill = ugreen!10]
+\tikzstyle{input} = [rectangle, minimum width = 1.5cm, minimum height = 3cm, text centered]
+\tikzstyle{output} = [rectangle, minimum width = 1.5cm, minimum height = 3cm, text centered]
+\tikzstyle{convolution} = [rectangle, minimum width = 1cm, minimum height = 2cm, text centered, fill = red!20, draw = black, thick]
+\tikzstyle{activation} = [rectangle, minimum width = 1cm, minimum height = 2cm, text centered, fill = blue!20, draw = black, thick]
+\tikzstyle{pooling} = [rectangle, thick, minimum width = 1cm, minimum height = 2cm, text centered, draw = black, fill = ugreen!20]
 \tikzstyle{arrow} = [thick, ->, >=stealth]

 \begin{tikzpicture}[node distance = 0cm]
 \node(input)[input, align=center]{输\\入};
-\node(convolution)[convolution,right of = input,xshift = 2cm, align=center]{卷\\积\\层};
-\node(activation)[activation,right of = convolution,xshift = 2cm, align=center]{激\\活\\函\\数};
-\node(pooling)[pooling,right of = activation,xshift = 2cm, align=center]{池\\化\\层};
-\node(output)[output,right of = pooling,xshift= 2cm, align=center]{输\\出};
+\node(convolution)[convolution,right of = input,xshift = 2.5cm, align=center]{卷\\积\\层};
+\node(activation)[activation,right of = convolution,xshift = 2.5cm, align=center]{激\\活\\函\\数};
+\node(pooling)[pooling,right of = activation,xshift = 2.5cm, align=center]{池\\化\\层};
+\node(output)[output,right of = pooling,xshift= 2.5cm, align=center]{输\\出};

 \draw [arrow] (input) -- ([xshift=-0.15cm]convolution.180);
 \draw [arrow] ([xshift=0.15cm]convolution.0) -- ([xshift=-0.15cm]activation.180);

--- a/Chapter11/Figures/figure-standard.tex
+++ b/Chapter11/Figures/figure-standard.tex
@@ -43,7 +43,7 @@
 	\node [anchor=north,font=\tiny,scale=1.1] at ([yshift=-0.2em]t1.south) {(a) 标准卷积};
 	\end{scope}
 	
-	\begin{scope}[xshift=4cm]
+	\begin{scope}[xshift=5cm]
 	\foreach \point in {0,1,2}{
 	\node[unit] (l1_\point) at (0, 0+\point*1.5em){};
 	\node[unit] (l2_\point) at (0, 6em+\point*1.5em){};
@@ -77,7 +77,7 @@
 	\node [anchor=north,font=\tiny,scale=1.1] at ([yshift=-0.2em]t2.south) {(b) 深度卷积};
 	\end{scope}
 	
-	\begin{scope}[xshift=8cm]
+	\begin{scope}[xshift=10cm]
 	\foreach \point in {0,1,2}{
 	\node[unit] (l1_\point) at (0, 0+\point*1.5em){};
 	\node[unit] (l2_\point) at (0, 6em+\point*1.5em){};

--- a/Chapter11/Figures/figure-structural-comparison-a.tex
+++ b/Chapter11/Figures/figure-structural-comparison-a.tex
@@ -4,13 +4,13 @@

 \begin{tikzpicture}[node distance = 0cm]
 \node(num1)[num]{$\mathbi{e}_1$};
-\node(num2)[num,right of = num1,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_2$}};
-\node(num3)[num,right of = num2,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_3$}};
-\node(num4)[num,right of = num3,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_4$}};
-\node(num5)[num,right of = num4,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_5$}};
-\node(num6)[num,right of = num5,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_6$}};
-\node(num7)[num,right of = num6,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_7$}};
-\node(num8)[num,right of = num7,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_8$}};
+\node(num2)[num,right of = num1,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_2$}};
+\node(num3)[num,right of = num2,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_3$}};
+\node(num4)[num,right of = num3,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_4$}};
+\node(num5)[num,right of = num4,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_5$}};
+\node(num6)[num,right of = num5,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_6$}};
+\node(num7)[num,right of = num6,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_7$}};
+\node(num8)[num,right of = num7,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_8$}};
 \node(num9)[num,right of = num8,xshift = 1.2cm]{$\mathbi{e}_9$};
 %\node(A)[below of = num2,yshift = -0.6cm]{A};
 %\node(B)[below of = num8,yshift = -0.6cm]{B};
@@ -23,8 +23,8 @@
 \draw [->, thick, color = blue!80](num6.east)--(num7.west);
 \draw [->, thick, color = blue!80](num7.east)--(num8.west);

-\draw [->,thick,color = black!70] (num1) -- (num2);
-\draw [->,thick,color =black!70] (num8) -- (num9);
+\draw [->,thick,color = black!85] (num1) -- (num2);
+\draw [->,thick,color =black!85] (num8) -- (num9);


 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter11/Figures/figure-structural-comparison-b.tex
+++ b/Chapter11/Figures/figure-structural-comparison-b.tex
@@ -4,13 +4,13 @@
 \begin{tikzpicture}[node distance = 0cm]
 \node(num1_0)[num, fill = blue!40]{\textcolor{white}{$\mathbi{0}$}};
 \node(num1_1)[num,right of = num1_0,xshift = 1.2cm]{$\mathbi{e}_1$};
-\node(num1_2)[num,right of = num1_1,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_2$}};
-\node(num1_3)[num,right of = num1_2,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_3$}};
-\node(num1_4)[num,right of = num1_3,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_4$}};
-\node(num1_5)[num,right of = num1_4,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_5$}};
-\node(num1_6)[num,right of = num1_5,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_6$}};
-\node(num1_7)[num,right of = num1_6,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_7$}};
-\node(num1_8)[num,right of = num1_7,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_8$}};
+\node(num1_2)[num,right of = num1_1,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_2$}};
+\node(num1_3)[num,right of = num1_2,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_3$}};
+\node(num1_4)[num,right of = num1_3,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_4$}};
+\node(num1_5)[num,right of = num1_4,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_5$}};
+\node(num1_6)[num,right of = num1_5,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_6$}};
+\node(num1_7)[num,right of = num1_6,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_7$}};
+\node(num1_8)[num,right of = num1_7,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_8$}};
 \node(num1_9)[num,right of = num1_8,xshift = 1.2cm]{$\mathbi{e}_9$};
 \node(num1_10)[num,right of = num1_9,xshift = 1.2cm, fill = blue!40]{$\mathbi{0}$};
 %\node(A)[below of = num2,yshift = -0.6cm]{A};
@@ -19,11 +19,11 @@
 \node(num2_0)[num,above of = num1_0,yshift = 1.2cm, fill = blue!40]{\textcolor{white}{$\mathbi{0}$}};
 \node(num2_1)[num,right of = num2_0,xshift = 1.2cm]{\textbf2};
 \node(num2_2)[num,right of = num2_1,xshift = 1.2cm]{\textbf2};
-\node(num2_3)[num,right of = num2_2,xshift = 1.2cm]{\textbf{\textcolor{blue!70}2}};
-\node(num2_4)[num,right of = num2_3,xshift = 1.2cm]{\textbf{\textcolor{blue!70}2}};
-\node(num2_5)[num,right of = num2_4,xshift = 1.2cm]{\textbf{\textcolor{blue!70}2}};
-\node(num2_6)[num,right of = num2_5,xshift = 1.2cm]{\textbf{\textcolor{blue!70}2}};
-\node(num2_7)[num,right of = num2_6,xshift = 1.2cm]{\textbf{\textcolor{blue!70}2}};
+\node(num2_3)[num,right of = num2_2,xshift = 1.2cm]{\textbf{\textcolor{blue!85}2}};
+\node(num2_4)[num,right of = num2_3,xshift = 1.2cm]{\textbf{\textcolor{blue!85}2}};
+\node(num2_5)[num,right of = num2_4,xshift = 1.2cm]{\textbf{\textcolor{blue!85}2}};
+\node(num2_6)[num,right of = num2_5,xshift = 1.2cm]{\textbf{\textcolor{blue!85}2}};
+\node(num2_7)[num,right of = num2_6,xshift = 1.2cm]{\textbf{\textcolor{blue!85}2}};
 \node(num2_8)[num,right of = num2_7,xshift = 1.2cm]{\textbf2};
 \node(num2_9)[num,right of = num2_8,xshift = 1.2cm]{\textbf2};
 \node(num2_10)[num,right of = num2_9,xshift = 1.2cm, fill = blue!40]{$\mathbi{0}$};
@@ -32,9 +32,9 @@
 \node(num3_1)[num,right of = num3_0,xshift = 1.2cm]{\textbf3};
 \node(num3_2)[num,right of = num3_1,xshift = 1.2cm]{\textbf3};
 \node(num3_3)[num,right of = num3_2,xshift = 1.2cm]{\textbf3};
-\node(num3_4)[num,right of = num3_3,xshift = 1.2cm]{\textbf{\textcolor{blue!70}3}};
-\node(num3_5)[num,right of = num3_4,xshift = 1.2cm]{\textbf{\textcolor{blue!70}3}};
-\node(num3_6)[num,right of = num3_5,xshift = 1.2cm]{\textbf{\textcolor{blue!70}3}};
+\node(num3_4)[num,right of = num3_3,xshift = 1.2cm]{\textbf{\textcolor{blue!85}3}};
+\node(num3_5)[num,right of = num3_4,xshift = 1.2cm]{\textbf{\textcolor{blue!85}3}};
+\node(num3_6)[num,right of = num3_5,xshift = 1.2cm]{\textbf{\textcolor{blue!85}3}};
 \node(num3_7)[num,right of = num3_6,xshift = 1.2cm]{\textbf3};
 \node(num3_8)[num,right of = num3_7,xshift = 1.2cm]{\textbf3};
 \node(num3_9)[num,right of = num3_8,xshift = 1.2cm]{\textbf3};
@@ -45,7 +45,7 @@
 \node(num4_2)[num,right of = num4_1,xshift = 1.2cm]{\textbf4};
 \node(num4_3)[num,right of = num4_2,xshift = 1.2cm]{\textbf4};
 \node(num4_4)[num,right of = num4_3,xshift = 1.2cm]{\textbf4};
-\node(num4_5)[num,right of = num4_4,xshift = 1.2cm]{\textbf{\textcolor{blue!60}4}};
+\node(num4_5)[num,right of = num4_4,xshift = 1.2cm]{\textbf{\textcolor{blue!80}4}};
 \node(num4_6)[num,right of = num4_5,xshift = 1.2cm]{\textbf4};
 \node(num4_7)[num,right of = num4_6,xshift = 1.2cm]{\textbf4};
 \node(num4_8)[num,right of = num4_7,xshift = 1.2cm]{\textbf4};
@@ -58,19 +58,19 @@
 \draw [->, thick](num1_1.north)--([xshift=-0.1em,yshift=-0.1em]num2_2.south);
 \draw [->, thick](num2_1.north)--([xshift=-0.1em,yshift=-0.1em]num3_2.south);
 \draw [->, thick](num3_1.north)--([xshift=-0.1em,yshift=-0.1em]num4_2.south);
-\draw [->, thick, color = blue!60](num1_2.north)--([xshift=-0.1em,yshift=-0.1em]num2_3.south);
+\draw [->, thick, color = blue!80](num1_2.north)--([xshift=-0.1em,yshift=-0.1em]num2_3.south);
 \draw [->, thick](num2_2.north)--([xshift=-0.1em,yshift=-0.1em]num3_3.south);
 \draw [->, thick](num3_2.north)--([xshift=-0.1em,yshift=-0.1em]num4_3.south);
-\draw [->, thick, color = blue!60](num1_3.north)--([xshift=-0.1em,yshift=-0.1em]num2_4.south);
-\draw [->, thick, color = blue!60](num2_3.north)--([xshift=-0.1em,yshift=-0.1em]num3_4.south);
+\draw [->, thick, color = blue!80](num1_3.north)--([xshift=-0.1em,yshift=-0.1em]num2_4.south);
+\draw [->, thick, color = blue!80](num2_3.north)--([xshift=-0.1em,yshift=-0.1em]num3_4.south);
 \draw [->, thick](num3_3.north)--([xshift=-0.1em,yshift=-0.1em]num4_4.south);
-\draw [->, thick, color = blue!60](num1_4.north)--([xshift=-0.1em,yshift=-0.1em]num2_5.south);
-\draw [->, thick, color = blue!60](num2_4.north)--([xshift=-0.1em,yshift=-0.1em]num3_5.south);
-\draw [->, thick, color = blue!60](num3_4.north)--([xshift=-0.1em,yshift=-0.1em]num4_5.south);
-\draw [->, thick, color = blue!60](num1_5.north)--([xshift=-0.1em,yshift=-0.1em]num2_6.south);
-\draw [->, thick, color = blue!60](num2_5.north)--([xshift=-0.1em,yshift=-0.1em]num3_6.south);
+\draw [->, thick, color = blue!80](num1_4.north)--([xshift=-0.1em,yshift=-0.1em]num2_5.south);
+\draw [->, thick, color = blue!80](num2_4.north)--([xshift=-0.1em,yshift=-0.1em]num3_5.south);
+\draw [->, thick, color = blue!80](num3_4.north)--([xshift=-0.1em,yshift=-0.1em]num4_5.south);
+\draw [->, thick, color = blue!80](num1_5.north)--([xshift=-0.1em,yshift=-0.1em]num2_6.south);
+\draw [->, thick, color = blue!80](num2_5.north)--([xshift=-0.1em,yshift=-0.1em]num3_6.south);
 \draw [->, thick](num3_5.north)--([xshift=-0.1em,yshift=-0.1em]num4_6.south);
-\draw [->, thick, color = blue!60](num1_6.north)--([xshift=-0.1em,yshift=-0.1em]num2_7.south);
+\draw [->, thick, color = blue!80](num1_6.north)--([xshift=-0.1em,yshift=-0.1em]num2_7.south);
 \draw [->, thick](num2_6.north)--([xshift=-0.1em,yshift=-0.1em]num3_7.south);
 \draw [->, thick](num3_6.north)--([xshift=-0.1em,yshift=-0.1em]num4_7.south);
 \draw [->, thick](num1_7.north)--([xshift=-0.1em,yshift=-0.1em]num2_8.south);
@@ -86,19 +86,19 @@
 \draw [->, thick](num1_3.north)--([xshift=0.1em,yshift=-0.1em]num2_2.south);
 \draw [->, thick](num2_3.north)--([xshift=0.1em,yshift=-0.1em]num3_2.south);
 \draw [->, thick](num3_3.north)--([xshift=0.1em,yshift=-0.1em]num4_2.south);
-\draw [->, thick, color = blue!60](num1_4.north)--([xshift=0.1em,yshift=-0.1em]num2_3.south);
+\draw [->, thick, color = blue!80](num1_4.north)--([xshift=0.1em,yshift=-0.1em]num2_3.south);
 \draw [->, thick](num2_4.north)--([xshift=0.1em,yshift=-0.1em]num3_3.south);
 \draw [->, thick](num3_4.north)--([xshift=0.1em,yshift=-0.1em]num4_3.south);
-\draw [->, thick, color = blue!60](num1_5.north)--([xshift=0.1em,yshift=-0.1em]num2_4.south);
-\draw [->, thick, color = blue!60](num2_5.north)--([xshift=0.1em,yshift=-0.1em]num3_4.south);
+\draw [->, thick, color = blue!80](num1_5.north)--([xshift=0.1em,yshift=-0.1em]num2_4.south);
+\draw [->, thick, color = blue!80](num2_5.north)--([xshift=0.1em,yshift=-0.1em]num3_4.south);
 \draw [->, thick](num3_5.north)--([xshift=0.1em,yshift=-0.1em]num4_4.south);
-\draw [->, thick, color = blue!60](num1_6.north)--([xshift=0.1em,yshift=-0.1em]num2_5.south);
-\draw [->, thick, color = blue!60](num2_6.north)--([xshift=0.1em,yshift=-0.1em]num3_5.south);
-\draw [->, thick, color = blue!60](num3_6.north)--([xshift=0.1em,yshift=-0.1em]num4_5.south);
-\draw [->, thick, color = blue!60](num1_7.north)--([xshift=0.1em,yshift=-0.1em]num2_6.south);
-\draw [->, thick, color = blue!60](num2_7.north)--([xshift=0.1em,yshift=-0.1em]num3_6.south);
+\draw [->, thick, color = blue!80](num1_6.north)--([xshift=0.1em,yshift=-0.1em]num2_5.south);
+\draw [->, thick, color = blue!80](num2_6.north)--([xshift=0.1em,yshift=-0.1em]num3_5.south);
+\draw [->, thick, color = blue!80](num3_6.north)--([xshift=0.1em,yshift=-0.1em]num4_5.south);
+\draw [->, thick, color = blue!80](num1_7.north)--([xshift=0.1em,yshift=-0.1em]num2_6.south);
+\draw [->, thick, color = blue!80](num2_7.north)--([xshift=0.1em,yshift=-0.1em]num3_6.south);
 \draw [->, thick](num3_7.north)--([xshift=0.1em,yshift=-0.1em]num4_6.south);
-\draw [->, thick, color = blue!60](num1_8.north)--([xshift=0.1em,yshift=-0.1em]num2_7.south);
+\draw [->, thick, color = blue!80](num1_8.north)--([xshift=0.1em,yshift=-0.1em]num2_7.south);
 \draw [->, thick](num2_8.north)--([xshift=0.1em,yshift=-0.1em]num3_7.south);
 \draw [->, thick](num3_8.north)--([xshift=0.1em,yshift=-0.1em]num4_7.south);
 \draw [->, thick](num1_9.north)--([xshift=0.1em,yshift=-0.1em]num2_8.south);

--- a/Chapter11/Figures/figure-use-cnn-in-nmt.tex
+++ b/Chapter11/Figures/figure-use-cnn-in-nmt.tex
@@ -5,41 +5,41 @@
 \begin{scope}
 	%\tikzstyle{every node}=[scale=0.8]
 	\tikzstyle{line} = [dash pattern=on 2pt off 1pt,line width=0.5pt]
-	\tikzstyle{cir} = [thin,fill=blue!8,draw,circle,minimum size =0.5em,drop shadow={shadow xshift=0.15em, shadow yshift=-0.1em}]
+	\tikzstyle{cir} = [thin,fill=blue!15,draw,circle,minimum size =0.5em,drop shadow={shadow xshift=0.15em, shadow yshift=-0.1em}]
 	\tikzstyle{word} = [inner sep=0pt, font=\scriptsize,minimum height=\bcc]
 	
-	\draw[fill=red!8,line width=0.2pt] (0cm,0cm+1*\bcc) rectangle (0cm+4*\bcc,0cm+7*\bcc);
-	\draw[fill=cyan!14,line width=0.2pt] (0cm,0cm) rectangle (0cm+4*\bcc,0cm+1*\bcc);
-	\draw[fill=cyan!14,line width=0.2pt] (0cm,0cm+7*\bcc) rectangle (0cm+4*\bcc,0cm+8*\bcc);
+	\draw[fill=red!15,line width=0.2pt] (0cm,0cm+1*\bcc) rectangle (0cm+4*\bcc,0cm+7*\bcc);
+	\draw[fill=cyan!20,line width=0.2pt] (0cm,0cm) rectangle (0cm+4*\bcc,0cm+1*\bcc);
+	\draw[fill=cyan!20,line width=0.2pt] (0cm,0cm+7*\bcc) rectangle (0cm+4*\bcc,0cm+8*\bcc);
 	\draw[step=\bcc] (0cm,0cm) grid (0cm+4*\bcc,0cm+8*\bcc); 
 	%\draw[line width=0.7pt] (0cm,0cm) rectangle (0cm+4*\bcc,0cm+8*\bcc);
-	\draw[red!50,line width=1.8pt] (0cm,0cm+5*\bcc) rectangle (0cm+4*\bcc,0cm+8*\bcc);
-	\draw[ugreen!50,line width=1.8pt] (0cm,0cm+1*\bcc) rectangle (0cm+4*\bcc,0cm+4*\bcc);
+	\draw[red!70,line width=1.8pt] (0cm,0cm+5*\bcc) rectangle (0cm+4*\bcc,0cm+8*\bcc);
+	\draw[ugreen!70,line width=1.8pt] (0cm,0cm+1*\bcc) rectangle (0cm+4*\bcc,0cm+4*\bcc);

 	
-	\draw[fill=blue!8,xshift=5.0cm,yshift=1.0cm,line width=0.2pt] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+6*\bcc);
+	\draw[fill=blue!15,xshift=5.0cm,yshift=1.0cm,line width=0.2pt] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+6*\bcc);
 	\draw[step=\bcc,xshift=5.0cm,yshift=1.0cm] (0cm,0cm) grid (0cm+1*\bcc,0cm+6*\bcc);
 	%\draw[line width=0.7pt,xshift=5.0cm,yshift=1.0cm] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+6*\bcc);
 	
-	\draw[fill=blue!8,xshift=5.2cm,yshift=0.8cm,line width=0.2pt] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+6*\bcc);
+	\draw[fill=blue!15,xshift=5.2cm,yshift=0.8cm,line width=0.2pt] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+6*\bcc);
 	\draw[step=\bcc,xshift=5.2cm,yshift=0.8cm] (0cm,0cm) grid (0cm+1*\bcc,0cm+6*\bcc); 
 	%\draw[line width=0.7pt,xshift=5.2cm,yshift=0.8cm] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+6*\bcc);
-	\draw[ugreen!50,line width=2pt,xshift=5.2cm,yshift=0.8cm] (0cm,0cm+1*\bcc) rectangle (0cm+1*\bcc,0cm+2*\bcc);
+	\draw[ugreen!70,line width=2pt,xshift=5.2cm,yshift=0.8cm] (0cm,0cm+1*\bcc) rectangle (0cm+1*\bcc,0cm+2*\bcc);
 	
-	\draw[fill=blue!8,xshift=5.4cm,yshift=0.6cm,line width=0.2pt] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+6*\bcc);
+	\draw[fill=blue!15,xshift=5.4cm,yshift=0.6cm,line width=0.2pt] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+6*\bcc);
 	\draw[step=\bcc,xshift=5.4cm,yshift=0.6cm] (0cm,0cm) grid (0cm+1*\bcc,0cm+6*\bcc);
 	%\draw[line width=0.7pt,xshift=5.4cm,yshift=0.6cm] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+6*\bcc); 
 	
-	\draw[fill=blue!8,xshift=5.6cm,yshift=0.4cm,line width=0.2pt] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+6*\bcc);
+	\draw[fill=blue!15,xshift=5.6cm,yshift=0.4cm,line width=0.2pt] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+6*\bcc);
 	\draw[step=\bcc,xshift=5.6cm,yshift=0.4cm] (0cm,0cm) grid (0cm+1*\bcc,0cm+6*\bcc); 
 	%\draw[line width=0.7pt,xshift=5.6cm,yshift=0.4cm] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+6*\bcc);
-	\draw[red!50,line width=2pt,xshift=5.6cm,yshift=0.4cm] (0cm,0cm+5*\bcc) rectangle (0cm+1*\bcc,0cm+6*\bcc);
+	\draw[red!70,line width=2pt,xshift=5.6cm,yshift=0.4cm] (0cm,0cm+5*\bcc) rectangle (0cm+1*\bcc,0cm+6*\bcc);

-	\draw[red!50,line width=0.5pt] (0cm+4*\bcc,0cm+8*\bcc) -- ([xshift=5.6cm,yshift=0.4cm]0cm,0cm+6*\bcc);
-	\draw[red!50,line width=0.5pt] (0cm+4*\bcc,0cm+5*\bcc) -- ([xshift=5.6cm,yshift=0.4cm]0cm,0cm+5*\bcc);
+	\draw[red!70,line width=0.5pt] (0cm+4*\bcc,0cm+8*\bcc) -- ([xshift=5.6cm,yshift=0.4cm]0cm,0cm+6*\bcc);
+	\draw[red!70,line width=0.5pt] (0cm+4*\bcc,0cm+5*\bcc) -- ([xshift=5.6cm,yshift=0.4cm]0cm,0cm+5*\bcc);

-	\draw[ugreen!50,line width=0.5pt] (0cm+4*\bcc,0cm+4*\bcc) -- ([xshift=5.2cm,yshift=0.8cm]0cm,0cm+2*\bcc);
-	\draw[ugreen!50,line width=0.5pt] (0cm+4*\bcc,0cm+1*\bcc) -- ([xshift=5.2cm,yshift=0.8cm]0cm,0cm+1*\bcc);
+	\draw[ugreen!70,line width=0.5pt] (0cm+4*\bcc,0cm+4*\bcc) -- ([xshift=5.2cm,yshift=0.8cm]0cm,0cm+2*\bcc);
+	\draw[ugreen!70,line width=0.5pt] (0cm+4*\bcc,0cm+1*\bcc) -- ([xshift=5.2cm,yshift=0.8cm]0cm,0cm+1*\bcc);

 	\node[word] (w1) at (-0.5cm, 3.0cm) {$<$p$>$};
 	\node[word] (w2) at ([yshift=-\bcc]w1) {今天};

--- a/Chapter11/Figures/figure-use-cnn-in-sentence-classification.tex
+++ b/Chapter11/Figures/figure-use-cnn-in-sentence-classification.tex
@@ -5,7 +5,7 @@
 \begin{scope}
 	%\tikzstyle{every node}=[scale=0.8]
 	\tikzstyle{line} = [dash pattern=on 2pt off 1pt,line width=0.6pt]
-	\tikzstyle{cir} = [thin,fill=blue!8,draw,circle,minimum size =0.5em,drop shadow={shadow xshift=0.15em, shadow yshift=-0.1em}]
+	\tikzstyle{cir} = [thin,fill=blue!15,draw,circle,minimum size =0.5em,drop shadow={shadow xshift=0.15em, shadow yshift=-0.1em}]
 	\tikzstyle{word} = [inner sep=0pt, font=\footnotesize,minimum height=\bcc]
 	
 	%\draw[fill=blue!8,xshift=0.3cm,yshift=0.5cm,line width=0.6pt] (0cm,0cm) rectangle (0cm+6*\bcc,0cm+9*\bcc);
@@ -13,30 +13,30 @@
 	%\draw[red!60,line width=2pt,xshift=0.3cm,yshift=0.5cm] (0cm,0cm+2*\bcc) rectangle (0cm+6*\bcc,0cm+4*\bcc);
 	
 	% 输入矩阵
-	\draw[thick,fill=blue!8,line width=0.6pt] (0cm,0cm) rectangle (0cm+6*\bcc,0cm+9*\bcc);
+	\draw[thick,fill=blue!15,line width=0.6pt] (0cm,0cm) rectangle (0cm+6*\bcc,0cm+9*\bcc);
 	\draw[step=\bcc,gray] (0cm,0cm) grid (0cm+6*\bcc,0cm+9*\bcc); 
-	\draw[red!60,line width=2pt] (0cm,0cm) rectangle (0cm+6*\bcc,0cm+2*\bcc);
-	\draw[ugreen!60,line width=2pt] (0cm,0cm+3*\bcc) rectangle (0cm+6*\bcc,0cm+6*\bcc);
-	\draw[red!60,line width=2pt] (0cm,0cm+7*\bcc) rectangle (0cm+6*\bcc,0cm+9*\bcc);
+	\draw[red!80,line width=2pt] (0cm,0cm) rectangle (0cm+6*\bcc,0cm+2*\bcc);
+	\draw[ugreen!80,line width=2pt] (0cm,0cm+3*\bcc) rectangle (0cm+6*\bcc,0cm+6*\bcc);
+	\draw[red!80,line width=2pt] (0cm,0cm+7*\bcc) rectangle (0cm+6*\bcc,0cm+9*\bcc);

 	% 特征图
-	\draw[fill=blue!8,xshift=5.0cm,yshift=1.3cm,line width=0.6pt] (0cm,0cm-1*\bcc) rectangle (0cm+1*\bcc,0cm+6*\bcc);
+	\draw[fill=blue!15,xshift=5.0cm,yshift=1.3cm,line width=0.6pt] (0cm,0cm-1*\bcc) rectangle (0cm+1*\bcc,0cm+6*\bcc);
 	\draw[step=\bcc,gray,xshift=5.0cm,yshift=1.3cm] (0cm,0cm-1*\bcc) grid (0cm+1*\bcc,0cm+6*\bcc);
-	\draw[ugreen!60,line width=2pt,xshift=5.0cm,yshift=1.3cm] (0cm,0cm+2*\bcc) rectangle (0cm+1*\bcc,0cm+3*\bcc);
+	\draw[ugreen!80,line width=2pt,xshift=5.0cm,yshift=1.3cm] (0cm,0cm+2*\bcc) rectangle (0cm+1*\bcc,0cm+3*\bcc);
 	
 	%最大池化
-	\draw [gray,fill=blue!8,line width=0.6pt](8cm,2.2cm) -- (8.4cm, 2.2cm) -- (8.7cm,1.4cm) -- (8.3cm, 1.4cm) -- (8cm,2.2cm);
+	\draw [gray,fill=blue!15,line width=0.6pt](8cm,2.2cm) -- (8.4cm, 2.2cm) -- (8.7cm,1.4cm) -- (8.3cm, 1.4cm) -- (8cm,2.2cm);
 	\draw [gray](8.15cm,1.8cm) -- (8.55cm,1.8cm);
 	%\draw [gray](8.3cm,1.8cm) -- (8.7cm,1.8cm);
 	%\draw [gray](8.45cm,1.4cm) -- (8.85cm,1.4cm);
 	
 	%全连接层
-	\draw [gray,fill=blue!8,line width=0.6pt](11cm,2.2cm) -- (11.4cm, 2.2cm) -- (11.7cm,1.8cm) -- (11.3cm, 1.8cm) -- (11cm,2.2cm);
+	\draw [gray,fill=blue!15,line width=0.6pt](11cm,2.2cm) -- (11.4cm, 2.2cm) -- (11.7cm,1.8cm) -- (11.3cm, 1.8cm) -- (11cm,2.2cm);
 	%\draw [gray](11.15cm,1.8cm) -- (11.55cm,1.8cm);
 	
 	%最大池化
-	\draw[ugreen!60,line] ([xshift=5.0cm,yshift=1.3cm]0cm+1*\bcc,0cm+6*\bcc) -- (8cm,2.2cm);
-	\draw[ugreen!60,line] ([xshift=5.0cm,yshift=1.3cm]0cm+1*\bcc,0cm-1*\bcc) -- (8.15cm,1.8cm);
+	\draw[ugreen!80,line] ([xshift=5.0cm,yshift=1.3cm]0cm+1*\bcc,0cm+6*\bcc) -- (8cm,2.2cm);
+	\draw[ugreen!80,line] ([xshift=5.0cm,yshift=1.3cm]0cm+1*\bcc,0cm-1*\bcc) -- (8.15cm,1.8cm);

 	%特征图
 	%\draw[fill=blue!8,xshift=5.2cm,yshift=1.0cm,line width=0.6pt] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+6*\bcc);
@@ -45,25 +45,25 @@
 	%\draw[fill=blue!8,xshift=5.4cm,yshift=0.3cm,line width=0.6pt] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+7*\bcc);
 	%\draw[step=\bcc,gray,xshift=5.4cm,yshift=0.3cm] (0cm,0cm) grid (0cm+1*\bcc,0cm+7*\bcc);
 	
-	\draw[fill=blue!8,xshift=5.6cm,yshift=0cm,line width=0.6pt] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+8*\bcc);
+	\draw[fill=blue!15,xshift=5.6cm,yshift=0cm,line width=0.6pt] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+8*\bcc);
 	\draw[step=\bcc,gray,xshift=5.6cm,yshift=0cm] (0cm,0cm) grid (0cm+1*\bcc,0cm+8*\bcc); 
-	\draw[red!60,line width=2pt,xshift=5.6cm,yshift=0cm] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+1*\bcc);
-	\draw[red!60,line width=2pt,xshift=5.6cm,yshift=0cm] (0cm,0cm+7*\bcc) rectangle (0cm+1*\bcc,0cm+8*\bcc);
+	\draw[red!80,line width=2pt,xshift=5.6cm,yshift=0cm] (0cm,0cm) rectangle (0cm+1*\bcc,0cm+1*\bcc);
+	\draw[red!80,line width=2pt,xshift=5.6cm,yshift=0cm] (0cm,0cm+7*\bcc) rectangle (0cm+1*\bcc,0cm+8*\bcc);
 	
 	% 全连接线
 	\draw[line] (8.4cm, 2.2cm) -- (11.2cm,2.2cm);
 	\draw[line] (8.7cm,1.4cm) -- (11.3cm, 1.8cm);
 	%全连接上面的红虚线
-	\draw[red!60,line] ([xshift=5.6cm,yshift=0cm]0cm+1*\bcc,0cm+7*\bcc) -- (8.15cm,1.8cm);
-	\draw[red!60,line] ([xshift=5.6cm,yshift=0cm]0cm+1*\bcc,0cm) -- (8.3cm, 1.4cm);
+	\draw[red!80,line] ([xshift=5.6cm,yshift=0cm]0cm+1*\bcc,0cm+7*\bcc) -- (8.15cm,1.8cm);
+	\draw[red!80,line] ([xshift=5.6cm,yshift=0cm]0cm+1*\bcc,0cm) -- (8.3cm, 1.4cm);
 	
 	% 特征图红色虚线
-	\draw[red!60,line] (0cm+6*\bcc,0cm+9*\bcc) -- ([xshift=5.6cm,yshift=0cm]0cm,0cm+8*\bcc);
-	\draw[red!60,line] (0cm+6*\bcc,0cm+7*\bcc) -- ([xshift=5.6cm,yshift=0cm]0cm,0cm+7*\bcc);
-	\draw[red!60,line] (0cm+6*\bcc,0cm+2*\bcc) -- ([xshift=5.6cm,yshift=0cm]0cm,0cm+1*\bcc);
-	\draw[red!60,line] (0cm+6*\bcc,0cm) -- ([xshift=5.6cm,yshift=0cm]0cm,0cm);
-	\draw[ugreen!60,line] (0cm+6*\bcc,0cm+6*\bcc) -- ([xshift=5.0cm,yshift=1.3cm]0cm,0cm+3*\bcc);
-	\draw[ugreen!60,line] (0cm+6*\bcc,0cm+3*\bcc) -- ([xshift=5.0cm,yshift=1.3cm]0cm,0cm+2*\bcc);
+	\draw[red!80,line] (0cm+6*\bcc,0cm+9*\bcc) -- ([xshift=5.6cm,yshift=0cm]0cm,0cm+8*\bcc);
+	\draw[red!80,line] (0cm+6*\bcc,0cm+7*\bcc) -- ([xshift=5.6cm,yshift=0cm]0cm,0cm+7*\bcc);
+	\draw[red!80,line] (0cm+6*\bcc,0cm+2*\bcc) -- ([xshift=5.6cm,yshift=0cm]0cm,0cm+1*\bcc);
+	\draw[red!80,line] (0cm+6*\bcc,0cm) -- ([xshift=5.6cm,yshift=0cm]0cm,0cm);
+	\draw[ugreen!80,line] (0cm+6*\bcc,0cm+6*\bcc) -- ([xshift=5.0cm,yshift=1.3cm]0cm,0cm+3*\bcc);
+	\draw[ugreen!80,line] (0cm+6*\bcc,0cm+3*\bcc) -- ([xshift=5.0cm,yshift=1.3cm]0cm,0cm+2*\bcc);
 	%\draw[red!60,line] ([xshift=0.3cm,yshift=0.5cm]0cm+6*\bcc,0cm+4*\bcc) -- ([xshift=5.6cm,yshift=0cm]0cm,0cm+3*\bcc);
 	%\draw[red!60,line] ([xshift=0.3cm,yshift=0.5cm]0cm+6*\bcc,0cm+2*\bcc) -- ([xshift=5.6cm,yshift=0cm]0cm,0cm+2*\bcc);
 	

--- a/Chapter11/chapter11.tex
+++ b/Chapter11/chapter11.tex
@@ -131,7 +131,7 @@
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------
-
+\vspace{-2em}
 \subsection{步长与填充}

 \parinterval 在卷积操作中，步长是指卷积核每次滑动的距离，和卷积核的大小共同决定了卷积输出的大小，如图\ref{fig:11-6}所示。步长越大，对输入数据的压缩程度越高，其输出的维度越小；反之步长越小，对输入数据的压缩程度越低，同时输出的尺寸和输入越接近。比如使用一个$3 \times 3 \times 1$的卷积核在$6 \times 6 \times 1$的图像上进行卷积，如设置步长为1，其对应的输出大小就为$4 \times 4 \times 1$。这种做法最为简单，但是会导致两个问题；一是在输入数据中，由于边缘区域的像素只会被计算一次，相比于中心区域来说，这些像素被考虑的次数会更少一些，导致图像边缘信息的丢失；二是在经历多次卷积之后，其输出特征的维度会不断减小，影响模型的泛化能力。
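Reading aid for the stride and padding paragraph in the hunk above (not part of the commit): the $4 \times 4 \times 1$ output it quotes follows from the standard convolution size formula. This is a minimal sketch that assumes a square input and symmetric zero padding. The symbols $n$, $k$, $s$ and $p$ are introduced here for illustration and do not appear in the chapter text shown.

```latex
% n: input size, k: kernel size, s: stride, p: zero padding added on each side
% (all four symbols are assumptions of this note, not the chapter's notation)
\begin{eqnarray}
n_{\textrm{out}} &=& \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1 \nonumber \\
n=6,\ k=3,\ s=1,\ p=0: \quad n_{\textrm{out}} &=& \left\lfloor \frac{6 + 0 - 3}{1} \right\rfloor + 1 = 4 \nonumber \\
n=6,\ k=3,\ s=2,\ p=0: \quad n_{\textrm{out}} &=& \left\lfloor \frac{6 + 0 - 3}{2} \right\rfloor + 1 = 2 \nonumber \\
n=6,\ k=3,\ s=1,\ p=1: \quad n_{\textrm{out}} &=& \left\lfloor \frac{6 + 2 - 3}{1} \right\rfloor + 1 = 6 \nonumber
\end{eqnarray}
```

The second row shows the compression effect of a larger stride. The last row shows that one padding position per side keeps the output the same size as the input, which addresses the shrinking-output problem the paragraph describes.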
@@ -161,7 +161,7 @@
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------
-
+\vspace{-2em}
 \subsection{池化}

 \parinterval 在图\ref{fig:11-2}所示的网络结构中，卷积层输出会通过一个非线性的激活函数，之后会通过{\small\bfnew{池化层}}\index{池化层}（也称为汇聚层）。池化过程和卷积类似，都是根据设定的窗口进行滑动选取局部信息进行计算，不同的是，池化层的计算是无参数化的，不需要额外的权重矩阵。常见的池化操作有{\small\bfnew{最大池化}}\index{最大池化}（Max Pooling）\index{Max Pooling}和{\small\bfnew{平均池化}}\index{平均池化}（Average Pooling）\index{Average Pooling}。前者获取窗口内最大的值，后者则获取窗口内矩阵的平均值。图\ref{fig:11-8}展示了窗口大小为$2 \times 2$、步长为2的两种池化方法的计算过程。
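Reading aid for the pooling paragraph in the hunk above (not part of the commit): a worked example of max pooling and average pooling with a $2 \times 2$ window and stride 2. The $4 \times 4$ input matrix is made up for illustration and is not the one drawn in the book's figure. The sketch assumes `amsmath` is loaded for `pmatrix`.

```latex
% Hypothetical 4x4 input; each 2x2 block is reduced to a single value
\begin{eqnarray}
\mathbi{X} &=& \begin{pmatrix} 1 & 3 & 2 & 0 \\ 4 & 2 & 1 & 5 \\ 0 & 1 & 3 & 2 \\ 2 & 6 & 1 & 4 \end{pmatrix} \nonumber \\
\textrm{MaxPool}(\mathbi{X}) &=& \begin{pmatrix} 4 & 5 \\ 6 & 4 \end{pmatrix} \nonumber \\
\textrm{AvgPool}(\mathbi{X}) &=& \begin{pmatrix} 2.5 & 2 \\ 2.25 & 2.5 \end{pmatrix} \nonumber
\end{eqnarray}
```

For instance, the top-left block contains 1, 3, 4 and 2, so its maximum is 4 and its mean is $10/4 = 2.5$. Neither operation has trainable weights, which matches the paragraph's point that pooling is parameter-free.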
@@ -200,7 +200,7 @@
 \label{fig:11-9}
 \end{figure}
 %----------------------------------------------
-
+\vspace{-1em}
 \parinterval 针对不定长序列，一种可行的方法是使用之前介绍过的循环神经网络进行信息提取，其本质也是基于权重共享的想法，在不同的时间步复用相同的循环神经网络单元进行处理。但是，循环神经网络最大的弊端在于每一时刻的计算都依赖于上一时刻的结果，因此只能对序列进行串行处理，无法充分利用硬件设备进行并行计算，导致效率相对较低。此外，在处理较长的序列时，这种串行的方式很难捕捉长距离的依赖关系。相比之下，卷积神经网络采用共享参数的方式处理固定大小窗口内的信息，且不同位置的卷积操作之间没有相互依赖，因此可以对序列进行高效地并行处理。同时，针对序列中距离较长的依赖关系，可以通过堆叠多层卷积层来扩大{\small\bfnew{感受野}}\index{感受野} (Receptive Field)\index{Receptive Field}  ，这里感受野指能够影响神经元输出的原始输入数据区域的大小。图\ref{fig:11-9}对比了这两种结构，可以看出，为了捕捉$\mathbi{e}_2$ 和$\mathbi{e}_8$ 之间的联系，串行结构需要顺序地进行6次操作，和序列长度相关。而该卷积神经网络中，卷积操作每次对三个词进行计算，仅需要4层卷积计算就能得到$\mathbi{e}_2$ 和$\mathbi{e}_8$之间的联系，其操作数和卷积核的大小相关，相比于串行的方式具有更短的路径和更少的非线性计算，更容易进行训练。因此，也有许多研究人员在许多自然语言处理任务上尝试使用卷积神经网络进行序列建模\upcite{Kim2014ConvolutionalNN,Santos2014DeepCN,Kalchbrenner2014ACN,DBLP:conf/naacl/Johnson015,DBLP:conf/naacl/NguyenG15}。

 \parinterval 区别于传统图像上的卷积操作，在面向序列的卷积操作中，卷积核只在序列这一维度进行移动，用来捕捉连续的多个词之间的特征。需要注意的是，由于单词通常由一个实数向量表示（词嵌入），因此可以将词嵌入的维度看作是卷积操作中的通道数。图\ref{fig:11-10}就是一个基于序列卷积的文本分类模型，模型的输入是维度大小为$m\times O $的句子表示，$m$表示句子长度，$O$表示卷积核通道数，其值等于词嵌入维度，模型使用多个不同（对应图中不同的颜色）的卷积核来对序列进行特征提取，得到了多个不同的特征序列。然后使用池化层降低表示维度，得到了一组和序列长度无关的特征表示。最后模型基于这组压缩过的特征表示，使用全连接网络和Softmax函数进行类别预测。在这过程中卷积层和池化层分别起到了特征提取和特征压缩的作用，将一个不定长的序列转化为一组固定大小的特征表示。
@@ -214,7 +214,7 @@
 \label{fig:11-10}
 \end{figure}
 %----------------------------------------------
-
+\vspace{-1em}
 \parinterval 和其它自然语言处理任务不同的是，机器翻译中需要对序列进行全局表示，换句话说，模型需要捕捉序列中各个位置之间的关系。因此，基于卷积神经网络的神经机器翻译模型需要堆叠多个卷积层进行远距离的依赖关系的建模。同时，为了在多层网络中维持序列的原有长度，需要在卷积操作前对输入序列进行填充。图\ref{fig:11-11}是一个简单的示例，针对一个长度$m=6$的句子，其隐层表示维度即卷积操作的输入通道数是$O=4$，卷积核大小为$K=3$。首先对序列进行填充，得到一个长度为8的序列，然后使用这些卷积核在这之上进行特征提取。一共使用了$N=4$个卷积核，整体的参数量为$K \times O \times N$，最后的卷积结果为$m \times N$的序列表示。

 %----------------------------------------------
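Reading aid for the padding example in the hunk above (not part of the commit): the arithmetic behind the numbers quoted in the paragraph, using its own values $m=6$, $O=4$, $K=3$ and $N=4$. The sketch assumes a stride of 1 and counts only kernel weights (no bias terms), in line with the $K \times O \times N$ expression in the text. The names $m_{\textrm{pad}}$ and $m_{\textrm{out}}$ are introduced here for illustration.

```latex
% m: sentence length, O: input channels, K: kernel size, N: number of kernels
% m_pad and m_out are names used only in this note
\begin{eqnarray}
m_{\textrm{pad}} &=& m + (K - 1) = 6 + 2 = 8 \nonumber \\
m_{\textrm{out}} &=& m_{\textrm{pad}} - K + 1 = 8 - 3 + 1 = 6 = m \nonumber \\
\textrm{parameters} &=& K \times O \times N = 3 \times 4 \times 4 = 48 \nonumber \\
\textrm{output size} &=& m \times N = 6 \times 4 \nonumber
\end{eqnarray}
```

Adding $K-1$ padding positions in total is what keeps the output length equal to the input length, so several such layers can be stacked without shortening the sequence.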
@@ -235,6 +235,16 @@

 \parinterval 正如之前所讲，卷积神经网络可以用于序列建模，同时具有并行性高和易于学习的特点，一个很自然的想法就是将其用作神经机器翻译模型中的特征提取器。因此，在神经机器翻译被提出之初，研究人员就已经开始利用卷积神经网络对句子进行特征提取。比较经典的模型是使用卷积神经网络作为源语言句子的编码器，使用循环神经网络作为目标语言译文生成的解码器\upcite{kalchbrenner-blunsom-2013-recurrent,Gehring2017ACE}。之后也有研究人员提出完全基于卷积神经网络的翻译模型（ConvS2S）\upcite{DBLP:journals/corr/GehringAGYD17}，或者针对卷积层进行改进，提出效率更高、性能更好的模型\upcite{Kaiser2018DepthwiseSC,Wu2019PayLA}。本节将基于ConvS2S模型，阐述如何使用卷积神经网络搭建端到端神经机器翻译模型。

+%----------------------------------------------
+% 图12.
+\begin{figure}[htp]
+\centering
+\input{./Chapter11/Figures/figure-fairseq-0}
+\caption{ConvS2S模型结构}
+\label{fig:11-12}
+\end{figure}
+%----------------------------------------------
+
 \parinterval ConvS2S模型是一种高并行的序列到序列的神经计算模型。该模型利用卷积神经网络分别对源语言端与目标语言端的序列进行特征提取，并使用注意力机制来捕获两个序列之间映射关系。相比于基于多层循环神经网络的GNMT模型\upcite{Wu2016GooglesNM}，其主要优势在于每一层的网络计算是完全并行化的，避免了循环神经网络中计算顺序对时序的依赖。同时，利用多层卷积神经网络的层级结构可以有效地捕捉序列不同位置之间的依赖。即使是远距离依赖，也可以通过若干层卷积单元进行有效的捕捉，而且其信息传递的路径相比循环神经网络更短。除此之外，模型同时使用门控线性单元、残差网络和位置编码等技术来进一步提升模型性能，达到了和GNMT模型相媲美的翻译性能，同时大大缩短了训练时间。

 \parinterval 图\ref{fig:11-12}为ConvS2S模型的结构示意图，其内部由若干不同的模块组成，包括：
@@ -249,16 +259,6 @@
 \item {\small\bfnew{多跳注意力机制}}\index{多跳注意力机制}（Multi-step Attention/Multi-hop Attention）\index{Multi-step Attention}\index{Multi-hop Attention}：蓝色框内部展示了基于多跳结构的注意力机制模块\upcite{Sukhbaatar2015EndToEndMN}。ConvS2S模型同样使用注意力机制来捕捉两个序列之间不同位置的对应关系。区别于之前的做法，多跳注意力在解码器端每一个层都会执行注意力操作。下面将以此模型为例对基于卷积神经网络的机器翻译模型进行介绍。
 \end{itemize}

-%----------------------------------------------
-% 图12.
-\begin{figure}[htp]
-\centering
-\input{./Chapter11/Figures/figure-fairseq-0}
-\caption{ConvS2S模型结构}
-\label{fig:11-12}
-\end{figure}
-%----------------------------------------------
-
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------
@@ -353,7 +353,6 @@
 \label{eq:11-7}
 \end{eqnarray}

-
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------
@@ -362,6 +361,15 @@

 \parinterval ConvS2S模型也采用了注意力机制来获取每个目标语言位置相应的源语言上下文信息。其仍然沿用传统的点乘注意力机制\upcite{DBLP:journals/corr/LuongPM15}，其中图\ref{fig:11-16}蓝色框代表了多跳自注意力机制在模型中的位置。

+\parinterval 在基于循环神经网络的翻译模型中，注意力机制已经被广泛使用\upcite{bahdanau2014neural}，并用于避免循环神经网络将源语言序列压缩成一个固定维度的向量表示带来的信息损失。另一方面，注意力同样能够帮助解码器区分源语言中不同位置对当前目标语言位置的贡献度，其具体的计算过程如公式\eqref{eq:11-8}和\eqref{eq:11-9}所示：
+
+\begin{eqnarray}
+\mathbi{C}_j &=& \sum_i \alpha_{i,j} \mathbi{h}_i \label{eq:11-8} \\
+\alpha_{i,j} &=& \frac{ \textrm{exp}(\funp{a} (\mathbi{s}_{j-1},\mathbi{h}_i))  }{\sum_{i'} \textrm{exp}( \funp{a} (\mathbi{s}_{j-1},\mathbi{h}_{i'}))} \label{eq:11-9}
+\end{eqnarray}
+
+\noindent 其中，$\mathbi{h}_i$表示源语言端第$i$个位置的隐层状态，即编码器在第$i$个位置的输出。$\mathbi{s}_j$表示目标端第$j$个位置的隐层状态。给定$\mathbi{s}_j$和$\mathbi{h}_i$，注意力机制通过函数$\funp{a}(\cdot)$计算目标语言表示$\mathbi{s}_j$与源语言表示$\mathbi{h}_i$之间的注意力权重$\alpha_{i,j}$，通过加权平均得到当前目标语言端位置所需的上下文表示$\mathbi{C}_j$。其中$\funp{a}(\cdot)$的具体计算方式在{\chapterten}已经详细讨论。
+
 %----------------------------------------------
 % 图16.
 \begin{figure}[htp]
@@ -372,15 +380,6 @@
 \end{figure}
 %----------------------------------------------

-\parinterval 在基于循环神经网络的翻译模型中，注意力机制已经被广泛使用\upcite{bahdanau2014neural}，并用于避免循环神经网络将源语言序列压缩成一个固定维度的向量表示带来的信息损失。另一方面，注意力同样能够帮助解码器区分源语言中不同位置对当前目标语言位置的贡献度，其具体的计算过程如公式\eqref{eq:11-8}和\eqref{eq:11-9}所示：
-
-\begin{eqnarray}
-\mathbi{C}_j &=& \sum_i \alpha_{i,j} \mathbi{h}_i \label{eq:11-8} \\
-\alpha_{i,j} &=& \frac{ \textrm{exp}(\funp{a} (\mathbi{s}_{j-1},\mathbi{h}_i))  }{\sum_{i'} \textrm{exp}( \funp{a} (\mathbi{s}_{j-1},\mathbi{h}_{i'}))} \label{eq:11-9}
-\end{eqnarray}
-
-\noindent 其中，$\mathbi{h}_i$表示源语言端第$i$个位置的隐层状态，即编码器在第$i$个位置的输出。$\mathbi{s}_j$表示目标端第$j$个位置的隐层状态。给定$\mathbi{s}_j$和$\mathbi{h}_i$，注意力机制通过函数$\funp{a}(\cdot)$计算目标语言表示$\mathbi{s}_j$与源语言表示$\mathbi{h}_i$之间的注意力权重$\alpha_{i,j}$，通过加权平均得到当前目标语言端位置所需的上下文表示$\mathbi{C}_j$。其中$\funp{a}(\cdot)$的具体计算方式在{\chapterten}已经详细讨论。
-
 \parinterval 在ConvS2S模型中，解码器同样采用堆叠的多层门控卷积网络来对目标语言进行序列建模。区别于编码器，解码器在每一层卷积网络之后引入了注意力机制，用来参考源语言信息。ConvS2S选用了点乘注意力，并且通过类似残差连接的方式将注意力操作的输入与输出同时作用于下一层计算，称为多跳注意力。其具体计算方式如下：
 \begin{eqnarray}
 \alpha_{ij}^l &=& \frac{ \textrm{exp} (\mathbi{d}_{j}^l\mathbi{h}_i) }{\sum_{i^{'}=1}^m \textrm{exp} (\mathbi{d}_{j}^l\mathbi{h}_{i^{'}})}

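Reading aid for the attention formulas \eqref{eq:11-8} and \eqref{eq:11-9} moved by the chapter11.tex hunks above (not part of the commit): a small numerical instance with three source positions. The scores $0$, $\ln 2$ and $\ln 5$ are hypothetical and chosen only so that the exponentials come out as whole numbers. `\mathbi` and `\funp` are the book's own macros as they appear in the diff.

```latex
% Hypothetical scores: a(s_{j-1}, h_1) = 0, a(s_{j-1}, h_2) = ln 2, a(s_{j-1}, h_3) = ln 5
% so exp(a) evaluates to 1, 2 and 5 for the three source positions
\begin{eqnarray}
\sum_{i'} \textrm{exp}(\funp{a}(\mathbi{s}_{j-1},\mathbi{h}_{i'})) &=& 1 + 2 + 5 = 8 \nonumber \\
(\alpha_{1,j},\ \alpha_{2,j},\ \alpha_{3,j}) &=& (1/8,\ 2/8,\ 5/8) = (0.125,\ 0.25,\ 0.625) \nonumber \\
\mathbi{C}_j &=& 0.125\,\mathbi{h}_1 + 0.25\,\mathbi{h}_2 + 0.625\,\mathbi{h}_3 \nonumber
\end{eqnarray}
```

The weights sum to 1, so $\mathbi{C}_j$ is a weighted average of the encoder outputs, dominated here by the third source position.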
--- a/Chapter12/Figures/figure-a-combination-of-position-encoding-and-word-encoding.tex
+++ b/Chapter12/Figures/figure-a-combination-of-position-encoding-and-word-encoding.tex

 \begin{tikzpicture}
 \begin{scope}
-\tikzstyle{rnode} = [draw,minimum width=3.5em,minimum height=1.2em]
+\tikzstyle{rnode} = [draw,minimum width=4em,minimum height=1.2em]

-\node [rnode,anchor=south west,fill=red!20!white] (e1) at (0,0) {\scriptsize{$e(\textrm{沈阳})$}};
-\node [rnode,anchor=south west,fill=red!20!white] (e2) at ([xshift=1em]e1.south east) {\scriptsize{$e(\textrm{到})$}};
-\node [rnode,anchor=south west,fill=red!20!white] (e3) at ([xshift=1em]e2.south east) {\scriptsize{$e(\textrm{广州})$}};
-\node [rnode,anchor=south west,fill=red!20!white] (e4) at ([xshift=1em]e3.south east) {\scriptsize{$e(\textrm{的})$}};
-\node [rnode,anchor=south west,fill=red!20!white] (e5) at ([xshift=1em]e4.south east) {\scriptsize{$e(\textrm{机票})$}};
+\node [rnode,anchor=south west,fill=red!30!white] (e1) at (0,0) {\scriptsize{$e(\textrm{沈阳})$}};
+\node [rnode,anchor=south west,fill=red!30!white] (e2) at ([xshift=1.5em]e1.south east) {\scriptsize{$e(\textrm{到})$}};
+\node [rnode,anchor=south west,fill=red!30!white] (e3) at ([xshift=1.5em]e2.south east) {\scriptsize{$e(\textrm{广州})$}};
+\node [rnode,anchor=south west,fill=red!30!white] (e4) at ([xshift=1.5em]e3.south east) {\scriptsize{$e(\textrm{的})$}};
+\node [rnode,anchor=south west,fill=red!30!white] (e5) at ([xshift=1.5em]e4.south east) {\scriptsize{$e(\textrm{机票})$}};

-\node [rnode,anchor=south west,fill=green!20!white] (h1) at ([yshift=1.5em]e1.north west) {\scriptsize{$h(\textrm{沈阳})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (h2) at ([yshift=1.5em]e2.north west) {\scriptsize{$h(\textrm{到})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (h3) at ([yshift=1.5em]e3.north west) {\scriptsize{$h(\textrm{广州})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (h4) at ([yshift=1.5em]e4.north west) {\scriptsize{$h(\textrm{的})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (h5) at ([yshift=1.5em]e5.north west) {\scriptsize{$h(\textrm{机票})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (h1) at ([yshift=1.5em]e1.north west) {\scriptsize{$h(\textrm{沈阳})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (h2) at ([yshift=1.5em]e2.north west) {\scriptsize{$h(\textrm{到})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (h3) at ([yshift=1.5em]e3.north west) {\scriptsize{$h(\textrm{广州})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (h4) at ([yshift=1.5em]e4.north west) {\scriptsize{$h(\textrm{的})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (h5) at ([yshift=1.5em]e5.north west) {\scriptsize{$h(\textrm{机票})$}};

 \foreach \x in {1,2,3,4,5}{
 	\node [anchor=north] (plus\x) at ([yshift=-0em]e\x.south) {\scriptsize{$\mathbf{\oplus}$}};
 }

-\node [rnode,anchor=north,fill=yellow!20!white] (pos1) at ([yshift=-1.1em]e1.south) {\scriptsize{$\textrm{PE}(1)$}};
-\node [rnode,anchor=north,fill=yellow!20!white] (pos2) at ([yshift=-1.1em]e2.south) {\scriptsize{$\textrm{PE}(2)$}};
-\node [rnode,anchor=north,fill=yellow!20!white] (pos3) at ([yshift=-1.1em]e3.south) {\scriptsize{$\textrm{PE}(3)$}};
-\node [rnode,anchor=north,fill=yellow!20!white] (pos4) at ([yshift=-1.1em]e4.south) {\scriptsize{$\textrm{PE}(4)$}};
-\node [rnode,anchor=north,fill=yellow!20!white] (pos5) at ([yshift=-1.1em]e5.south) {\scriptsize{$\textrm{PE}(5)$}};
+\node [rnode,anchor=north,fill=yellow!30!white] (pos1) at ([yshift=-1.1em]e1.south) {\scriptsize{$\textrm{PE}(1)$}};
+\node [rnode,anchor=north,fill=yellow!30!white] (pos2) at ([yshift=-1.1em]e2.south) {\scriptsize{$\textrm{PE}(2)$}};
+\node [rnode,anchor=north,fill=yellow!30!white] (pos3) at ([yshift=-1.1em]e3.south) {\scriptsize{$\textrm{PE}(3)$}};
+\node [rnode,anchor=north,fill=yellow!30!white] (pos4) at ([yshift=-1.1em]e4.south) {\scriptsize{$\textrm{PE}(4)$}};
+\node [rnode,anchor=north,fill=yellow!30!white] (pos5) at ([yshift=-1.1em]e5.south) {\scriptsize{$\textrm{PE}(5)$}};


 \foreach \x in {1,2,3,4,5}{

--- a/Chapter12/Figures/figure-calculation-of-context-vector-c.tex
+++ b/Chapter12/Figures/figure-calculation-of-context-vector-c.tex
@@ -2,13 +2,13 @@
 \begin{tikzpicture}
 \begin{scope}

-\tikzstyle{rnode} = [draw,minimum width=3.5em,minimum height=1.2em]
+\tikzstyle{rnode} = [draw,minimum width=4em,minimum height=1.2em]

-\node [rnode,anchor=south west,fill=green!20!white] (key1) at (0,0) {\scriptsize{$h(\textrm{沈阳})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key2) at ([xshift=1em]key1.south east) {\scriptsize{$h(\textrm{到})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key3) at ([xshift=1em]key2.south east) {\scriptsize{$h(\textrm{广州})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key4) at ([xshift=2em]key3.south east) {\scriptsize{$h(\textrm{机票})$}};
-\node [rnode,anchor=south west] (key5) at ([xshift=1em]key4.south east) {\scriptsize{$h(\textrm{机票})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (key1) at (0,0) {\scriptsize{$h(\textrm{沈阳})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (key2) at ([xshift=1.5em]key1.south east) {\scriptsize{$h(\textrm{到})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (key3) at ([xshift=1.5em]key2.south east) {\scriptsize{$h(\textrm{广州})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (key4) at ([xshift=3em]key3.south east) {\scriptsize{$h(\textrm{机票})$}};
+\node [rnode,anchor=south west] (key5) at ([xshift=1.5em]key4.south east) {\scriptsize{$h(\textrm{机票})$}};

 \node [anchor=west] (sep1) at ([xshift=0.3em]key3.east) {\scriptsize{$\textbf{...}$}};

@@ -17,17 +17,17 @@
 \draw [->] ([yshift=1pt,xshift=3pt]key5.north) .. controls +(90:1.8em) and +(90:1.8em) .. ([yshift=1pt]key2.north);
 \draw [->] ([yshift=1pt,xshift=6pt]key5.north) .. controls +(90:2.2em) and +(90:2.2em) .. ([yshift=1pt]key1.north);

-\node [anchor=south west] (alpha1) at ([xshift=-1em]key1.north west) {\scriptsize{$\alpha_1=.2$}};
-\node [anchor=south west] (alpha2) at ([xshift=-1em]key2.north west) {\scriptsize{$\alpha_2=.3$}};
-\node [anchor=south west] (alpha3) at ([xshift=-1em]key3.north west) {\scriptsize{$\alpha_3=.1$}};
-\node [anchor=south west] (alpha4) at ([xshift=-1em]key4.north west) {\scriptsize{$\alpha_i=.3$}};
+\node [anchor=south west] (alpha1) at ([xshift=-1.2em]key1.north west) {\scriptsize{$\alpha_1=0.2$}};
+\node [anchor=south west] (alpha2) at ([xshift=-1.2em]key2.north west) {\scriptsize{$\alpha_2=0.3$}};
+\node [anchor=south west] (alpha3) at ([xshift=-1.2em]key3.north west) {\scriptsize{$\alpha_3=0.1$}};
+\node [anchor=south west] (alpha4) at ([xshift=-1.2em]key4.north west) {\scriptsize{$\alpha_i=0.3$}};

 \vspace{0.5em}

-\node [rnode,anchor=south west,fill=green!20!white] (key6) at ([yshift=2em]key1.north west) {\scriptsize{$h(\textrm{广州})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key7) at ([yshift=2em]key2.north west) {\scriptsize{$h(\textrm{到})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key8) at ([yshift=2em]key3.north west) {\scriptsize{$h(\textrm{沈阳})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key9) at ([yshift=2em]key4.north west) {\scriptsize{$h(\textrm{机票})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (key6) at ([yshift=2em]key1.north west) {\scriptsize{$h(\textrm{广州})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (key7) at ([yshift=2em]key2.north west) {\scriptsize{$h(\textrm{到})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (key8) at ([yshift=2em]key3.north west) {\scriptsize{$h(\textrm{沈阳})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (key9) at ([yshift=2em]key4.north west) {\scriptsize{$h(\textrm{机票})$}};
 \node [rnode,anchor=south west] (key10) at ([yshift=2em]key5.north west) {\scriptsize{$h(\textrm{机票})$}};

 \node [anchor=west] (sep1) at ([xshift=0.3em]key8.east) {\scriptsize{$\textbf{...}$}};
@@ -37,10 +37,10 @@
 \draw [->] ([yshift=1pt,xshift=3pt]key10.north) .. controls +(90:1.8em) and +(90:1.8em) .. ([yshift=1pt]key7.north);
 \draw [->] ([yshift=1pt,xshift=6pt]key10.north) .. controls +(90:2.2em) and +(90:2.2em) .. ([yshift=1pt]key6.north);

-\node [anchor=south west] (alpha5) at ([xshift=-1em]key6.north west) {\scriptsize{$\alpha_1=.1$}};
-\node [anchor=south west] (alpha6) at ([xshift=-1em]key7.north west) {\scriptsize{$\alpha_2=.3$}};
-\node [anchor=south west] (alpha7) at ([xshift=-1em]key8.north west) {\scriptsize{$\alpha_3=.2$}};
-\node [anchor=south west] (alpha8) at ([xshift=-1em]key9.north west) {\scriptsize{$\alpha_i=.3$}};
+\node [anchor=south west] (alpha5) at ([xshift=-1.2em]key6.north west) {\scriptsize{$\alpha_1=0.1$}};
+\node [anchor=south west] (alpha6) at ([xshift=-1.2em]key7.north west) {\scriptsize{$\alpha_2=0.3$}};
+\node [anchor=south west] (alpha7) at ([xshift=-1.2em]key8.north west) {\scriptsize{$\alpha_3=0.2$}};
+\node [anchor=south west] (alpha8) at ([xshift=-1.2em]key9.north west) {\scriptsize{$\alpha_i=0.3$}};

 \end{scope}
 \end{tikzpicture}

--- a/Chapter12/Figures/figure-comparison-of-the-number-of-padding-in-batch.tex
+++ b/Chapter12/Figures/figure-comparison-of-the-number-of-padding-in-batch.tex
@@ -4,7 +4,7 @@

 \begin{scope}[scale=1.5]
 {\Large
-\tikzstyle{snode} = [draw,inner sep=1pt,minimum width=3em,minimum height=0.5em,rounded corners=1pt,fill=green!30!white]
+\tikzstyle{snode} = [draw,inner sep=1pt,minimum width=3em,minimum height=0.5em,rounded corners=1pt,fill=green!40!white]
 \tikzstyle{pnode} = [draw,inner sep=1pt,minimum width=1em,minimum height=0.5em,rounded corners=1pt]
 \node [anchor=west,snode] (s1) at (0,0) {};
 \node [anchor=north west,snode,minimum width=6.5em] (s2) at ([yshift=-0.3em]s1.south west) {};
@@ -15,7 +15,7 @@
 \node [anchor=west,pnode,minimum width=3em] (p1) at ([xshift=0.3em]s1.east) {};
 \node [anchor=west,pnode,minimum width=4em] (p3) at ([xshift=0.3em]s3.east) {};

-\node [anchor=west,snode,minimum width=5em] (s4) at ([xshift=4em]p1.east) {};
+\node [anchor=west,snode,minimum width=5em] (s4) at ([xshift=5em]p1.east) {};
 \node [anchor=north west,snode,minimum width=5em] (s5) at ([yshift=-0.3em]s4.south west) {};
 \node [anchor=north west,snode,minimum width=6.5em] (s6) at ([yshift=-0.3em]s5.south west) {};


--- a/Chapter12/Figures/figure-decode-of-transformer.tex
+++ b/Chapter12/Figures/figure-decode-of-transformer.tex
@@ -3,14 +3,14 @@

   \begin{tikzpicture}
    \begin{scope}
-    \tikzstyle{rnnnode} = [minimum height=1.1em,minimum width=2.1em,inner sep=2pt,rounded corners=1pt,draw,fill=red!20];
+    \tikzstyle{rnnnode} = [minimum height=1.1em,minimum width=2.7em,inner sep=2pt,rounded corners=1pt,draw,fill=red!20];

    \node [rnnnode,anchor=west] (h1) at (0,0) {\tiny{$\mathbi{h}_1$}};
-    \node [rnnnode,anchor=west] (h2) at ([xshift=1em]h1.east) {\tiny{$\mathbi{h}_2$}};
-    \node [rnnnode,anchor=west] (h3) at ([xshift=1em]h2.east) {\tiny{$\mathbi{h}_3$}};
+    \node [rnnnode,anchor=west] (h2) at ([xshift=1.5em]h1.east) {\tiny{$\mathbi{h}_2$}};
+    \node [rnnnode,anchor=west] (h3) at ([xshift=1.5em]h2.east) {\tiny{$\mathbi{h}_3$}};
    \node [rnnnode,anchor=north,fill=green!20] (e1) at ([yshift=-1em]h1.south) {\tiny{$e_x()$}};
-    \node [rnnnode,anchor=west,fill=green!20] (e2) at ([xshift=1em]e1.east) {\tiny{$e_x()$}};
-    \node [rnnnode,anchor=west,fill=green!20] (e3) at ([xshift=1em]e2.east) {\tiny{$e_x()$}};
+    \node [rnnnode,anchor=west,fill=green!20] (e2) at ([xshift=1.5em]e1.east) {\tiny{$e_x()$}};
+    \node [rnnnode,anchor=west,fill=green!20] (e3) at ([xshift=1.5em]e2.east) {\tiny{$e_x()$}};
    \node [anchor=north,inner sep=2pt] (w1) at ([yshift=-0.6em]e1.south) {\tiny{你}};
    \node [anchor=north,inner sep=2pt] (w2) at ([yshift=-0.6em]e2.south) {\tiny{好}};
    \node [anchor=north,inner sep=2pt] (w3) at ([yshift=-0.6em]e3.south) {\tiny{$\langle$eos$\rangle$}};
@@ -33,14 +33,14 @@
    \node [anchor=south] (encoder) at ([xshift=-0.2em]h1.north west) {\scriptsize{\textbf{编码器}}};

 {
-    \node [rnnnode,anchor=west,fill=green!20] (t1) at ([xshift=3em]e3.east) {\tiny{$e_y()$}};
+    \node [rnnnode,anchor=west,fill=green!20] (t1) at ([xshift=3.5em]e3.east) {\tiny{$e_y()$}};
    }
 {
-    \node [rnnnode,anchor=west,fill=green!20] (t2) at ([xshift=1.5em]t1.east) {\tiny{$e_y()$}};
+    \node [rnnnode,anchor=west,fill=green!20] (t2) at ([xshift=1.8em]t1.east) {\tiny{$e_y()$}};
    }
 {
-    \node [rnnnode,anchor=west,fill=green!20] (t3) at ([xshift=1.5em]t2.east) {\tiny{$e_y()$}};
-    \node [rnnnode,anchor=west,fill=green!20] (t4) at ([xshift=1.5em]t3.east) {\tiny{$e_y()$}};
+    \node [rnnnode,anchor=west,fill=green!20] (t3) at ([xshift=1.8em]t2.east) {\tiny{$e_y()$}};
+    \node [rnnnode,anchor=west,fill=green!20] (t4) at ([xshift=1.8em]t3.east) {\tiny{$e_y()$}};
    %\node [anchor=west,inner sep=2pt] (t5) at ([xshift=0.3em]t4.east) {\tiny{...}};
    }
 {

--- a/Chapter12/Figures/figure-dependencies-between-words-in-a-recurrent-neural-network.tex
+++ b/Chapter12/Figures/figure-dependencies-between-words-in-a-recurrent-neural-network.tex
@@ -3,11 +3,11 @@
 \begin{tikzpicture}
 \begin{scope}
 \node [anchor=west] (w0) at (0,0) {$w_1$};
-\node [anchor=west] (w1) at ([xshift=0.5em]w0.east) {$w_2$};
-\node [anchor=west] (w2) at ([xshift=0.5em]w1.east) {$w_3$};
-\node [anchor=west] (w3) at ([xshift=0.5em]w2.east) {$...$};
-\node [anchor=west] (w4) at ([xshift=0.5em]w3.east) {$w_{m-1}$};
-\node [anchor=west,fill=green!20!white] (w5) at ([xshift=0.5em]w4.east) {$w_{m}$};
+\node [anchor=west] (w1) at ([xshift=1em]w0.east) {$w_2$};
+\node [anchor=west] (w2) at ([xshift=1em]w1.east) {$w_3$};
+\node [anchor=west] (w3) at ([xshift=1em]w2.east) {$...$};
+\node [anchor=west] (w4) at ([xshift=1em]w3.east) {$w_{m-1}$};
+\node [anchor=west,fill=green!30!white] (w5) at ([xshift=0.5em]w4.east) {$w_{m}$};
 \draw [->,thick,red] (w1.north).. controls +(130:0.5) and +(50:0.5) .. (w0.north);
 \draw [->,thick,red] (w2.north).. controls +(130:0.5) and +(50:0.5) .. (w1.north);
 \draw [->,thick,red] ([yshift=0.2em]w3.north).. controls +(130:0.5) and +(50:0.5) .. (w2.north);

--- a/Chapter12/Figures/figure-dependencies-between-words-of-attention.tex
+++ b/Chapter12/Figures/figure-dependencies-between-words-of-attention.tex
@@ -3,11 +3,11 @@
 \begin{tikzpicture}
 \begin{scope}
 \node [anchor=west] (w0) at (0,-2) {$w_1$};
-\node [anchor=west] (w1) at ([xshift=0.5em]w0.east) {$w_2$};
-\node [anchor=west] (w2) at ([xshift=0.5em]w1.east) {$w_3$};
-\node [anchor=west] (w3) at ([xshift=0.5em]w2.east) {$...$};
-\node [anchor=west] (w4) at ([xshift=0.5em]w3.east) {$w_{m-1}$};
-\node [anchor=west,fill=green!20!white] (w5) at ([xshift=0.5em]w4.east) {$w_{m}$};
+\node [anchor=west] (w1) at ([xshift=1em]w0.east) {$w_2$};
+\node [anchor=west] (w2) at ([xshift=1em]w1.east) {$w_3$};
+\node [anchor=west] (w3) at ([xshift=1em]w2.east) {$...$};
+\node [anchor=west] (w4) at ([xshift=1em]w3.east) {$w_{m-1}$};
+\node [anchor=west,fill=green!30!white] (w5) at ([xshift=0.5em]w4.east) {$w_{m}$};
 \draw [->,thick,red] (w5.north).. controls +(100:0.85) and +(50:0.85) .. (w0.north);
 \draw [->,thick,red] (w5.north).. controls +(110:0.75) and +(50:0.75) .. (w1.north);
 \draw [->,thick,red] (w5.north).. controls +(110:0.75) and +(50:0.75) .. (w2.north);

--- a/Chapter12/Figures/figure-different-regularization-methods.tex
+++ b/Chapter12/Figures/figure-different-regularization-methods.tex
@@ -3,12 +3,12 @@

 \begin{tikzpicture}
 \begin{scope}
-\tikzstyle{lnode} = [minimum height=1.5em,minimum width=3em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!20];
+\tikzstyle{lnode} = [minimum height=1.5em,minimum width=3em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!30];
 \tikzstyle{standard} = [rounded corners=3pt]

 \node [lnode,anchor=west] (l1) at (0,0) {\scriptsize{子层}};
 \node [lnode,anchor=west] (l2) at ([xshift=3em]l1.east) {\scriptsize{层标准化}};
-\node [lnode,anchor=west] (l3) at ([xshift=4em]l2.east) {\scriptsize{层标准化}};
+\node [lnode,anchor=west] (l3) at ([xshift=6em]l2.east) {\scriptsize{层标准化}};
 \node [lnode,anchor=west] (l4) at ([xshift=1.5em]l3.east) {\scriptsize{子层}};

 \node [anchor=west] (plus1) at ([xshift=0.9em]l1.east) {\scriptsize{$\mathbf{\oplus}$}};

--- a/Chapter12/Figures/figure-example-of-self-attention-mechanism-calculation.tex
+++ b/Chapter12/Figures/figure-example-of-self-attention-mechanism-calculation.tex
@@ -3,15 +3,15 @@
 \begin{tikzpicture}
 \begin{scope}

-\tikzstyle{rnode} = [draw,minimum width=2.8em,minimum height=1.2em]
+\tikzstyle{rnode} = [draw,minimum width=3.5em,minimum height=1.2em]



-\node [rnode,anchor=south west,fill=green!20!white] (key11) at (0,0) {\scriptsize{$h(\textrm{他})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key12) at ([xshift=0.8em]key11.south east) {\scriptsize{$h(\textrm{什么})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key13) at ([xshift=0.8em]key12.south east) {\scriptsize{$h(\textrm{也})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key14) at ([xshift=0.8em]key13.south east) {\scriptsize{$h(\textrm{没})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key15) at ([xshift=0.8em]key14.south east) {\scriptsize{$h(\textrm{学})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (key11) at (0,0) {\scriptsize{$h(\textrm{他})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (key12) at ([xshift=1.5em]key11.south east) {\scriptsize{$h(\textrm{什么})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (key13) at ([xshift=1.5em]key12.south east) {\scriptsize{$h(\textrm{也})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (key14) at ([xshift=1.5em]key13.south east) {\scriptsize{$h(\textrm{没})$}};
+\node [rnode,anchor=south west,fill=green!30!white] (key15) at ([xshift=1.5em]key14.south east) {\scriptsize{$h(\textrm{学})$}};

 \node [rnode,anchor=east] (query1) at ([xshift=-1em]key11.west) {\scriptsize{$h(\textrm{他})$}};


--- a/Chapter12/Figures/figure-lrate-of-transformer.tex
+++ b/Chapter12/Figures/figure-lrate-of-transformer.tex
@@ -4,7 +4,7 @@
  \begin{tikzpicture}
    \footnotesize{
      \begin{axis}[
-      width=.60\textwidth,
+      width=.8\textwidth,
      height=.40\textwidth,
      legend style={at={(0.60,0.08)}, anchor=south west},
      xlabel={\footnotesize{更新步数  (10k)}},

--- a/Chapter12/Figures/figure-multi-head-attention-model.tex
+++ b/Chapter12/Figures/figure-multi-head-attention-model.tex
@@ -2,29 +2,31 @@

 \begin{tikzpicture}
 \begin{scope}
-
-\node [anchor=west,draw=black!30,inner sep=4pt,fill=ugreen!20!white,text=ugreen!20!white] (Linear0) at (0,0) {\footnotesize{Linear}};
-\node [anchor=south west,draw=black!50,fill=ugreen!20!white,draw,inner sep=4pt,text=ugreen!20!white] (Linear01) at ([shift={(-0.2em,-0.2em)}]Linear0.south west) {\footnotesize{Linear}};
-\node [anchor=south west,fill=ugreen!20!white,draw,inner sep=4pt] (Linear02) at ([shift={(-0.2em,-0.2em)}]Linear01.south west) {\footnotesize{Linear}};
+%
+\node [anchor=west,draw=black!30,inner sep=4pt,fill=ugreen!30!white,text=ugreen!30!white,minimum width=4em] (Linear0) at (0,0) {\footnotesize{Linear}};
+\node [anchor=south west,draw=black!50,fill=ugreen!30!white,draw,inner sep=4pt,text=ugreen!30!white,minimum width=4em] (Linear01) at ([shift={(-0.2em,-0.2em)}]Linear0.south west) {\footnotesize{Linear}};
+\node [anchor=south west,fill=ugreen!30!white,draw,inner sep=4pt,minimum width=4em] (Linear02) at ([shift={(-0.2em,-0.2em)}]Linear01.south west) {\footnotesize{Linear}};
 \node [anchor=north] (Q) at ([xshift=0em,yshift=-1em]Linear02.south) {\footnotesize{$\mathbi{Q}$}};

-\node [anchor=west,draw=black!30,inner sep=4pt,fill=ugreen!20!white,text=ugreen!20!white] (Linear1) at ([xshift=1.5em]Linear0.east) {\footnotesize{Linear}};
-\node [anchor=south west,draw=black!50,fill=ugreen!20!white,draw,inner sep=4pt,text=ugreen!20!white] (Linear11) at ([shift={(-0.2em,-0.2em)}]Linear1.south west) {\footnotesize{Linear}};
-\node [anchor=south west,fill=ugreen!20!white,draw,inner sep=4pt] (Linear12) at ([shift={(-0.2em,-0.2em)}]Linear11.south west) {\footnotesize{Linear}};
+\node [anchor=west,draw=black!30,inner sep=4pt,fill=ugreen!30!white,text=ugreen!30!white,minimum width=4em] (Linear1) at ([xshift=1.5em]Linear0.east) {\footnotesize{Linear}};
+\node [anchor=south west,draw=black!50,fill=ugreen!30!white,draw,inner sep=4pt,text=ugreen!30!white,minimum width=4em] (Linear11) at ([shift={(-0.2em,-0.2em)}]Linear1.south west) {\footnotesize{Linear}};
+\node [anchor=south west,fill=ugreen!30!white,draw,inner sep=4pt,minimum width=4em] (Linear12) at ([shift={(-0.2em,-0.2em)}]Linear11.south west) {\footnotesize{Linear}};
 \node [anchor=north] (K) at ([xshift=0em,yshift=-1em]Linear12.south) {\footnotesize{$\mathbi{K}$}};

-\node [anchor=west,draw=black!30,inner sep=4pt,fill=ugreen!20!white,text=ugreen!20!white] (Linear2) at ([xshift=1.5em]Linear1.east) {\footnotesize{Linear}};
-\node [anchor=south west,draw=black!50,fill=ugreen!20!white,draw,inner sep=4pt,text=ugreen!20!white] (Linear21) at ([shift={(-0.2em,-0.2em)}]Linear2.south west) {\footnotesize{Linear}};
-\node [anchor=south west,fill=ugreen!20!white,draw,inner sep=4pt] (Linear22) at ([shift={(-0.2em,-0.2em)}]Linear21.south west) {\footnotesize{Linear}};
+\node [anchor=west,draw=black!30,inner sep=4pt,fill=ugreen!30!white,text=ugreen!30!white,minimum width=4em] (Linear2) at ([xshift=1.5em]Linear1.east) {\footnotesize{Linear}};
+\node [anchor=south west,draw=black!50,fill=ugreen!30!white,draw,inner sep=4pt,text=ugreen!30!white,minimum width=4em] (Linear21) at ([shift={(-0.2em,-0.2em)}]Linear2.south west) {\footnotesize{Linear}};
+\node [anchor=south west,fill=ugreen!30!white,draw,inner sep=4pt,minimum width=4em] (Linear22) at ([shift={(-0.2em,-0.2em)}]Linear21.south west) {\footnotesize{Linear}};
 \node [anchor=north] (V) at ([xshift=0em,yshift=-1em]Linear22.south) {\footnotesize{$\mathbi{V}$}};

-\node [anchor=south,draw=black!30,minimum width=12em,minimum height=2em,inner sep=4pt,fill=blue!20!white] (Scale) at ([yshift=1em]Linear1.north) {\footnotesize{}};
-\node [anchor=south west,draw=black!50,minimum width=12em,minimum height=2em,fill=blue!20!white,draw,inner sep=4pt] (Scale1) at ([shift={(-0.2em,-0.2em)}]Scale.south west) {\footnotesize{}};
-\node [anchor=south west,fill=blue!20!white,draw,minimum width=12em,minimum height=2em,inner sep=4pt] (Scale2) at ([shift={(-0.2em,-0.2em)}]Scale1.south west) {\footnotesize{Scaled Dot-Product Attention}};
+% scaled dot-product attention
+\node [anchor=south,draw=black!30,minimum width=16em,minimum height=2em,inner sep=4pt,fill=blue!30!white] (Scale) at ([yshift=1em]Linear1.north) {\footnotesize{}};
+\node [anchor=south west,draw=black!50,minimum width=16em,minimum height=2em,fill=blue!30!white,draw,inner sep=4pt] (Scale1) at ([shift={(-0.2em,-0.2em)}]Scale.south west) {\footnotesize{}};
+\node [anchor=south west,fill=blue!30!white,draw,minimum width=16em,minimum height=2em,inner sep=4pt] (Scale2) at ([shift={(-0.2em,-0.2em)}]Scale1.south west) {\footnotesize{Scaled Dot-Product Attention}};

-\node [anchor=south,draw,minimum width=4em,inner sep=4pt,fill=yellow!30] (Concat) at ([yshift=1em]Scale2.north) {\footnotesize{Concat}};
+%
+\node [anchor=south,draw,minimum width=6em,inner sep=4pt,fill=yellow!30] (Concat) at ([yshift=1em]Scale2.north) {\footnotesize{Concat}};

-\node [anchor=south,draw,minimum width=4em,inner sep=4pt,fill=ugreen!20!white] (Linear) at ([yshift=1em]Concat.north) {\footnotesize{Linear}};
+\node [anchor=south,draw,minimum width=6em,inner sep=4pt,fill=ugreen!30!white] (Linear) at ([yshift=1em]Concat.north) {\footnotesize{Linear}};


 \draw [->] ([yshift=0.1em]Q.north) -- ([yshift=-0.1em]Linear02.south);

--- a/Chapter12/Figures/figure-point-product-attention-model.tex
+++ b/Chapter12/Figures/figure-point-product-attention-model.tex
@@ -6,13 +6,13 @@
 \begin{tikzpicture}
 \begin{scope}

-\node [anchor=south west,fill=white,draw,inner sep=4pt,minimum width=4em,fill=blue!20!white] (MatMul) at (0,0) {\tiny{MatMul}};
+\node [anchor=south west,fill=white,draw,inner sep=4pt,minimum width=4em,fill=blue!25!white] (MatMul) at (0,0) {\tiny{MatMul}};
 \node [anchor=north] (Q1) at ([xshift=-1.4em,yshift=-1em]MatMul.south) {\footnotesize{$\mathbi{Q}$}};
 \node [anchor=north] (K1) at ([xshift=1.4em,yshift=-1em]MatMul.south) {\footnotesize{$\mathbi{K}$}};
-\node [anchor=south,draw,inner sep=4pt,fill=yellow!30,minimum width=2.5em] (Scale3) at ([yshift=1em]MatMul.north) {\tiny{Scale}};
-\node [anchor=south,draw,inner sep=4pt,fill=purple!20,minimum width=3.5em] (Mask) at ([yshift=0.8em]Scale3.north) {\tiny{Mask(opt.)}};
-\node [anchor=south,draw,inner sep=4pt,fill=ugreen!20!white] (SoftMax) at ([yshift=1em]Mask.north) {\tiny{SoftMax}};
-\node [anchor=south,draw,minimum width=4em,inner sep=4pt,fill=blue!20!white] (MatMul1) at ([xshift=1.7em,yshift=1em]SoftMax.north) {\tiny{MatMul}};
+\node [anchor=south,draw,inner sep=4pt,fill=yellow!25,minimum width=2.5em] (Scale3) at ([yshift=1em]MatMul.north) {\tiny{Scale}};
+\node [anchor=south,draw,inner sep=4pt,fill=purple!25,minimum width=3.5em] (Mask) at ([yshift=0.8em]Scale3.north) {\tiny{Mask(opt.)}};
+\node [anchor=south,draw,inner sep=4pt,fill=ugreen!25!white] (SoftMax) at ([yshift=1em]Mask.north) {\tiny{SoftMax}};
+\node [anchor=south,draw,minimum width=4em,inner sep=4pt,fill=blue!25!white] (MatMul1) at ([xshift=1.7em,yshift=1em]SoftMax.north) {\tiny{MatMul}};
 \node [anchor=north] (V1) at ([xshift=2em]K1.north) {\footnotesize{$\mathbi{V}$}};
 \node [anchor=north] (null) at ([yshift=0.8em]MatMul1.north) {};

@@ -26,13 +26,13 @@
 \draw [->] ([yshift=0.1em]MatMul1.north) -- ([yshift=0.8em]MatMul1.north);

 {
-\node [anchor=east] (line1) at ([xshift=-4em,yshift=1em]MatMul.west) {\scriptsize{自注意力机制的Query}};
+\node [anchor=east] (line1) at ([xshift=-5em,yshift=1em]MatMul.west) {\scriptsize{自注意力机制的Query}};
 \node [anchor=north west] (line2) at ([yshift=0.3em]line1.south west) {\scriptsize{Key和Value均来自同一句}};
 \node [anchor=north west] (line3) at ([yshift=0.3em]line2.south west) {\scriptsize{子，编码-解码注意力机制}};
 \node [anchor=north west] (line4) at ([yshift=0.3em]line3.south west) {\scriptsize{与前面讲的一样}};
 }
 {
-\node [anchor=west] (line11) at ([xshift=3em,yshift=0em]MatMul.east) {\scriptsize{Query和Key的转置进}};
+\node [anchor=west] (line11) at ([xshift=5em,yshift=0em]MatMul.east) {\scriptsize{Query和Key的转置进}};
 \node [anchor=north west] (line12) at ([yshift=0.3em]line11.south west) {\scriptsize{行点积,得到句子内部}};
 \node [anchor=north west] (line13) at ([yshift=0.3em]line12.south west) {\scriptsize{各个位置的相关性}};
 }
@@ -57,28 +57,28 @@

 \begin{pgfonlayer}{background}
 {
-\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=green!10,drop shadow,draw=ugreen,minimum width=10em] [fit = (line1) (line2) (line3) (line4)] (box1) {};
+\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=green!20,drop shadow,draw=ugreen,minimum width=10em] [fit = (line1) (line2) (line3) (line4)] (box1) {};
 \node [rectangle,inner sep=0.1em,rounded corners=1pt,very thick,dotted,draw=ugreen] [fit = (Q1) (K1) (V1)] (box0) {};
-\draw [->,dotted,very thick,ugreen] ([yshift=-1.5em,xshift=1.8em]box1.east) -- ([yshift=-1.5em,xshift=0.1em]box1.east);
+\draw [->,dotted,very thick,ugreen] ([yshift=-1.5em,xshift=2.8em]box1.east) -- ([yshift=-1.5em,xshift=0.1em]box1.east);
 }
 {
-\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=blue!20!white,drop shadow,draw=blue] [fit = (line11) (line12) (line13)] (box2) {};
-\draw [->,dotted,very thick,blue] ([yshift=1em,xshift=-2.8em]box2.west) -- ([yshift=1em,xshift=-0.1em]box2.west);
+\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=blue!25!white,drop shadow,draw=blue] [fit = (line11) (line12) (line13)] (box2) {};
+\draw [->,dotted,very thick,blue] ([yshift=1em,xshift=-4.8em]box2.west) -- ([yshift=1em,xshift=-0.1em]box2.west);
 }

 {
-\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=yellow!20,drop shadow,draw=black] [fit = (line21) (line22) (line23)] (box3) {};
+\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=yellow!25,drop shadow,draw=black] [fit = (line21) (line22) (line23)] (box3) {};
 \draw [->,dotted,very thick,black] ([xshift=0.1em]Scale3.east) .. controls +(east:1) and +(west:1) .. ([yshift=1.0em]box3.west) ;
 }

 {
-\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=red!10,drop shadow,draw=red] [fit = (line31) (line32) (line33) (line34)] (box4) {};
-\draw [->,dotted,very thick,red] ([yshift=-1.2em,xshift=2.2em]box4.east) -- ([yshift=-1.2em,xshift=0.1em]box4.east);
+\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=red!20,drop shadow,draw=red] [fit = (line31) (line32) (line33) (line34)] (box4) {};
+\draw [->,dotted,very thick,red] ([yshift=-1.2em,xshift=3.2em]box4.east) -- ([yshift=-1.2em,xshift=0.1em]box4.east);
 }

 {
-\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=blue!20!white,drop shadow,draw=blue] [fit = (line41) (line42)] (box5) {};
-\draw [->,dotted,very thick,blue] ([yshift=-0.3em,xshift=-1em]box5.west) -- ([yshift=-0.3em,xshift=-0.1em]box5.west);
+\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=blue!25!white,drop shadow,draw=blue] [fit = (line41) (line42)] (box5) {};
+\draw [->,dotted,very thick,blue] ([yshift=-0.3em,xshift=-3em]box5.west) -- ([yshift=-0.3em,xshift=-0.1em]box5.west);
 }					
 \end{pgfonlayer}


--- a/Chapter12/Figures/figure-position-of-difference-and-layer-regularization-in-the-model.tex
+++ b/Chapter12/Figures/figure-position-of-difference-and-layer-regularization-in-the-model.tex
@@ -4,12 +4,12 @@

 \begin{tikzpicture}
 \begin{scope}
-\tikzstyle{Sanode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!20];
-\tikzstyle{Resnode} = [minimum height=1.1em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=yellow!20];
+\tikzstyle{Sanode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!30];
+\tikzstyle{Resnode} = [minimum height=1.1em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=yellow!30];
 \tikzstyle{ffnnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
 \tikzstyle{outputnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
-\tikzstyle{inputnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=red!10];
-\tikzstyle{posnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=black!5!white];
+\tikzstyle{inputnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=red!20];
+\tikzstyle{posnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=black!10!white];
 \tikzstyle{standard} = [rounded corners=3pt]

 \node [Sanode,anchor=west] (sa1) at (0,0) {\tiny{$\textbf{Self-Attention}$}};

--- a/Chapter12/Figures/figure-position-of-feedforward-neural-network-in-the-model.tex
+++ b/Chapter12/Figures/figure-position-of-feedforward-neural-network-in-the-model.tex
@@ -2,12 +2,12 @@

 \begin{tikzpicture}
 \begin{scope}
-\tikzstyle{Sanode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!20];
-\tikzstyle{Resnode} = [minimum height=1.1em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=yellow!20];
-\tikzstyle{ffnnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=blue!20];
+\tikzstyle{Sanode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!30];
+\tikzstyle{Resnode} = [minimum height=1.1em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=yellow!30];
+\tikzstyle{ffnnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=blue!30];
 \tikzstyle{outputnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
-\tikzstyle{inputnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=red!10];
-\tikzstyle{posnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=black!5!white];
+\tikzstyle{inputnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=red!20];
+\tikzstyle{posnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=black!10!white];
 \tikzstyle{standard} = [rounded corners=3pt]

 \node [Sanode,anchor=west] (sa1) at (0,0) {\tiny{$\textbf{Self-Attention}$}};

--- a/Chapter12/Figures/figure-position-of-self-attention-mechanism-in-the-model.tex
+++ b/Chapter12/Figures/figure-position-of-self-attention-mechanism-in-the-model.tex
@@ -3,12 +3,12 @@

 \begin{tikzpicture}
 \begin{scope}
-\tikzstyle{Sanode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!20];
+\tikzstyle{Sanode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!30];
 \tikzstyle{Resnode} = [minimum height=1.1em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
 \tikzstyle{ffnnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
 \tikzstyle{outputnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
-\tikzstyle{inputnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=red!10];
-\tikzstyle{posnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=black!5!white];
+\tikzstyle{inputnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=red!20];
+\tikzstyle{posnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=black!10!white];
 \tikzstyle{standard} = [rounded corners=3pt]

 \node [Sanode,anchor=west] (sa1) at (0,0) {\tiny{$\textbf{Self-Attention}$}};

--- a/Chapter12/Figures/figure-process-of-5.tex
+++ b/Chapter12/Figures/figure-process-of-5.tex
@@ -7,9 +7,9 @@
 \node(tbq) at ([xshift=0.5em,yshift=0]atten.east){
 \begin{tabular}{|c|}
 \hline
-\rowcolor{yellow!20}  \\ \hline 
-\rowcolor{yellow!20}  \\ \hline
-\rowcolor{yellow!20}  \\ \hline 
+\rowcolor{yellow!30}  \\ \hline 
+\rowcolor{yellow!30}  \\ \hline
+\rowcolor{yellow!30}  \\ \hline 
 \end{tabular}
 };
 \node at  ([xshift=0em,yshift=0.5em]tbq.north){$\mathbi{Q}$};
@@ -20,9 +20,9 @@
 \node(tbk) at ([xshift=1em,yshift=0]tbq.east){
 \begin{tabular}{|c|}
 \hline
-\rowcolor{blue!20}  \\ \hline 
-\rowcolor{blue!20}  \\ \hline
-\rowcolor{blue!20}  \\ \hline 
+\rowcolor{blue!30}  \\ \hline 
+\rowcolor{blue!30}  \\ \hline
+\rowcolor{blue!30}  \\ \hline 
 \end{tabular}
 };
 \node at  ([xshift=0em,yshift=0.5em]tbk.north){$\mathbi{K}$};
@@ -33,9 +33,9 @@
 \node(tbv) at ([xshift=1em,yshift=0]tbk.east){
 \begin{tabular}{|c|}
 \hline
-\rowcolor{orange!20}  \\ \hline 
-\rowcolor{orange!20}  \\ \hline
-\rowcolor{orange!20}  \\ \hline 
+\rowcolor{orange!30}  \\ \hline 
+\rowcolor{orange!30}  \\ \hline
+\rowcolor{orange!30}  \\ \hline 
 \end{tabular}
 };
 \node at  ([xshift=0em,yshift=0.5em]tbv.north){$\mathbi{V}$};
@@ -51,9 +51,9 @@
 \node(tbq2) at ([xshift=0.5em,yshift=2em]sof1.east){
 \begin{tabular}{|c|}
 \hline
-\rowcolor{yellow!20}  \\ \hline 
-\rowcolor{yellow!20}  \\ \hline
-\rowcolor{yellow!20}  \\ \hline 
+\rowcolor{yellow!30}  \\ \hline 
+\rowcolor{yellow!30}  \\ \hline
+\rowcolor{yellow!30}  \\ \hline 
 \end{tabular}
 };
 \node at  ([xshift=0em,yshift=0.5em]tbq2.north){$\mathbi{Q}$};
@@ -66,7 +66,7 @@
 \node(tbk2) at ([xshift=2em,yshift=0em]times.east){
 \begin{tabular}{|l|l|l|}
 \hline
-\cellcolor{blue!20} & \cellcolor{blue!20} &\cellcolor{blue!20}  \\ \hline
+\cellcolor{blue!30} & \cellcolor{blue!30} &\cellcolor{blue!30}  \\ \hline
 \end{tabular}
 };
 \node at  ([xshift=0em,yshift=0.5em]tbk2.north){$\mathbi{K}^{\mathrm{T}}$};
@@ -78,9 +78,9 @@
 \node(mask) at  ([xshift=3em,yshift=-2em]tbk2.east){
 \begin{tabular}{|l|l|l|}
 \hline
-\cellcolor{green!20} &\cellcolor{green!20}   &\cellcolor{green!20}   \\ \hline
- \cellcolor{green!20} &\cellcolor{green!20}   &\cellcolor{green!20}   \\ \hline
- \cellcolor{green!20} &\cellcolor{green!20}   &\cellcolor{green!20}   \\ \hline
+\cellcolor{green!30} &\cellcolor{green!30}   &\cellcolor{green!30}   \\ \hline
+ \cellcolor{green!30} &\cellcolor{green!30}   &\cellcolor{green!30}   \\ \hline
+ \cellcolor{green!30} &\cellcolor{green!30}   &\cellcolor{green!30}   \\ \hline
 \end{tabular}
 };
 \node at  ([xshift=0em,yshift=0.5em]mask.north){$\mathbi{Mask}$};
@@ -93,9 +93,9 @@
 \node(tbv2) at ([xshift=1.2em,yshift=0]mask.east){
 \begin{tabular}{|c|}
 \hline
-\rowcolor{orange!20}  \\ \hline 
-\rowcolor{orange!20}  \\ \hline
-\rowcolor{orange!20}  \\ \hline 
+\rowcolor{orange!30}  \\ \hline 
+\rowcolor{orange!30}  \\ \hline
+\rowcolor{orange!30}  \\ \hline 
 \end{tabular}
 };
 \node at  ([xshift=0em,yshift=0.5em]tbv2.north){$\mathbi{V}$};
@@ -108,9 +108,9 @@
 \node(mid) at  ([xshift=1.5em,yshift=0em]sof2.east){
 \begin{tabular}{|l|l|l|}
 \hline
-\cellcolor{pink!30} &\cellcolor{pink!30}   &\cellcolor{pink!30}   \\ \hline
- \cellcolor{pink!30} &\cellcolor{pink!30}   &\cellcolor{pink!30}   \\ \hline
- \cellcolor{pink!30} &\cellcolor{pink!30}   &\cellcolor{pink!30}   \\ \hline
+\cellcolor{pink!40} &\cellcolor{pink!40}   &\cellcolor{pink!40}   \\ \hline
+ \cellcolor{pink!40} &\cellcolor{pink!40}   &\cellcolor{pink!40}   \\ \hline
+ \cellcolor{pink!40} &\cellcolor{pink!40}   &\cellcolor{pink!40}   \\ \hline
 \end{tabular}
 };
 % )
@@ -126,9 +126,9 @@
 \node(tbv3) at ([xshift=0.5em,yshift=0]bra2.east){
 \begin{tabular}{|c|}
 \hline
-\rowcolor{orange!20}  \\ \hline 
-\rowcolor{orange!20}  \\ \hline
-\rowcolor{orange!20}  \\ \hline 
+\rowcolor{orange!30}  \\ \hline 
+\rowcolor{orange!30}  \\ \hline
+\rowcolor{orange!30}  \\ \hline 
 \end{tabular}
 };
 \node at  ([xshift=0em,yshift=0.5em]tbv3.north){$\mathbi{V}$};
@@ -140,9 +140,9 @@
 \node(result) at ([xshift=2em,yshift=0]eq3.east){
 \begin{tabular}{|l|l|l|}
 \hline
-\cellcolor{red!20} &\cellcolor{red!20}   &\cellcolor{red!20}   \\ \hline
-\cellcolor{red!20}&\cellcolor{red!20}   &\cellcolor{red!20}   \\ \hline
-\cellcolor{red!20} &\cellcolor{red!20}  &\cellcolor{red!20}   \\ \hline
+\cellcolor{red!30} &\cellcolor{red!30}   &\cellcolor{red!30}   \\ \hline
+\cellcolor{red!30}&\cellcolor{red!30}   &\cellcolor{red!30}   \\ \hline
+\cellcolor{red!30} &\cellcolor{red!30}  &\cellcolor{red!30}   \\ \hline
 \end{tabular}
 };
 % x
@@ -151,9 +151,9 @@
 \node(tbv4) at ([xshift=0.5em,yshift=0]times.east){
 \begin{tabular}{|c|}
 \hline
-\rowcolor{orange!20}  \\ \hline 
-\rowcolor{orange!20}  \\ \hline
-\rowcolor{orange!20}  \\ \hline 
+\rowcolor{orange!30}  \\ \hline 
+\rowcolor{orange!30}  \\ \hline
+\rowcolor{orange!30}  \\ \hline 
 \end{tabular}
 };
 \node at  ([xshift=0em,yshift=0.5em]tbv4.north){$\mathbi{V}$};

--- a/Chapter12/Figures/figure-self-att-vs-enco-deco-att.tex
+++ b/Chapter12/Figures/figure-self-att-vs-enco-deco-att.tex
 \begin{tikzpicture}
   

-\node[rounded corners=1pt,minimum width=11.0em,minimum height=2.0em,fill=pink!30,draw=black](p1) at (0,0) {\small{Self-Attention}};
+\node[rounded corners=1pt,minimum width=11.0em,minimum height=2.0em,fill=pink!60,draw=black](p1) at (0,0) {\small{Self-Attention}};

 \node[anchor=north](word1) at ([xshift=0.0em,yshift=-2.0em]p1.south) {\small \mathbi{K}};
 \node[anchor=west](word2) at ([xshift=2.2em]word1.east) {\small \mathbi{V}};
@@ -19,7 +19,7 @@

 \node[anchor=north](caption1) at ([xshift=0.0em,yshift=-9.5em]p1.south){\small{(a) Self-Attention的输入}};
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\node[anchor=west,rounded corners=1pt,minimum width=14.0em,minimum height=2.0em,fill=pink!30,draw=black](p2) at ([xshift=5.0em]p1.east){\small{Encoder-Decoder Attention}};
+\node[anchor=west,rounded corners=1pt,minimum width=14.0em,minimum height=2.0em,fill=pink!50,draw=black](p2) at ([xshift=5.0em]p1.east){\small{Encoder-Decoder Attention}};

 \node[anchor=north](word1-2) at ([xshift=0.0em,yshift=-2.0em]p2.south) {\small \mathbi{K}};
 \node[anchor=west](word2-2) at ([xshift=2.2em]word1-2.east) {\small \mathbi{V}};

--- a/Chapter12/Figures/figure-transformer-input-and-position-encoding.tex
+++ b/Chapter12/Figures/figure-transformer-input-and-position-encoding.tex
@@ -6,8 +6,8 @@
 \tikzstyle{Resnode} = [minimum height=1.1em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
 \tikzstyle{ffnnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
 \tikzstyle{outputnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
-\tikzstyle{inputnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=red!10];
-\tikzstyle{posnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=black!5!white];
+\tikzstyle{inputnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=red!20];
+\tikzstyle{posnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=black!10!white];
 \tikzstyle{standard} = [rounded corners=3pt]

 \node [Sanode,anchor=west] (sa1) at (0,0) {\tiny{$\textbf{Self-Attention}$}};

--- a/Chapter12/Figures/figure-transformer.tex
+++ b/Chapter12/Figures/figure-transformer.tex
@@ -2,12 +2,12 @@

 \begin{tikzpicture}
 \begin{scope}
-\tikzstyle{Sanode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!20];
-\tikzstyle{Resnode} = [minimum height=1.1em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=yellow!20];
-\tikzstyle{ffnnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=blue!10];
-\tikzstyle{outputnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=blue!30];
-\tikzstyle{inputnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=red!10];
-\tikzstyle{posnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=black!5!white];
+\tikzstyle{Sanode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!30];
+\tikzstyle{Resnode} = [minimum height=1.1em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=yellow!30];
+\tikzstyle{ffnnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=blue!20];
+\tikzstyle{outputnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=blue!40];
+\tikzstyle{inputnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=red!20];
+\tikzstyle{posnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=black!10!white];
 \tikzstyle{standard} = [rounded corners=3pt]

 \node [Sanode,anchor=west] (sa1) at (0,0) {\tiny{$\textbf{Self-Attention}$}};

--- a/Chapter12/chapter12.tex
+++ b/Chapter12/chapter12.tex
@@ -132,7 +132,7 @@
 \multicolumn{1}{l|}{ConvS2S}             & 25.16           & 40.46           & 1.5$\times 10^{20}$                   \\
 \multicolumn{1}{l|}{MoE}                 & 26.03           & 40.56           & 1.2$\times 10^{20}$                   \\
 \multicolumn{1}{l|}{Transformer (Base Model) }                 & 27.3           &38.1           & 3.3$\times 10^{18}$                   \\
-\multicolumn{1}{l|}{Transformer (Big Model)}    & {\small\sffamily\bfseries{28.4}}   & {\small\sffamily\bfseries{41.8}}   & 2.3$\times 10^{19}$                   \\
+\multicolumn{1}{l|}{Transformer (Big Model)}    & {\small\bfnew{28.4}}   & {\small\bfnew{41.8}}   & 2.3$\times 10^{19}$                   \\
 \end{tabular}
 \end{table}
 %----------------------------------------------
@@ -158,19 +158,19 @@

 \begin{itemize}
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{自注意力子层}}\index{自注意力子层}（Self-Attention Sub-layer）\index{Self-Attention Sub-layer}：使用自注意力机制对输入的序列进行新的表示；
+\item {\small\bfnew{自注意力子层}}\index{自注意力子层}（Self-Attention Sub-layer）\index{Self-Attention Sub-layer}：使用自注意力机制对输入的序列进行新的表示；
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{前馈神经网络子层}}\index{前馈神经网络子层}（Feed-Forward Sub-layer）\index{Feed-Forward Sub-layer}：使用全连接的前馈神经网络对输入向量序列进行进一步变换；
+\item {\small\bfnew{前馈神经网络子层}}\index{前馈神经网络子层}（Feed-Forward Sub-layer）\index{Feed-Forward Sub-layer}：使用全连接的前馈神经网络对输入向量序列进行进一步变换；
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{残差连接}}（标记为“Add”）：对于自注意力子层和前馈神经网络子层，都有一个从输入直接到输出的额外连接，也就是一个跨子层的直连。残差连接可以使深层网络的信息传递更为有效；
+\item {\small\bfnew{残差连接}}（标记为“Add”）：对于自注意力子层和前馈神经网络子层，都有一个从输入直接到输出的额外连接，也就是一个跨子层的直连。残差连接可以使深层网络的信息传递更为有效；
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{层标准化}}（Layer Normalization）：自注意力子层和前馈神经网络子层进行最终输出之前，会对输出的向量进行层标准化，规范结果向量取值范围，这样易于后面进一步的处理。
+\item {\small\bfnew{层标准化}}（Layer Normalization）：自注意力子层和前馈神经网络子层进行最终输出之前，会对输出的向量进行层标准化，规范结果向量取值范围，这样易于后面进一步的处理。
 \vspace{0.5em}
 \end{itemize}

 \parinterval 以上操作就构成了Transformer的一层，各个模块执行的顺序可以简单描述为：Self-Attention $\to$ Residual Connection $\to$ Layer Normalization $\to$ Feed Forward Network $\to$ Residual Connection $\to$ Layer Normalization。编码器可以包含多个这样的层，比如，可以构建一个六层编码器，每层都执行上面的操作。最上层的结果作为整个编码的结果，会被传入解码器。

-\parinterval 解码器的结构与编码器十分类似。它也是由若干层组成，每一层包含编码器中的所有结构，即：自注意力子层、前馈神经网络子层、残差连接和层标准化模块。此外，为了捕捉源语言的信息，解码器又引入了一个额外的{\small\sffamily\bfseries{编码-解码注意力子层}}\index{编码-解码注意力子层}（Encoder-Decoder Attention Sub-layer）\index{Encoder-Decoder Attention Sub-layer}。这个新的子层，可以帮助模型使用源语言句子的表示信息生成目标语言不同位置的表示。编码-解码注意力子层仍然基于自注意力机制，因此它和自注意力子层的结构是相同的，只是$\mathrm{query}$、$\mathrm{key}$、$\mathrm{value}$的定义不同。比如，在解码器端，自注意力子层的$\mathrm{query}$、$\mathrm{key}$、$\mathrm{value}$是相同的，它们都等于解码器每个位置的表示。而在编码-解码注意力子层中，$\mathrm{query}$是解码器每个位置的表示，此时$\mathrm{key}$和$\mathrm{value}$是相同的，等于编码器每个位置的表示。图\ref{fig:12-5}给出了这两种不同注意力子层输入的区别。
+\parinterval 解码器的结构与编码器十分类似。它也是由若干层组成，每一层包含编码器中的所有结构，即：自注意力子层、前馈神经网络子层、残差连接和层标准化模块。此外，为了捕捉源语言的信息，解码器又引入了一个额外的{\small\bfnew{编码-解码注意力子层}}\index{编码-解码注意力子层}（Encoder-Decoder Attention Sub-layer）\index{Encoder-Decoder Attention Sub-layer}。这个新的子层，可以帮助模型使用源语言句子的表示信息生成目标语言不同位置的表示。编码-解码注意力子层仍然基于注意力机制，因此它和自注意力子层的结构是相同的，只是$\mathrm{query}$、$\mathrm{key}$、$\mathrm{value}$的定义不同。比如，在解码器端，自注意力子层的$\mathrm{query}$、$\mathrm{key}$、$\mathrm{value}$是相同的，它们都等于解码器每个位置的表示。而在编码-解码注意力子层中，$\mathrm{query}$是解码器每个位置的表示，此时$\mathrm{key}$和$\mathrm{value}$是相同的，等于编码器每个位置的表示。图\ref{fig:12-5}给出了这两种不同注意力子层输入的区别。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -319,7 +319,7 @@

 \subsection{多头注意力机制}

-\parinterval Transformer中使用的另一项重要技术是{\small\sffamily\bfseries{多头注意力机制}}\index{多头注意力机制}（Multi-head Attention）\index{Multi-head Attention}。“多头”可以理解成将原来的$\mathbi{Q}$、$\mathbi{K}$、$\mathbi{V}$按照隐层维度平均切分成多份。假设切分$h$份，那么最终会得到$\mathbi{Q} = \{ \mathbi{Q}_1,...,\mathbi{Q}_h \}$，$\mathbi{K}=\{ \mathbi{K}_1,...,\mathbi{K}_h \}$，$\mathbi{V}=\{ \mathbi{V}_1,...,\mathbi{V}_h \}$。多头注意力就是用每一个切分得到的$\mathbi{Q}$，$\mathbi{K}$，$\mathbi{V}$独立的进行注意力计算，即第$i$个头的注意力计算结果$\mathbi{head}_i = \textrm{Attention}(\mathbi{Q}_i,\mathbi{K}_i, \mathbi{V}_i)$。
+\parinterval Transformer中使用的另一项重要技术是{\small\bfnew{多头注意力机制}}\index{多头注意力机制}（Multi-head Attention）\index{Multi-head Attention}。“多头”可以理解成将原来的$\mathbi{Q}$、$\mathbi{K}$、$\mathbi{V}$按照隐层维度平均切分成多份。假设切分$h$份，那么最终会得到$\mathbi{Q} = \{ \mathbi{Q}_1,...,\mathbi{Q}_h \}$，$\mathbi{K}=\{ \mathbi{K}_1,...,\mathbi{K}_h \}$，$\mathbi{V}=\{ \mathbi{V}_1,...,\mathbi{V}_h \}$。多头注意力就是用每一个切分得到的$\mathbi{Q}$、$\mathbi{K}$、$\mathbi{V}$独立地进行注意力计算，即第$i$个头的注意力计算结果$\mathbi{head}_i = \textrm{Attention}(\mathbi{Q}_i,\mathbi{K}_i, \mathbi{V}_i)$。

 \parinterval 下面根据图\ref{fig:12-12}详细介绍多头注意力的计算过程：


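上面 Chapter12 的改动涉及缩放点积注意力与多头注意力的图示和文字描述。下面用公式把这两步的计算关系粗略地写出来，仅作示意，并非书中原文。其中$d_k$（每个头的维度）和$\mathbi{W}^{o}$（拼接后线性变换的参数矩阵）是这里为说明而假设的符号；$\mathbi{Mask}$项对应图中的“Mask(opt.)”，不使用掩码时可视为零矩阵；图中注意力计算之前的线性变换（Linear）在此省略。

```latex
% 示意性公式：d_k 与 \mathbi{W}^{o} 为本示例假设的符号，\mathbi 沿用书中已有的宏
\begin{eqnarray}
\textrm{Attention}(\mathbi{Q},\mathbi{K},\mathbi{V}) & = & \textrm{Softmax}\Big( \frac{\mathbi{Q}\mathbi{K}^{\mathrm{T}}}{\sqrt{d_k}} + \mathbi{Mask} \Big) \mathbi{V} \\
\mathbi{head}_i & = & \textrm{Attention}(\mathbi{Q}_i,\mathbi{K}_i,\mathbi{V}_i), \quad i = 1,\ldots,h \\
\textrm{MultiHead}(\mathbi{Q},\mathbi{K},\mathbi{V}) & = & \textrm{Concat}(\mathbi{head}_1,\ldots,\mathbi{head}_h)\,\mathbi{W}^{o}
\end{eqnarray}
```

第一式对应图中 MatMul、Scale、Mask、SoftMax、MatMul 的顺序；后两式对应“切分成$h$份、各自独立计算、再经 Concat 与 Linear 合并”的过程。
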
--- a/Chapter15/Figures/figure-introducing-rnn-mechanism-into-transformer.jpg
+++ b/Chapter15/Figures/figure-introducing-rnn-mechanism-into-transformer.jpg
--- a/Chapter15/Figures/figure-learning-rate.tex
+++ b/Chapter15/Figures/figure-learning-rate.tex
@@ -15,7 +15,7 @@
      ]

      \addplot[red,line width=1.25pt] coordinates {(0,0) (1.6,2) (1.8,1.888) (2,1.787) (2.5,1.606) (3,1.462) (3.5,1.3549) (4,1.266) (4.5,1.193) (5,1.131)};
-      \addlegendentry{\scriptsize 原始学习率}
+      \addlegendentry{\scriptsize 原始的学习率}
      %\addplot[red,line width=1.25pt] coordinates {(0,0) (8000,0.002) (10000,0.00179) (12000,0.00163) (12950,0.001572)};
      \addplot[blue,line width=1.25pt] coordinates {(0,0) (0.8,2) (0.9906,1.7983)};
      %\addplot[red,line width=1.25pt] coordinates {(0,0) (8000,0.002) (9906,0.0017983)};

--- a/Chapter15/Figures/figure-relative-position-coding-and-absolute-position-coding.jpg
+++ b/Chapter15/Figures/figure-relative-position-coding-and-absolute-position-coding.jpg
--- a/Chapter15/Figures/figure-relative-position-weight.tex
+++ b/Chapter15/Figures/figure-relative-position-weight.tex
 \begin{tikzpicture}

 \tikzstyle{node1} = [anchor=center,draw,minimum height=2em,minimum width=2em,inner sep=0pt,fill=green!80]
-\tikzstyle{node2} = [anchor=center,draw,minimum height=2em,minimum width=2em,inner sep=0pt,fill=green!40]
+\tikzstyle{node2} = [anchor=center,draw,minimum height=2em,minimum width=2em,inner sep=0pt,fill=green!50]
 \tikzstyle{node3} = [anchor=center,draw,minimum height=2em,minimum width=2em,inner sep=0pt,fill=green!20]
 \tikzstyle{node4} = [anchor=center,draw,minimum height=2em,minimum width=2em,inner sep=0pt]
 \tikzstyle{node5} = [anchor=center,draw,minimum height=2em,minimum width=2em,inner sep=0pt,fill=red!20]
-\tikzstyle{node6} = [anchor=center,draw,minimum height=2em,minimum width=2em,inner sep=0pt,fill=red!40]
+\tikzstyle{node6} = [anchor=center,draw,minimum height=2em,minimum width=2em,inner sep=0pt,fill=red!50]
 \tikzstyle{node7} = [anchor=center,draw,minimum height=2em,minimum width=2em,inner sep=0pt,fill=red!80]

 \begin{scope}[scale=1.0]

--- a/Chapter15/Figures/figure-weight-visualization-of-convergence-DLCL-network.tex
+++ b/Chapter15/Figures/figure-weight-visualization-of-convergence-DLCL-network.tex
 \begin{tikzpicture}[node distance = 0,scale = 1]
 \tikzstyle{every node}=[scale=1]
-\node[draw=white] (input) at (0,0){\includegraphics[width=0.62\textwidth]{./Chapter15/Figures/DLCL-picture.png}};(1.9,-1.4);
-\node[scale = 2]  at (4.5,3.6){4};
-\node[scale = 2]  at (4.5,1.8){2};
-\node[scale = 2]  at (4.5,0){0};
-\node[scale = 2]  at (4.5,-1.8){-2};
-\node[scale = 2]  at (4.5,-3.6){-4};
-\node[scale = 1.5]  at (-4.5,3.75){$\rm x_{1}$};
-\node[scale = 1.5]  at (-4.5,2.5){$\rm x_{6}$};
-\node[scale = 1.5]  at (-4.5,1.4){$\rm x_{11}$};
-\node[scale = 1.5]  at (-4.5,0.1){$\rm x_{16}$};
-\node[scale = 1.5]  at (-4.5,-1.1){$\rm x_{21}$};
-\node[scale = 1.5]  at (-4.5,-2.3){$\rm x_{26}$};
-\node[scale = 1.5]  at (-4.5,-3.4){$\rm x_{31}$};
-\node[scale = 1.5]  at (-3.8,-4){$\rm y_{0}$};
-\node[scale = 1.5]  at (-2.7,-4){$\rm y_{5}$};
-\node[scale = 1.5]  at (-1.5,-4){$\rm y_{10}$};
-\node[scale = 1.5]  at (-0.3,-4){$\rm y_{15}$};
-\node[scale = 1.5]  at (0.9,-4){$\rm y_{20}$};
-\node[scale = 1.5]  at (2.1,-4){$\rm y_{25}$};
-\node[scale = 1.5]  at (3.3,-4){$\rm y_{30}$};
+\node[draw=white] (input) at (0,0){\includegraphics[width=0.62\textwidth]{./Chapter15/Figures/DLCL-picture.png}};
+{\footnotesize
+\node[anchor=center]  at (4.1,3){4};
+\node[anchor=center]  at (4.1,1.5){2};
+\node[anchor=center]  at (4.1,0){0};
+\node[anchor=center]  at (4.1,-1.5){-2};
+\node[anchor=center]  at (4.1,-3){-4};
+\node[anchor=center]  at (-4.2,3.6){$ x_{1}$};
+\node[anchor=center]  at (-4.2,2.45){$ x_{6}$};
+\node[anchor=center]  at (-4.2,1.3){$ x_{11}$};
+\node[anchor=center]  at (-4.2,0.15){$ x_{16}$};
+\node[anchor=center]  at (-4.2,-1){$ x_{21}$};
+\node[anchor=center]  at (-4.2,-2.15){$ x_{26}$};
+\node[anchor=center]  at (-4.2,-3.3){$ x_{31}$};
+\node[anchor=center]  at (-3.75,-3.8){$ y_{0}$};
+\node[anchor=center]  at (-2.6,-3.8){$ y_{5}$};
+\node[anchor=center]  at (-1.45,-3.8){$ y_{10}$};
+\node[anchor=center]  at (-0.3,-3.8){$ y_{15}$};
+\node[anchor=center]  at (0.85,-3.8){$ y_{20}$};
+\node[anchor=center]  at (2,-3.8){$ y_{25}$};
+\node[anchor=center]  at (3.15,-3.8){$ y_{30}$};
+}
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/chapter15.tex
+++ b/Chapter15/chapter15.tex
@@ -46,7 +46,16 @@

 \parinterval 但是，Transformer模型中的自注意力机制本身并不具有这种性质，而且它直接忽略了输入单元之间的位置关系。虽然，Transformer中引入了基于正余弦函数的绝对位置编码（见{\chaptertwelve}），但是该方法仍然无法显性区分局部依赖与长距离依赖\footnote[1]{局部依赖指当前位置与局部的相邻位置之间的联系。}。

-\parinterval 针对上述问题，研究人员设计了“相对位置”编码，对原有的“绝对位置”编码进行补充，强化了局部依赖\upcite{Dai2019TransformerXLAL,Shaw2018SelfAttentionWR}。此外，由于模型中每一层均存在自注意力机制计算，因此模型捕获位置信息的能力也逐渐减弱，这种现象在深层模型中尤为明显。而利用相对位置编码能够把位置信息显性加入到每一层的注意力机制的计算中，进而强化深层模型中局部位置的表示能力\upcite{li2020shallow}。{\color{red} 图XXX对比了Transformer中相对位置编码和绝对位置编码方法。}
+\parinterval 针对上述问题，研究人员设计了“相对位置”编码，对原有的“绝对位置”编码进行补充，强化了局部依赖\upcite{Dai2019TransformerXLAL,Shaw2018SelfAttentionWR}。此外，由于模型中每一层均存在自注意力机制计算，因此模型捕获位置信息的能力也逐渐减弱，这种现象在深层模型中尤为明显。而利用相对位置编码能够把位置信息显性加入到每一层的注意力机制的计算中，进而强化深层模型中局部位置的表示能力\upcite{li2020shallow}。图\ref{fig:15-1}对比了Transformer中相对位置编码和绝对位置编码方法。
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.5]{./Chapter15/Figures/figure-relative-position-coding-and-absolute-position-coding.jpg}
+\caption{绝对位置编码（左）和相对位置编码（右）}
+\label{fig:15-1}
+\end{figure}
+%-------------------------------------------

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -94,12 +103,22 @@
 \label{eq:15-9}
 \end{eqnarray}

-\noindent 其中，$\mathbi{w}^K \in \mathbb{R}^{d_k}$和$\mathbi{w}^V \in \mathbb{R}^{d_k}$是模型中可学习的参数矩阵；$\textrm{clip}(\cdot,\cdot)$表示截断操作，由公式\eqref{eq:15-9}定义。可以看出，$\mathbi{a}^K$ 与$\mathbi{a}^V$ 是根据输入的相对位置信息（由$\textrm{clip}(j-i,k)$确定）对$\mathbi{w}^K$和$\mathbi{w}^V$进行查表得到的向量，即相对位置表示。这里通过预先设定的最大相对位置$k$，强化模型对当前词为中心的左右各$k$ 个词的注意力计算。因此，最终的窗口大小为$2k + 1$。 对于边缘位置窗口大小不足$2k$的单词，采用了裁剪的机制，即只对有效的临近词进行建模。此时，注意力模型的计算可以调整为：
+\noindent 其中，$\mathbi{w}^K \in \mathbb{R}^{d_k}$和$\mathbi{w}^V \in \mathbb{R}^{d_k}$是模型中可学习的参数矩阵；$\textrm{clip}(\cdot,\cdot)$表示截断操作，由公式\eqref{eq:15-9}定义。可以看出，$\mathbi{a}^K$ 与$\mathbi{a}^V$ 是根据输入的相对位置信息（由$\textrm{clip}(j-i,k)$确定）对$\mathbi{w}^K$和$\mathbi{w}^V$进行查表得到的向量，即相对位置表示，如图\ref{fig:15-2}所示。这里通过预先设定的最大相对位置$k$，强化模型对当前词为中心的左右各$k$ 个词的注意力计算。因此，最终的窗口大小为$2k + 1$。 对于边缘位置窗口大小不足$2k$的单词，采用了裁剪的机制，即只对有效的临近词进行建模。此时，注意力模型的计算可以调整为：
 \begin{eqnarray}
 \mathbi{z}_{i} &=& \sum_{j=1}^m \alpha_{ij}(\mathbi{x}_j \mathbi{W}_V + \mathbi{a}_{ij}^V)
 \label{eq:15-10}
 \end{eqnarray}
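A minimal Python sketch of the relative-position lookup described in this hunk. It assumes that the clipping operation referenced by eq:15-9 (whose body is not shown in this diff) is `clip(x, k) = max(-k, min(k, x))`. The helper names `relative_index` and `index_matrix` are hypothetical and only illustrate how `clip(j-i, k)` selects one of `2k+1` table rows.

```python
def clip(x: int, k: int) -> int:
    """Clamp a relative distance x to the range [-k, k] (assumed definition)."""
    return max(-k, min(k, x))


def relative_index(i: int, j: int, k: int) -> int:
    """Row of a (2k+1)-row lookup table used for the position pair (i, j)."""
    return clip(j - i, k) + k


def index_matrix(m: int, k: int) -> list:
    """Table rows for every pair of positions in a length-m sequence."""
    return [[relative_index(i, j, k) for j in range(m)] for i in range(m)]
```

Positions further apart than `k` share the same row, which is why the effective window is `2k + 1`.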

+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter15/Figures/figure-relative-position-weight}
+\caption{相对位置权重$\mathbi{a}_{ij}$}
+\label{fig:15-2}
+\end{figure}
+%-------------------------------------------
+
+
 \noindent 相比于公式\eqref{eq:15-4}，公式\eqref{eq:15-10}在计算$\mathbi{z}_i$时引入了额外的向量$\mathbi{a}_{ij}^V$，用它来表示位置$i$与位置$j$之间的相对位置信息。同时在计算注意力权重时对$\mathbi{K}$进行修改，同样引入了$\mathbi{a}_{ij}^K$向量表示位置$i$与位置$j$之间的相对位置。在公式\eqref{eq:15-6}的基础上，注意力权重的计算方式调整为：
 \begin{eqnarray}
 \mathbi{e}_{ij} &=& \frac{\mathbi{x}_i \mathbi{W}_Q{(\mathbi{x}_j \mathbi{W}_K + \mathbi{a}_{ij}^K )}^{T}}{\sqrt{d_k}} \nonumber \\
@@ -109,15 +128,6 @@

 \noindent 可以注意到，公式\eqref{eq:15-10}和公式\eqref{eq:15-11}将位置编码信息直接暴露给每一层注意力机制的计算，而不是像标准Transformer中只将其作为整个模型的输入。

-%----------------------------------------------
-\begin{figure}[htp]
-\centering
-\input{./Chapter15/Figures/figure-relative-position-weight}
-\caption{相对位置权重$\mathbi{a}_{ij}$}
-\label{fig:15-1}
-\end{figure}
-%-------------------------------------------
-
 \vspace{0.5em}
 \item Transformer-XL\upcite{Dai2019TransformerXLAL}。在Transformer中，模型的输入由词嵌入表示与绝对位置编码组成，例如，对于输入层有，$x_i = \mathbi{E}_{x_i} + \mathbi{U}_i$，$x_j=\mathbi{E}_{x_j} + \mathbi{U}_j$，其中$\mathbi{E}_{x_i} $和$\mathbi{E}_{x_j} $表示词嵌入，$\mathbi{U}_i$和$\mathbi{U}_j$表示绝对位置编码（正余弦函数）。将其代入公式\eqref{eq:15-6}中可以得到：
 \begin{eqnarray}
@@ -159,12 +169,21 @@ A_{ij}^{\rm rel} &=& \underbrace{\mathbi{E}_{x_i}\mathbi{W}_Q\mathbi{W}_{K}^{T}\

 \begin{itemize}
 \vspace{0.5em}
-\item {\small\bfnew{引入高斯约束}}\upcite{Yang2018ModelingLF}。如图\ref{fig:15-2}所示，这类方法的核心思想是引入可学习的高斯分布$\mathbi{G}$作为局部约束，与注意力权重进行融合，具体的形式如下：
+\item {\small\bfnew{引入高斯约束}}\upcite{Yang2018ModelingLF}。如图\ref{fig:15-3}所示，这类方法的核心思想是引入可学习的高斯分布$\mathbi{G}$作为局部约束，与注意力权重进行融合，具体的形式如下：
 \begin{eqnarray}
 \mathbi{e}_{ij} &=& \frac{(\mathbi{x}_i \mathbi{W}_Q){(\mathbi{x}_j \mathbi{W}_K)}^{T}}{\sqrt{d_k}} + \mathbi{G}
 \label{eq:15-15}
 \end{eqnarray}

+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter15/Figures/figure-attention-distribution-based-on-gaussian-distribution}
+\caption{融合高斯分布的注意力分布}
+\label{fig:15-3}
+\end{figure}
+%-------------------------------------------
+
 \noindent 其中，$\mathbi{G} \in \mathbb{R}^{m\times m}$。$\mathbi{G}$中的每个元素$G_{ij}$表示位置$j$和预测的中心位置$P_i$之间的关联程度，计算公式如下：
 \begin{eqnarray}
 G_{ij} &=& - \frac{{(j - P_i)}^2}{2\sigma_i^2}
@@ -186,24 +205,15 @@ v_i &=& \mathbi{I}_d^T\textrm{Tanh}(\mathbi{W}_d\mathbi{Q}_i)

 \noindent 其中，$\mathbi{W}_p$、$\mathbi{W}_d$、$\mathbi{I}_p$、$\mathbi{I}_d$均为模型中可学习的参数矩阵。

-%----------------------------------------------
-\begin{figure}[htp]
-\centering
-\input{./Chapter15/Figures/figure-attention-distribution-based-on-gaussian-distribution}
-\caption{融合高斯分布的注意力分布}
-\label{fig:15-2}
-\end{figure}
-%-------------------------------------------
-
 \vspace{0.5em}
-\item {\small\bfnew{多尺度局部建模}}\upcite{DBLP:conf/aaai/GuoQLXZ20}。不同于上述方法直接作用于注意力权重，多尺度局部建模通过赋予多头不一样的局部感受野，间接地引入局部约束（图\ref{fig:15-3}）。
+\item {\small\bfnew{多尺度局部建模}}\upcite{DBLP:conf/aaai/GuoQLXZ20}。不同于上述方法直接作用于注意力权重，多尺度局部建模通过赋予多头不一样的局部感受野，间接地引入局部约束（图\ref{fig:15-4}）。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter15/Figures/figure-multi-scale-local-modeling}
 \caption{多尺度局部建模\upcite{DBLP:conf/aaai/GuoQLXZ20}}
-\label{fig:15-3}
+\label{fig:15-4}
 \end{figure}
 %-------------------------------------------

@@ -242,16 +252,16 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}
 \vspace{0.5em}
 \item {\small\bfnew{使用轻量卷积和动态卷积神经网络}}\upcite{Wu2019PayLA,DBLP:conf/interspeech/GulatiQCPZYHWZW20}。比如，分别在编码端和解码端利用轻量卷积或动态卷积神经网络（见{\chapternine}）替换Transformer的自注意力机制，同时保留解码端的编码-解码注意力机制，一定程度上加强了模型对局部信息的建模能力，同时提高了计算效率。
 \vspace{0.5em}
-\item {\small\bfnew{使用1维卷积注意力网络}}（图\ref{fig:15-4}(b)）。可以使用一维的卷积自注意力网络（1D-CSAN）将关注的范围限制在相近的元素窗口中。其形式上十分简单，只需预先设定好局部建模的窗口大小$D$，并在进行注意力权重计算和对Value值进行加权求和时，将其限制在设定好的窗口范围内即可。
+\item {\small\bfnew{使用1维卷积注意力网络}}（图\ref{fig:15-5}(b)）。可以使用一维的卷积自注意力网络（1D-CSAN）将关注的范围限制在相近的元素窗口中。其形式上十分简单，只需预先设定好局部建模的窗口大小$D$，并在进行注意力权重计算和对Value值进行加权求和时，将其限制在设定好的窗口范围内即可。
 \vspace{0.5em}
-\item {\small\bfnew{使用2维卷积注意力网络}}（图\ref{fig:15-4}(c)）。在一维卷积注意力网络的基础上，对多个注意力头之间的信息进行交互建模，打破了注意力头之间的界限。 1D-CDAN的关注区域为$1\times D$，当将其扩展为二维矩形$D \times N$，长和宽分别为局部窗口的大小和参与建模的自注意力头的个数。这样，模型可以计算某个头中的第$i$个元素和另一个头中的第$j$个元素之间的相关性系数。实现了对不同子空间之间关系的建模，所得到的注意力分布表示了头之间的依赖关系。
+\item {\small\bfnew{使用2维卷积注意力网络}}（图\ref{fig:15-5}(c)）。在一维卷积注意力网络的基础上，对多个注意力头之间的信息进行交互建模，打破了注意力头之间的界限。 1D-CSAN的关注区域为$1\times D$，将其扩展为二维矩形$D \times N$后，长和宽分别为局部窗口的大小和参与建模的自注意力头的个数。这样，模型可以计算某个头中的第$i$个元素和另一个头中的第$j$个元素之间的相关性系数，从而实现对不同子空间之间关系的建模，所得到的注意力分布表示了头之间的依赖关系。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter15/Figures/figure-convolutional-attention-network}
 \caption{卷积注意力模型示意图\upcite{DBLP:journals/corr/abs-1904-03107}}
-\label{fig:15-4}
+\label{fig:15-5}
 \end{figure}
 %-------------------------------------------
 \end{itemize}
@@ -272,14 +282,14 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}
 \vspace{0.5em}
 \item Weighted Transformer\upcite{DBLP:journals/corr/abs-1711-02132}。其主要思想是在多头自注意力机制的基础上保留不同表示空间的特征。传统方法使用级联操作并通过线性映射矩阵来融合不同头之间的信息，而Weighted Transformer直接利用线性映射将维度为$d_k$ 的向量表示映射到$d_{\rm model}$维的向量。之后，将这个$d_{\rm model}$维向量分别送入每个分支中的前馈神经网络，最后对不同分支的输出进行线性加权。但是，这个模型的计算复杂度要大于标准的Transformer模型。
 \vspace{0.5em}
-\item 多分支注意力模型\upcite{DBLP:journals/corr/abs-2006-10270}。不同于Weighted Transformer模型，多分支注意力模型直接利用每个分支独立地进行自注意力模型的计算（图\ref{fig:15-5}）。同时为了避免结构相同的多个多头注意力机制之间的协同适应，这种模型使用Dropout方法在训练过程中以一定的概率随机地丢弃一些分支。
+\item 多分支注意力模型\upcite{DBLP:journals/corr/abs-2006-10270}。不同于Weighted Transformer模型，多分支注意力模型直接利用每个分支独立地进行自注意力模型的计算（图\ref{fig:15-6}）。同时为了避免结构相同的多个多头注意力机制之间的协同适应，这种模型使用Dropout方法在训练过程中以一定的概率随机地丢弃一些分支。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter15/Figures/figure-multi-branch-attention-model}
 \caption{多分支注意力模型}
-\label{fig:15-5}
+\label{fig:15-6}
 \end{figure}
 %-------------------------------------------

@@ -307,7 +317,16 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}

 \parinterval 虽然，Transformer模型完全摒弃了循环单元与卷积单元，仅通过位置编码来区分序列中的不同位置。但是，循环神经网络也非常适用于处理序列结构，且其结构成熟、易于优化。因此，有研究人员尝试将其与Transformer模型融合。这种方式一方面能够发挥循环神经网络简单高效的特点，另一方面也能够发挥Transformer模型在特征提取方面的优势，是一种非常值得探索的思路\upcite{Chen2018TheBO}。

-\parinterval 在Transformer模型中引入循环神经网络的一种方法是，对深层网络的不同层使用循环机制。早在残差网络提出时，研究人员已经开始尝试探讨残差网络成功背后的原因\upcite{DBLP:conf/nips/VeitWB16,DBLP:journals/corr/GreffSS16,DBLP:conf/iclr/ChangMHTB18}。本质上，在卷积神经网络中引入残差连接后，神经网络从深度上隐性地利用循环的特性。也就是，多层Transformer的不同层本身也可以被看作是一个处理序列，只是序列中不同位置（对应不同层）的模型参数独立，而非共享。Transformer编码器与解码器分别由$N$个相同结构但参数独立的块堆叠而成，其中编码块与解码块中分别包含2和3个子层。同时，子层之间引入了残差连接保证了网络信息传递的高效性。因此，一个自然的想法是通过共享不同块之间的参数，引入循环神经网络中的归纳偏置\upcite{DBLP:conf/iclr/DehghaniGVUK19}。其中每层的权重是共享的，并引入了基于时序的编码向量用于显著区分不同深度下的时序信息，{\color{red} 如图XXX所示}。在训练大容量预训练模型时同样也采取了共享层间参数的方式\upcite{Lan2020ALBERTAL}。
+\parinterval 在Transformer模型中引入循环神经网络的一种方法是，对深层网络的不同层使用循环机制。早在残差网络提出时，研究人员已经开始尝试探讨残差网络成功背后的原因\upcite{DBLP:conf/nips/VeitWB16,DBLP:journals/corr/GreffSS16,DBLP:conf/iclr/ChangMHTB18}。本质上，在卷积神经网络中引入残差连接后，神经网络从深度上隐性地利用循环的特性。也就是，多层Transformer的不同层本身也可以被看作是一个处理序列，只是序列中不同位置（对应不同层）的模型参数独立，而非共享。Transformer编码器与解码器分别由$N$个相同结构但参数独立的块堆叠而成，其中编码块与解码块中分别包含2和3个子层。同时，子层之间引入了残差连接保证了网络信息传递的高效性。因此，一个自然的想法是通过共享不同块之间的参数，引入循环神经网络中的归纳偏置\upcite{DBLP:conf/iclr/DehghaniGVUK19}。其中每层的权重是共享的，并引入了基于时序的编码向量用于显著区分不同深度下的时序信息，如图\ref{fig:15-8}所示。在训练大容量预训练模型时同样也采取了共享层间参数的方式\upcite{Lan2020ALBERTAL}。
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.5]{./Chapter15/Figures/figure-introducing-rnn-mechanism-into-transformer.jpg}
+\caption{在Transformer中引入循环机制}
+\label{fig:15-8}
+\end{figure}
+%-------------------------------------------

 \parinterval 另一种方法是，利用循环神经网络对输入序列进行编码，之后通过门控机制将得到的结果与Transformer进行融合\upcite{DBLP:conf/naacl/HaoWYWZT19}。融合机制可以采用串行计算或并行计算。

@@ -620,7 +639,7 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}
 \label{eq:15-43}
 \end{eqnarray}

-\noindent 其中，$l$为对应的神经网络的层次，$\alpha$为预先设定的超参数来控制缩放的比例。这样，可以通过缩减顶层神经网络输出与输入之间的差异，减少顶层神经网络参数的梯度范数，从而缓解由于神经网络过深所带来的梯度消失问题。
+\noindent 其中，$l$为对应的神经网络的层数，$\alpha$为预先设定的超参数来控制缩放的比例。这样，可以通过缩减顶层神经网络输出与输入之间的差异，减少顶层神经网络参数的梯度范数，从而缓解由于神经网络过深所带来的梯度消失问题。

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -637,17 +656,16 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}
 \item 计算标准差：${\bm  \sigma}=\textrm{std}(\mathbi{x}_l+\mathbi{y}_l)$
 \vspace{0.5em}
 \item 根据均值和标准差对输入进行放缩，如下：
-
 \begin{eqnarray}
 \mathbi{x}_{l+1}^{\textrm{post}} &=& \frac{\mathbi{x}_l+\mathbi{y}_l-{\bm  \mu}}{\bm  \sigma} \cdot \mathbi{w}+\mathbi{b}
-\label{eq:15-41}
+\label{eq:15-44}
 \end{eqnarray}

-\noindent 其中，$\mathbi{w}$和$\mathbi{b}$为可学习参数。进一步将公式\eqref{eq:15-41}展开后可得：
+\noindent 其中，$\mathbi{w}$和$\mathbi{b}$为可学习参数。进一步将公式\eqref{eq:15-44}展开后可得：
 \begin{eqnarray}
 \mathbi{x}_{l+1}^{\textrm{post}} &=& \frac{\mathbi{x}_l+\mathbi{y}_l}{\bm  \sigma} \cdot \mathbi{w} - \frac{\bm  \mu}{\bm  \sigma} \cdot \mathbi{w}+\mathbi{b} \nonumber \\
                                 &=& \frac{\mathbi{w}}{\bm  \sigma} \cdot \mathbi{x}_{l+1}^{\textrm{pre}}-\frac{\mathbi{w}}{\bm  \sigma} \cdot {\bm  \mu}+\mathbi{b}
-\label{eq:15-42}
+\label{eq:15-45}
 \end{eqnarray}
 \vspace{0.5em}
 \end{itemize}
@@ -655,7 +673,7 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}
 \parinterval 可以看到相比于Pre-Norm的计算方式，基于Post-Norm的Transformer中子层的输出为Pre-Norm形式的$\frac{\mathbi{w}}{\bm  \sigma}$倍。当$\frac{\mathbi{w}}{\bm  \sigma}<1$时，$\mathbi{x}_l$较小，输入与输出之间差异过大，导致深层Transformer系统难以收敛。Lipschitz 初始化策略通过维持条件$\frac{\mathbi{w}}{\bm  \sigma}>1$，保证网络输入与输出范数一致，进而缓解梯度消失的问题\upcite{DBLP:conf/acl/XuLGXZ20}。一般情况下，$\mathbi{w}$可以被初始化为1，因此Lipschitz 初始化方法最终的约束条件则为：
 \begin{eqnarray}
 0\ <\ {\bm  \sigma} &=& \textrm{std}(\mathbi{x}_l+\mathbi{y}_l) \ \leq\  1
-\label{eq:15-43}
+\label{eq:15-46}
 \end{eqnarray}

 %----------------------------------------------------------------------------------------
@@ -664,7 +682,7 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}

 \subsubsection{3. T-Fixup初始化策略}

-\parinterval 另外一种初始化方法是从神经网络结构与优化器的计算方式入手。Post-Norm结构在Warmup阶段难以很精确地估计参数的二阶动量，导致训练不稳定问题\upcite{huang2020improving}。 也就是，层标准化是导致深层Transformer难以优化的主要原因之一\upcite{WangLearning}。Post-Norm方式下Transformer的底层网络，尤其是编码器的词嵌入层面临严重的梯度消失问题。问题的原因在于，在不改变层标准化位置的条件下，由于Adam优化器利用滑动平均的方式来估计参数的二阶矩，其方差是无界的。这样，在前期模型只能看到有限数量样本的前提下，其二阶矩很难进行有效的估计。因此反向更新参数时会导致参数的梯度方差过大。
+\parinterval 另外一种初始化方法是从神经网络结构与优化器的计算方式入手。Post-Norm结构在Warmup阶段难以精确地估计参数的二阶矩，这导致了训练不稳定问题\upcite{huang2020improving}。也就是，层标准化是导致深层Transformer难以优化的主要原因之一\upcite{WangLearning}。Post-Norm方式下Transformer的底层网络，尤其是编码器的词嵌入层面临严重的梯度消失问题。该问题的原因在于，在不改变层标准化位置的条件下，Adam优化器利用滑动平均的方式来估计参数的二阶矩，其方差是无界的。在训练阶段的前期，由于模型只能看到有限数量的样本，因此很难有效地估计参数的二阶矩，导致反向更新参数时参数的梯度方差过大。

 \parinterval 除了用Pre-Norm代替Post-Norm结构来训练深层网络，也可以采用去除Warmup策略并移除层标准化机制的方式，并对神经网络中不同的参数矩阵制定相应的缩放机制来保证训练的稳定性\upcite{huang2020improving}。具体的缩放策略如下：

@@ -672,13 +690,13 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}
 \vspace{0.5em}
 \item 类似于标准的Transformer初始化方式，使用Xavier初始化方式来初始化除了词嵌入以外的所有参数矩阵。词嵌入矩阵服从$\mathbb{N}(0,d^{-\frac{1}{2}})$的高斯分布，其中$d$代表词嵌入的维度。
 \vspace{0.5em}
-\item 对编码器中自注意力模型中的参数矩阵以及前馈神经网络中所有参数矩阵进行缩放因子为$0.67 {L}^{-\frac{1}{4}}$的缩放，$M$ 为编码器层数。
+\item 对编码器中自注意力机制的参数矩阵以及前馈神经网络中所有参数矩阵进行缩放因子为$0.67 {L}^{-\frac{1}{4}}$的缩放，其中$L$为编码器层数。
 \vspace{0.5em}
-\item 对解码器中注意力模型中的参数矩阵以及前馈神经网络中所有参数矩阵进行缩放因子为$(9 {M})^{-\frac{1}{4}}$的缩放，其中$L$为解码器层数。
+\item 对解码器中全部注意力机制的参数矩阵以及前馈神经网络中所有参数矩阵进行缩放因子为$(9 {M})^{-\frac{1}{4}}$的缩放，其中$M$为解码器层数。
 \vspace{0.5em}
 \end{itemize}
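The two scaling factors in the list above can be sketched in Python as follows. This only computes the factors from the encoder and decoder depths; which parameter matrices they are applied to follows the list above. The function name `tfixup_scales` and its argument names are hypothetical, chosen for illustration.

```python
def tfixup_scales(num_encoder_layers: int, num_decoder_layers: int) -> tuple:
    """Return (encoder_scale, decoder_scale) for the T-Fixup style rescaling.

    encoder_scale = 0.67 * (encoder depth) ** (-1/4)
    decoder_scale = (9 * decoder depth) ** (-1/4)
    """
    encoder_scale = 0.67 * num_encoder_layers ** -0.25
    decoder_scale = (9 * num_decoder_layers) ** -0.25
    return encoder_scale, decoder_scale
```

Both factors shrink as the model gets deeper, so deeper stacks start from smaller parameter magnitudes.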

-\parinterval 这种初始化方法由于没有Warmup策略，学习率会直接从峰值根据参数的更新次数进行退火，大幅度增大了模型收敛的时间。但是，如何进一步解决该初始化方法下的模型收敛速度是比较关键的问题。
+\parinterval 这种初始化方法由于没有Warmup策略，学习率会直接从峰值根据参数的更新次数进行退火，大幅度增加了模型收敛的时间。因此，如何进一步提升该初始化方法下的模型收敛速度是比较关键的问题。

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -686,10 +704,10 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}

 \subsubsection{4. ADMIN初始化策略}

-\parinterval 也有研究发现Post-Norm结构在训练过程中过度依赖残差支路，在训练初期很容易发生参数梯度方差过大的现象\upcite{DBLP:conf/emnlp/LiuLGCH20}。经过分析发现，虽然底层神经网络发生梯度消失是导致训练不稳定的重要因素，但并不是唯一因素。例如，标准Transformer模型中梯度消失的原因在于使用Post-Norm 结构的解码器。尽管通过调整模型结构解决梯度消失问题，模型训练不稳定的问题仍然没有很好地解决。研究人员观测到Post-Norm 结构在训练过程中过于依赖残差支路，而Pre-Norm结构在训练过程中逐渐呈现出对残差支路的依赖性，这更易于网络的训练。进一步从参数更新的角度出发，Pre-Norm由于参数的改变导致网络输出变化的方差经推导后可以表示为$O(\log L)$，而Post-Norm对应的方差为O($L$)。因此，可以尝试减小Post-Norm中由于参数更新导致的输出的方差值，从而达到稳定训练的目的。针对该问题，可以采用两阶段的初始化方法。这里，可以重新定义子层之间的残差连接如下：
+\parinterval 也有研究发现Post-Norm结构在训练过程中过度依赖残差支路，在训练初期很容易发生参数梯度方差过大的现象\upcite{DBLP:conf/emnlp/LiuLGCH20}。经过分析发现，虽然底层神经网络发生梯度消失是导致训练不稳定的重要因素，但并不是唯一因素。例如，标准Transformer模型中梯度消失的原因在于使用Post-Norm 结构的解码器。尽管通过调整模型结构解决了梯度消失问题，但是模型训练不稳定的问题仍然没有被很好地解决。研究人员观测到Post-Norm结构在训练过程中过于依赖残差支路，而Pre-Norm结构在训练过程中逐渐呈现出对残差支路的依赖性，这更易于网络的训练。进一步，从参数更新的角度出发，Pre-Norm由于参数的改变导致网络输出变化的方差经推导后可以表示为$O(\log L)$，而Post-Norm对应的方差为$O(L)$。因此，可以尝试减小Post-Norm中由于参数更新导致的输出的方差值，从而达到稳定训练的目的。针对该问题，可以采用两阶段的初始化方法。这里，可以重新定义子层之间的残差连接如下：
 \begin{eqnarray}
 \mathbi{x}_{l+1} &=& \mathbi{x}_l \cdot {\bm  \omega_{l+1}} + F_{l+1}(\mathbi{x}_l)
-\label{eq:15-44}
+\label{eq:15-47}
 \end{eqnarray}

 \noindent 其两阶段的初始化方法如下所示：
@@ -698,10 +716,10 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}
 \vspace{0.5em}
 \item Profiling阶段：${\bm  \omega_{l+1}} = 1$，只进行前向计算，无需进行梯度计算。在训练样本上计算$F_{l+1}(\mathbi{x}_l)$的方差
 \vspace{0.5em}
-\item 	Initialization阶段：通过Profiling阶段得到的$F_{l+1}(\mathbi{x}_l)$的方差来初始化$\bm  \omega_{l+1}$：
+\item Initialization阶段：通过Profiling阶段得到的$F_{l+1}(\mathbi{x}_l)$的方差来初始化$\bm  \omega_{l+1}$：
 \begin{eqnarray}
-{\bm  \omega_{l+1}} &=& \sqrt{\sum_{j<l}\textrm{Var}[F_{l+1}(\mathbi{x}_l)]}
-\label{eq:15-45}
+{\bm \omega_{l+1}} &=& \sqrt{\sum_{j \leq l}\textrm{Var}[F_{j}(\mathbi{x}_{j-1})]}
+\label{eq:15-48}
 \end{eqnarray}
 \vspace{0.5em}
 \end{itemize}
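A rough Python sketch of the two-stage procedure above. It rests on an assumption: the weight of a sublayer is taken as the square root of the accumulated output variances of the sublayers before it, which is how the ADMIN paper states the rule. The function names and the toy data layout (one list of sampled output values per sublayer) are hypothetical.

```python
import math
from statistics import pvariance


def profile_variances(sublayer_outputs: list) -> list:
    """Profiling stage: forward pass only, one variance per sublayer output."""
    return [pvariance(values) for values in sublayer_outputs]


def admin_omegas(variances: list) -> list:
    """Initialization stage: omega of each sublayer from the variances below it."""
    omegas = []
    accumulated = 0.0
    for variance in variances:
        # Uses only the variances of earlier sublayers (assumed reading of the rule).
        omegas.append(math.sqrt(accumulated))
        accumulated += variance
    return omegas
```

Under this reading the weight of the residual branch grows with depth, which damps the relative contribution of each new sublayer output.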
@@ -724,7 +742,7 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}

 \parinterval 所谓渐进式训练是指从浅层神经网络开始，在训练过程中逐渐增加模型的深度。一种比较简单的方法是将模型分为浅层部分和深层部分，之后分别进行训练，最终达到提高模型翻译性能的目的\upcite{DBLP:conf/acl/WuWXTGQLL19}。

-\parinterval 另一种方式是动态构建深层模型，并尽可能复用浅层部分的训练结果\upcite{li2020shallow}。假设开始的时候模型包含$l$层神经网络，然后训练这个模型至收敛。之后，直接拷贝这$l$ 层神经网络（包括参数），并堆叠出一个$2l$ 层的模型。之后继续训练，重复这个过程。进行$n$次之后就得到了$(n+1)l$层的模型。图\ref{fig:15-15}给出了在编码器上使用渐进式训练的示意图。
+\parinterval 另一种方式是动态构建深层模型，并尽可能复用浅层部分的训练结果\upcite{li2020shallow}。假设开始的时候模型包含$l$层神经网络，然后训练这个模型至收敛。之后，直接拷贝这$l$层神经网络（包括参数），并堆叠出一个$2l$ 层的模型。之后继续训练，重复这个过程。进行$n$次之后就得到了$(n+1) \times l$层的模型。图\ref{fig:15-15}给出了在编码器上使用渐进式训练的示意图。
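The stacking schedule in this paragraph can be sketched in Python as below. It assumes that each growth step copies the topmost block of `l` layers, parameters included; the paragraph does not say which block is copied. Layers are stand-in dictionaries, training is omitted, and the names `grow` and `progressive_depths` are hypothetical.

```python
import copy


def grow(layers: list, block_size: int) -> list:
    """Stack a deep copy of the top `block_size` layers on top of the model."""
    return layers + copy.deepcopy(layers[-block_size:])


def progressive_depths(block_size: int, num_growths: int) -> list:
    """Model depth after each growth step, starting from `block_size` layers."""
    layers = [{"id": i} for i in range(block_size)]
    depths = [len(layers)]
    for _ in range(num_growths):
        # A real system would train the current model to convergence here.
        layers = grow(layers, block_size)
        depths.append(len(layers))
    return depths
```

After `n` growth steps the depth is `(n + 1) * l`, matching the count given in the paragraph.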

 %----------------------------------------------
 \begin{figure}[htp]
@@ -743,7 +761,7 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}

 \subsubsection{2. 分组稠密连接}

-\parinterval 很多研究工作已经发现深层模型不同层之间的稠密连接能够很明显地提高信息传递的效率\upcite{WangLearning,DBLP:conf/cvpr/HuangLMW17,Dou2018ExploitingDR,DBLP:conf/acl/WuWXTGQLL19}。与此同时，对之前层信息的不断复用有助于得到更好的表示，但也带来了计算代价过大的问题。在动态线性层聚合方法（DLCL）中，每一次聚合时都需要重新计算之前每一层表示对当前层输入的贡献度，因此伴随着编码器整体深度的增加，这部分的计算代价变得不可忽略。例如，一个基于动态层聚合的48层Transformer模型比不使用动态层聚合进行训练慢近2倍。同时，缓存中间结果也增加了显存的使用量。比如，即使在使用半精度计算的情况，每张12G显存的GPU上计算的词也不能超过2048个，这导致训练开销急剧增大。
+\parinterval 很多研究工作已经发现深层模型不同层之间的稠密连接能够很明显地提高信息传递的效率\upcite{WangLearning,DBLP:conf/cvpr/HuangLMW17,Dou2018ExploitingDR,DBLP:conf/acl/WuWXTGQLL19}。与此同时，对之前层信息的不断复用有助于得到更好的表示，但也带来了计算代价过大的问题。在动态线性层聚合方法（DLCL）中，每一次聚合时都需要重新计算之前每一层表示对当前层输入的贡献度，因此伴随着编码器整体深度的增加，这部分的计算代价变得不可忽略。例如，一个基于动态层聚合的48层Transformer模型比不使用动态层聚合进行训练慢近2倍。同时，缓存中间结果也增加了显存的使用量。比如，即使在使用半精度计算的情况下，每张12G显存的GPU上计算的词也不能超过2048个，这导致训练开销急剧增大。

 \parinterval 缓解这个问题的一种方法是使用更稀疏的层间连接方式。其核心思想与动态线性层聚合是类似的，不同点在于可以通过调整层之间连接的稠密程度来降低训练代价。比如，可以将每$p$层分为一组，之后动态线性层聚合只在不同组之间进行。这样，通过调节$p$值的大小可以控制神经网络中连接的稠密程度，作为一种训练代价与翻译性能之间的权衡。显然，标准的Transformer模型\upcite{vaswani2017attention} 和DLCL模型\upcite{WangLearning}都可以看作是该方法的一种特例。如图\ref{fig:15-16}所示：当$p=1$时，每一个单独的块被看作一个独立的组，它等价于基于动态层聚合的DLCL模型；当$p=\infty$时，它等价于正常的Transformer模型。值得注意的是，如果配合渐进式训练，在分组稠密连接中可以设置$p$等于模型层数。

@@ -762,7 +780,7 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}

 \subsubsection{3. 学习率重置}

-\parinterval 尽管渐进式训练策略与分组稠密连接结构可以加速深层模型的训练，但使用传统的学习率衰减策略会导致训练深层模型时的学习率较小，因此模型无法快速地达到收敛状态，同时也影响最终的模型性能。
+\parinterval 尽管渐进式训练策略与分组稠密连接结构都可以加速深层模型的训练，但使用传统的学习率衰减策略会导致训练深层模型时的学习率较小，因此模型无法快速地达到收敛状态，同时也影响最终的模型性能。

 \parinterval  图\ref{fig:15-17}中的红色曲线描绘了在WMT英德翻译任务上标准Transformer模型的学习率曲线，可以看到当模型训练到40k步时，学习率对比峰值有明显的差距，而此时刚开始训练最终的深层模型，过小的学习率并不利于后期深层网络的充分训练。

@@ -804,7 +822,7 @@ lr &=& d_{\textrm{model}}^{-0.5}\cdot step\_num^{-0.5}
 \subsection{深层模型的健壮性训练}


-\parinterval 伴随着网络的加深，还会面临另外一个比较严峻的问题\ \dash \ 过拟合。由于参数量的增大，深层模型的输入与输出分布之间的差异也会越来越大，然而不同子层之间的相互适应也会更加的明显，这将导致任意子层网络对其他子层的依赖过大。这种现象在训练阶段是有帮助的，因为不同子层可以协同工作从而更好地拟合训练数据。然而这种方式也降低了模型的泛化能力，即深层模型更容易陷入过拟合问题。
+\parinterval 伴随着网络的加深，模型的训练还会面临另外一个比较严峻的问题\ \dash \ 过拟合。由于参数量的增大，深层模型的输入与输出分布之间的差异也会越来越大，同时不同子层之间的相互适应也会更加明显，这将导致任意子层网络对其他子层的依赖过大。这种现象在训练阶段是有帮助的，因为不同子层可以协同工作从而更好地拟合训练数据。然而这种方式也降低了模型的泛化能力，即深层模型更容易陷入过拟合问题。

 \parinterval 通常，可以使用Dropout手段来缓解过拟合问题（见{\chapterthirteen}）。不幸的是，尽管目前Transformer模型使用了多种Dropout手段（如Residual Dropout、Attention Dropout、 ReLU Dropout等），过拟合问题在深层模型中仍然存在。从图\ref{fig:15-18}中可以看到，深层模型比浅层模型在训练集和校验集的困惑度上都有明显的优势，然而模型在训练一段时间后出现校验集困惑度上涨的现象，说明模型已经过拟合于训练数据。

@@ -818,7 +836,6 @@ lr &=& d_{\textrm{model}}^{-0.5}\cdot step\_num^{-0.5}
 %-------------------------------------------

 \parinterval {\chapterthirteen}提到的Layer Dropout方法可以有效地缓解这个问题。以编码器为例， Layer Dropout的过程可以被描述为：在训练过程中，对自注意力子层或前馈神经网络子层进行随机丢弃，以减少不同子层之间的相互适应。这里选择Pre-Norm结构作为基础架构，它可以被描述为：
-
 \begin{eqnarray}
 \mathbi{x}_{l+1} &=& F(\textrm{LN}(\mathbi{x}_l)) + \mathbi{x}_l
 \label{eq:15-48}
@@ -841,7 +858,7 @@ lr &=& d_{\textrm{model}}^{-0.5}\cdot step\_num^{-0.5}
 \end{figure}
 %-------------------------------------------

-\parinterval 除此之外，在残差网络中，研究人员已经发现底层神经网络的作用是对输入进行抽象表示，而上层神经网络进一步修正这种表示来拟合训练目标，因此底层神经网络对模型最终的输出有很大的影响\upcite{DBLP:journals/corr/GreffSS16}。该结论同样适用于Transformer模型，比如，在训练中，残差支路以及底层的梯度范数通常比较大，这也间接表明底层神经网络在整个优化的过程中需要更大的更新。考虑到这个因素，在设计每一个子层被丢弃的概率时，可以采用自底向上线性增大的策略，保证底层的神经网络相比于顶层更容易保留下来。
+\parinterval 除此之外，在残差网络中，研究人员已经发现底层神经网络的作用是对输入进行抽象表示，而上层神经网络会进一步修正这种表示来拟合训练目标，因此底层神经网络对模型最终的输出有很大的影响\upcite{DBLP:journals/corr/GreffSS16}。该结论同样适用于Transformer模型，比如，在训练中，残差支路以及底层的梯度范数通常比较大，这也间接表明底层神经网络在整个优化的过程中需要更大的更新。考虑到这个因素，在设计每一个子层被丢弃的概率时，可以采用自底向上线性增大的策略，保证底层的神经网络相比于顶层更容易保留下来。
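One way to read the bottom-up, linearly increasing drop schedule described above is sketched here in Python. The exact formula is not given in this hunk, so the form `max_drop * (index + 1) / num_sublayers` is an assumption, and both `drop_probabilities` and the `max_drop` parameter are hypothetical names.

```python
def drop_probabilities(num_sublayers: int, max_drop: float) -> list:
    """Per-sublayer drop probability, growing linearly from bottom to top.

    Index 0 is the bottom sublayer; the top sublayer gets `max_drop`.
    """
    return [max_drop * (index + 1) / num_sublayers for index in range(num_sublayers)]
```

With this schedule the bottom sublayers are kept most often, consistent with the observation that they have the largest influence on the final output.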

 %----------------------------------------------------------------------------------------
 %    NEW SECTION

--- a/Chapter16/chapter16.tex
+++ b/Chapter16/chapter16.tex
@@ -244,9 +244,9 @@
 \end{figure}
 %----------------------------------------------

-\parinterval 在神经机器翻译中，应用多任务学习的主要策略是将翻译任务作为主任务，同时设置一些仅使用单语数据的子任务，通过这些子任务来捕捉单语数据中的语言知识\upcite{DBLP:conf/emnlp/DomhanH17,DBLP:conf/emnlp/ZhangZ16,DBLP:journals/corr/LuongLSVK15}。一种多任务学习的方法是利用源语言单语数据，通过单个编码器对源语言数据进行建模，再分别使用两个解码器来学习源语言排序和翻译任务。源语言排序任务是指利用预排序规则对源语言句子中词的顺序进行调整\upcite{DBLP:conf/emnlp/WangCK07}，可以通过单语数据来构造训练数据，从而使编码器被训练得更加充分\upcite{DBLP:conf/emnlp/ZhangZ16}，如图\ref{fig:16-7}所示，图中$y_{<}$表示当前时刻之前的译文，$x_{<}$表示源语言句子中词的顺序调整后的句子。
+\parinterval 在神经机器翻译中，应用多任务学习的主要策略是将翻译任务作为主任务，同时设置一些仅使用单语数据的子任务，通过这些子任务来捕捉单语数据中的语言知识\upcite{DBLP:conf/emnlp/DomhanH17,DBLP:conf/emnlp/ZhangZ16,DBLP:journals/corr/LuongLSVK15}。一种多任务学习的方法是利用源语言单语数据，通过单个编码器对源语言数据进行建模，再分别使用两个解码器来学习源语言排序和翻译任务。源语言排序任务是指利用预排序规则对源语言句子中词的顺序进行调整\upcite{DBLP:conf/emnlp/WangCK07}，可以通过单语数据来构造训练数据，从而使编码器被训练得更加充分\upcite{DBLP:conf/emnlp/ZhangZ16}，如图\ref{fig:16-7}所示，图中$y_{<}$表示当前时刻之前的单词序列，$x_{<}$表示源语言句子中词的顺序调整后的句子。

-\parinterval 虽然神经机器翻译模型可以看作一种语言生成模型，但生成过程中却依赖于源语言信息，因此无法直接利用目标语言单语数据进行多任务学习。针对这个问题，可以对原有翻译模型结构进行修改，在解码器底层增加一个语言模型子层，这个子层用于学习语言模型任务，与编码器端是完全独立的，如图\ref{fig:16-8}所示\upcite{DBLP:conf/emnlp/DomhanH17}，图中$y_{<}$表示当前时刻之前的译文，$z_{<}$表示当前时刻之前的单语数据。在训练过程中，分别将双语数据和单语数据送入翻译模型和语言模型进行计算，双语数据训练产生的梯度用于对整个模型进行参数更新，而单语数据产生的梯度只对语言模型子层进行参数更新。
+\parinterval 虽然神经机器翻译模型可以看作一种语言生成模型，但生成过程中却依赖于源语言信息，因此无法直接利用目标语言单语数据进行多任务学习。针对这个问题，可以对原有翻译模型结构进行修改，在解码器底层增加一个语言模型子层，这个子层用于学习语言模型任务，与编码器端是完全独立的，如图\ref{fig:16-8}所示\upcite{DBLP:conf/emnlp/DomhanH17}，图中$y_{<}$表示当前时刻之前的单词序列，$z_{<}$表示当前时刻之前的单语数据。在训练过程中，分别将双语数据和单语数据送入翻译模型和语言模型进行计算，双语数据训练产生的梯度用于对整个模型进行参数更新，而单语数据产生的梯度只对语言模型子层进行参数更新。

 %----------------------------------------------
 \begin{figure}[htp]

--- a/Chapter17/Figures/figure-an-end-to-end-voice-translation-model-based-on-transformer.tex
+++ b/Chapter17/Figures/figure-an-end-to-end-voice-translation-model-based-on-transformer.tex
 \begin{tikzpicture}
-	\tikzstyle{layer}=[draw,rounded corners=2pt,font=\scriptsize,align=center,minimum width=5em]
+	\tikzstyle{layer}=[draw,rounded corners=2pt,font=\scriptsize,align=center,minimum width=7.1em]
 	\tikzstyle{word}=[font=\scriptsize]
-	
+%%%%encoder
 \node[layer,fill=red!20] (en_sa) at (0,0){Multi-Head \\ Attention};
-\node[layer,anchor=south,fill=green!20] (en_ffn) at ([yshift=1.4em]en_sa.north){Feed Forward \\ Network};
-\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=north] (en_add) at ([yshift=-1.4em]en_sa.south){};
-\draw[] (en_add.90) -- (en_add.-90);
-\draw[] (en_add.0) -- (en_add.180);
-\node[layer,anchor=north,fill=yellow!20] (en_cnn) at ([yshift=-1.4em]en_add.south){CNN};
+\node[anchor=south,layer,fill=yellow!20](en_add1) at ([yshift=1.0em]en_sa.north) {Add \& LayerNorm};
+\node[layer,anchor=south,fill=green!20] (en_ffn) at ([yshift=1.0em]en_add1.north){Feed Forward \\ Network};
+\node[anchor=south,layer,fill=yellow!20](en_add2) at ([yshift=1.0em]en_ffn.north) {Add \& LayerNorm};
+\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=north,thick] (en_add) at ([yshift=-1.4em]en_sa.south){};
+\draw[thick] (en_add.90) -- (en_add.-90);
+\draw[thick] (en_add.0) -- (en_add.180);
+\node[layer,anchor=north,fill=yellow!20] (en_cnn) at ([yshift=-1.0em]en_add.south){CNN};
+\node[anchor=east,font=\scriptsize,align=center] (en_pos) at ([xshift=-2em]en_add.west){位置编码};
+\node[anchor=north,font=\scriptsize,align=center] (en_input) at ([yshift=-1em]en_cnn.south){源语言语音特征\\(FBank/MFCC)};

+\draw[->,thick] (en_input.90) -- ([yshift=-0.1em]en_cnn.-90);
+\draw[->,thick] ([yshift=0.1em]en_cnn.90) -- ([yshift=-0.1em]en_add.-90);
+\draw[->,thick] ([yshift=0.1em]en_add.90) -- ([yshift=-0.1em]en_sa.-90);
+\draw[->,thick] ([yshift=0.1em]en_sa.90) -- ([yshift=-0.1em]en_add1.-90);
+\draw[->,thick] ([yshift=0.1em]en_add1.90) -- ([yshift=-0.1em]en_ffn.-90);
+\draw[->,thick] ([yshift=0.1em]en_ffn.90) --([yshift=-0.1em]en_add2.-90);
+\draw[->,rounded corners=2pt,thick] ([yshift=-0.6em]en_sa.south)--([yshift=-0.6em,xshift=-4.0em]en_sa.south)--([xshift=-0.43em]en_add1.west)--(en_add1.west);
+\draw[->,rounded corners=2pt,thick] ([yshift=-0.6em]en_ffn.south)--([yshift=-0.6em,xshift=-4.0em]en_ffn.south)--([xshift=-0.43em]en_add2.west)--(en_add2.west);

-\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=west] (de_add) at ([xshift=7em]en_add.east){};
-\draw[] (de_add.90) -- (de_add.-90);
-\draw[] (de_add.0) -- (de_add.180);
+%%%%decoder
+\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=west,thick] (de_add) at ([xshift=9em]en_add.east){};
+\draw[thick] (de_add.90) -- (de_add.-90);
+\draw[thick] (de_add.0) -- (de_add.180);
 \node[layer,anchor=south,fill=red!20] (de_sa) at ([yshift=1.4em]de_add.north){Masked \\Multi-Head\\Attention};
-\node[layer,anchor=south,fill=red!20] (de_ca) at ([yshift=1.4em]de_sa.north){Multi-Head \\ Attention};
-\node[layer,anchor=south,fill=green!20] (de_ffn) at ([yshift=1.4em]de_ca.north){Feed Forward \\ Network};
+\node[anchor=south,layer,fill=yellow!20](de_add1) at ([yshift=1.0em]de_sa.north) {Add \& LayerNorm};
+\node[layer,anchor=south,fill=red!20] (de_ca) at ([yshift=1.0em]de_add1.north){Multi-Head \\ Attention};
+\node[anchor=south,layer,fill=yellow!20](de_add2) at ([yshift=1.0em]de_ca.north) {Add \& LayerNorm};
+\node[layer,anchor=south,fill=green!20] (de_ffn) at ([yshift=1.0em]de_add2.north){Feed Forward \\ Network};
+\node[anchor=south,layer,fill=yellow!20](de_add3) at ([yshift=1.0em]de_ffn.north) {Add \& LayerNorm};
+\node[layer,anchor=south,fill=blue!20] (sf) at ([yshift=1.2em]de_add3.north){Softmax};
+\node[anchor=north,font=\scriptsize,align=center] (de_input) at ([yshift=-1.1em]de_add.south){目标语言文本\\编码表示};

-\node[layer,anchor=south,fill=blue!20] (sf) at ([yshift=1.6em]de_ffn.north){Softmax};
-%\node[layer,anchor=south,fill=orange!20] (output) at ([yshift=1.4em]sf.north){STLoss};
+\node[anchor=west,font=\scriptsize,align=center] (de_pos) at ([xshift=2em]de_add.east){位置编码};

-\node[anchor=north,font=\scriptsize,align=center] (en_input) at ([yshift=-1em]en_cnn.south){语音特征\\(FBank/MFCC)};
-\node[anchor=north,font=\scriptsize,align=center] (de_input) at ([yshift=-1.1em]de_add.south){标注文本\\编码表示};
+\draw[->,thick] (de_input.90) -- ([yshift=-0.1em]de_add.-90);
+\draw[->,thick] ([yshift=0.1em]de_add.90) -- ([yshift=-0.1em]de_sa.-90);
+\draw[->,thick] ([yshift=0.1em]de_sa.90) -- ([yshift=-0.1em]de_add1.-90);
+\draw[->,thick] ([yshift=0.1em]de_add1.90) -- ([yshift=-0.1em]de_ca.-90);
+\draw[->,thick] ([yshift=0.1em]de_ca.90) -- ([yshift=-0.1em]de_add2.-90);
+\draw[->,thick] ([yshift=0.1em]de_add2.90) -- ([yshift=-0.1em]de_ffn.-90);
+\draw[->,thick] ([yshift=0.1em]de_ffn.90) -- ([yshift=-0.1em]de_add3.-90);
+\draw[->,thick] ([yshift=0.1em]de_add3.90) -- ([yshift=-0.1em]sf.-90);
+\draw[->,thick] ([yshift=0.1em]sf.90) -- ([yshift=1.0em]sf.90);
+\draw[->,thick] ([xshift=0.1em]en_pos.0) -- ([xshift=-0.1em]en_add.180);
+\draw[->,thick] ([xshift=-0.1em]de_pos.180) -- ([xshift=0.1em]de_add.0);
+\draw[->,rounded corners=2pt,thick] ([yshift=-0.6em]de_sa.south)--([yshift=-0.6em,xshift=4.0em]de_sa.south)--([xshift=0.43em]de_add1.east)--(de_add1.east);
+\draw[->,rounded corners=2pt,thick] ([yshift=-0.6em]de_ca.south)--([yshift=-0.6em,xshift=4.0em]de_ca.south)--([xshift=0.43em]de_add2.east)--(de_add2.east);
+\draw[->,rounded corners=2pt,thick] ([yshift=-0.6em]de_ffn.south)--([yshift=-0.6em,xshift=4.0em]de_ffn.south)--([xshift=0.43em]de_add3.east)--(de_add3.east);
+
+\draw[->,rounded corners=2pt,thick] ([yshift=0.1em]en_add2.90) -- ([yshift=1.5em]en_add2.90) -- ([xshift=5.0em,yshift=1.5em]en_add2.90) -- ([xshift=-1.5em]de_ca.west) -- ([xshift=-0.1em]de_ca.west);

-\node[anchor=east,font=\scriptsize,align=center] (en_pos) at ([xshift=-2em]en_add.west){位置编码};
-\node[anchor=west,font=\scriptsize,align=center] (de_pos) at ([xshift=2em]de_add.east){位置编码};

-\draw[->] (en_input.90) -- ([yshift=-0.1em]en_cnn.-90);
-\draw[->] ([yshift=0.1em]en_cnn.90) -- ([yshift=-0.1em]en_add.-90);
-\draw[->] ([yshift=0.1em]en_add.90) -- ([yshift=-0.1em]en_sa.-90);
-\draw[->] ([yshift=0.1em]en_sa.90) -- ([yshift=-0.1em]en_ffn.-90);
-\draw[->] (de_input.90) -- ([yshift=-0.1em]de_add.-90);
-\draw[->] ([yshift=0.1em]de_add.90) -- ([yshift=-0.1em]de_sa.-90);
-\draw[->] ([yshift=0.1em]de_sa.90) -- ([yshift=-0.1em]de_ca.-90);
-\draw[->] ([yshift=0.1em]de_ca.90) -- ([yshift=-0.1em]de_ffn.-90);
-\draw[->] ([yshift=0.1em]de_ffn.90) -- ([yshift=-0.1em]sf.-90);
-\draw[->] ([yshift=0.1em]sf.90) -- ([yshift=1.5em]sf.90);
-\draw[->] ([xshift=0.1em]en_pos.0) -- ([xshift=-0.1em]en_add.180);
-\draw[->] ([xshift=-0.1em]de_pos.180) -- ([xshift=0.1em]de_add.0);
-\draw[->,rounded corners=2pt] ([yshift=0.1em]en_ffn.90) -- ([yshift=2em]en_ffn.90) -- ([xshift=4em,yshift=2em]en_ffn.90) -- ([xshift=-1.5em]de_ca.west) -- ([xshift=-0.1em]de_ca.west);
 \begin{pgfonlayer}{background}
-\node[draw=ugreen,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick][fit=(en_sa)(en_ffn)]{};
-\node[draw=red,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick][fit=(de_sa)(de_ca)(de_ffn)]{};
+\node[draw=ugreen,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick,xshift=-0.2em,yshift=-0.2em][fit=(en_add1)(en_add2)(en_sa)(en_ffn)](box1){};
+\node[draw=red,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick,xshift=0.2em,yshift=-0.2em][fit=(de_sa)(de_ca)(de_ffn)(de_add3)](box2){};
 \end{pgfonlayer}

 \node[anchor=east,font=\scriptsize,text=ugreen] at ([xshift=-0.1em]box1.west){$N \times$};
 \node[anchor=west,font=\scriptsize,text=red] at ([xshift=0.1em]box2.east){$\times N$};
 \node[anchor=east,font=\scriptsize] at ([xshift=-0.1em]en_cnn.west){$2 \times$};
-\node[anchor=east,font=\scriptsize,align=center,text=ugreen] at ([xshift=-0.1em,yshift=3em]box1.west){ST\\ 编码器};
-\node[anchor=west,font=\scriptsize,align=center,text=red] at ([xshift=0.1em,yshift=5em]box2.east){ST\\解码器};
+\node[anchor=east,font=\scriptsize,align=center,text=ugreen] at ([xshift=-0.1em,yshift=3em]box1.west){ST \\ 编码器};
+\node[anchor=west,font=\scriptsize,align=center,text=red] at ([xshift=0.1em,yshift=5em]box2.east){ST \\ 解码器};
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-application-of-multimodal-machine-translation-to-multitask-learning.tex
+++ b/Chapter17/Figures/figure-application-of-multimodal-machine-translation-to-multitask-learning.tex
@@ -10,6 +10,9 @@

 \node(figure)[draw=white,above of = decoder_right,yshift=6.5em,scale=0.25] {\includegraphics[width=0.62\textwidth]{./Chapter17/Figures/figure-bank-without-attention.jpg}};

+\node [anchor=south,scale=1.2] (node1) at ([xshift=-2.5em,yshift=4.5em]y.north) {\small{$x$: source-language text data}};
+\node [anchor=north,scale=1.2] (node2) at ([xshift=0.57em]node1.south){\small{$y$: target-language text data}};
+
 \draw[->,thick](x)to(encoder);
 \draw[->,thick](encoder)to(decoder_left)node[right,xshift=-0.1cm,yshift=-1.25cm,scale=1.2]{\small{Translation}};
 \draw[->,thick](decoder_left)to(y_hat);

--- a/Chapter17/Figures/figure-audio-processing.tex
+++ b/Chapter17/Figures/figure-audio-processing.tex
@@ -36,6 +36,7 @@
 \draw[->,very thick](process_2.east)to([xshift=1.8cm]process_2.east);
 %%%%audio
 \node(signal)[right of = process_2,xshift=5.5cm]{};
+\node(text_3)[below of = signal,yshift=-1.98cm,scale=1.3]{Speech signal};
 \draw[-,thick,]([xshift=-1.2cm]signal.center)--([xshift=1.2cm]signal.center);
 \draw[-,thick]([xshift=-1cm,yshift=-0.8cm]signal.center)--([xshift=-0.9cm,yshift=0.4cm]signal.center)--([xshift=-0.8cm,yshift=-0.3cm]signal.center)--([xshift=-0.7cm,yshift=0.7cm]signal.center)--([xshift=-0.6cm,yshift=-0.1cm]signal.center)--([xshift=-0.5cm,yshift=0.3cm]signal.center)--([xshift=-0.4cm,yshift=-0.5cm]signal.center)--([xshift=-0.3cm,yshift=0.7cm]signal.center)--([xshift=-0.2cm,yshift=-0.2cm]signal.center)--([xshift=-0.1cm,yshift=0.4cm]signal.center)--([xshift=0cm,yshift=-0.9cm]signal.center)--([xshift=0.1cm,yshift=0.5cm]signal.center)--([xshift=0.2cm,yshift=-0.4cm]signal.center)--([xshift=0.3cm,yshift=0.3cm]signal.center)--([xshift=0.4cm,yshift=-0.2cm]signal.center)--([xshift=0.5cm,yshift=0.1cm]signal.center)--([xshift=0.6cm,yshift=-0.8cm]signal.center)--([xshift=0.7cm,yshift=0.4cm]signal.center)--([xshift=0.8cm,yshift=-0.6cm]signal.center)--([xshift=0.9cm,yshift=0.7cm]signal.center)--([xshift=1cm,yshift=-0.2cm]signal.center);
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-speech-recognition-model-based-on-transformer.tex
+++ b/Chapter17/Figures/figure-speech-recognition-model-based-on-transformer.tex
 \begin{tikzpicture}
-	\tikzstyle{layer}=[draw,rounded corners=2pt,font=\scriptsize,align=center,minimum width=5em]
+	\tikzstyle{layer}=[draw,rounded corners=2pt,font=\scriptsize,align=center,minimum width=7.1em]
 	\tikzstyle{word}=[font=\scriptsize]
-	
+%%%%encoder
 \node[layer,fill=red!20] (en_sa) at (0,0){Multi-Head \\ Attention};
-\node[layer,anchor=south,fill=green!20] (en_ffn) at ([yshift=1.4em]en_sa.north){Feed Forward \\ Network};
-\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=north] (en_add) at ([yshift=-1.4em]en_sa.south){};
-\draw[] (en_add.90) -- (en_add.-90);
-\draw[] (en_add.0) -- (en_add.180);
-\node[layer,anchor=north,fill=yellow!20] (en_cnn) at ([yshift=-1.4em]en_add.south){CNN};
+\node[anchor=south,layer,fill=yellow!20](en_add1) at ([yshift=1.0em]en_sa.north) {Add \& LayerNorm};
+\node[layer,anchor=south,fill=green!20] (en_ffn) at ([yshift=1.0em]en_add1.north){Feed Forward \\ Network};
+\node[anchor=south,layer,fill=yellow!20](en_add2) at ([yshift=1.0em]en_ffn.north) {Add \& LayerNorm};
+\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=north,thick] (en_add) at ([yshift=-1.4em]en_sa.south){};
+\draw[thick] (en_add.90) -- (en_add.-90);
+\draw[thick] (en_add.0) -- (en_add.180);
+\node[layer,anchor=north,fill=yellow!20] (en_cnn) at ([yshift=-1.0em]en_add.south){CNN};
+\node[anchor=east,font=\scriptsize,align=center] (en_pos) at ([xshift=-2em]en_add.west){Positional \\ Encoding};
+\node[anchor=north,font=\scriptsize,align=center] (en_input) at ([yshift=-1em]en_cnn.south){Speech features\\(FBank/MFCC)};

+\draw[->,thick] (en_input.90) -- ([yshift=-0.1em]en_cnn.-90);
+\draw[->,thick] ([yshift=0.1em]en_cnn.90) -- ([yshift=-0.1em]en_add.-90);
+\draw[->,thick] ([yshift=0.1em]en_add.90) -- ([yshift=-0.1em]en_sa.-90);
+\draw[->,thick] ([yshift=0.1em]en_sa.90) -- ([yshift=-0.1em]en_add1.-90);
+\draw[->,thick] ([yshift=0.1em]en_add1.90) -- ([yshift=-0.1em]en_ffn.-90);
+\draw[->,thick] ([yshift=0.1em]en_ffn.90) --([yshift=-0.1em]en_add2.-90);
+\draw[->,rounded corners=2pt,thick] ([yshift=-0.6em]en_sa.south)--([yshift=-0.6em,xshift=-4.0em]en_sa.south)--([xshift=-0.43em]en_add1.west)--(en_add1.west);
+\draw[->,rounded corners=2pt,thick] ([yshift=-0.6em]en_ffn.south)--([yshift=-0.6em,xshift=-4.0em]en_ffn.south)--([xshift=-0.43em]en_add2.west)--(en_add2.west);

-\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=west] (de_add) at ([xshift=7em]en_add.east){};
-\draw[] (de_add.90) -- (de_add.-90);
-\draw[] (de_add.0) -- (de_add.180);
+%%%%decoder
+\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=west,thick] (de_add) at ([xshift=9em]en_add.east){};
+\draw[thick] (de_add.90) -- (de_add.-90);
+\draw[thick] (de_add.0) -- (de_add.180);
 \node[layer,anchor=south,fill=red!20] (de_sa) at ([yshift=1.4em]de_add.north){Masked \\Multi-Head\\Attention};
-\node[layer,anchor=south,fill=red!20] (de_ca) at ([yshift=1.4em]de_sa.north){Multi-Head \\ Attention};
-\node[layer,anchor=south,fill=green!20] (de_ffn) at ([yshift=1.4em]de_ca.north){Feed Forward \\ Network};
-
-\node[layer,anchor=south,fill=blue!20] (sf) at ([yshift=1.6em]de_ffn.north){Softmax};
-%\node[layer,anchor=south,fill=orange!20] (output) at ([yshift=1.4em]sf.north){Output Probabilities};
-
-\node[anchor=north,font=\scriptsize,align=center] (en_input) at ([yshift=-1em]en_cnn.south){语音特征\\(FBank/MFCC)};
+\node[anchor=south,layer,fill=yellow!20](de_add1) at ([yshift=1.0em]de_sa.north) {Add \& LayerNorm};
+\node[layer,anchor=south,fill=red!20] (de_ca) at ([yshift=1.0em]de_add1.north){Multi-Head \\ Attention};
+\node[anchor=south,layer,fill=yellow!20](de_add2) at ([yshift=1.0em]de_ca.north) {Add \& LayerNorm};
+\node[layer,anchor=south,fill=green!20] (de_ffn) at ([yshift=1.0em]de_add2.north){Feed Forward \\ Network};
+\node[anchor=south,layer,fill=yellow!20](de_add3) at ([yshift=1.0em]de_ffn.north) {Add \& LayerNorm};
+\node[layer,anchor=south,fill=blue!20] (sf) at ([yshift=1.2em]de_add3.north){Softmax};
 \node[anchor=north,font=\scriptsize,align=center] (de_input) at ([yshift=-1.1em]de_add.south){Encoded\\transcript text};

-\node[anchor=east,font=\scriptsize,align=center] (en_pos) at ([xshift=-2em]en_add.west){位置编码};
 \node[anchor=west,font=\scriptsize,align=center] (de_pos) at ([xshift=2em]de_add.east){Positional \\ Encoding};

-\draw[->] (en_input.90) -- ([yshift=-0.1em]en_cnn.-90);
-\draw[->] ([yshift=0.1em]en_cnn.90) -- ([yshift=-0.1em]en_add.-90);
-\draw[->] ([yshift=0.1em]en_add.90) -- ([yshift=-0.1em]en_sa.-90);
-\draw[->] ([yshift=0.1em]en_sa.90) -- ([yshift=-0.1em]en_ffn.-90);
-\draw[->] (de_input.90) -- ([yshift=-0.1em]de_add.-90);
-\draw[->] ([yshift=0.1em]de_add.90) -- ([yshift=-0.1em]de_sa.-90);
-\draw[->] ([yshift=0.1em]de_sa.90) -- ([yshift=-0.1em]de_ca.-90);
-\draw[->] ([yshift=0.1em]de_ca.90) -- ([yshift=-0.1em]de_ffn.-90);
-\draw[->] ([yshift=0.1em]de_ffn.90) -- ([yshift=-0.1em]sf.-90);
-\draw[->] ([yshift=0.1em]sf.90) -- ([yshift=1.5em]sf.90);
-\draw[->] ([xshift=0.1em]en_pos.0) -- ([xshift=-0.1em]en_add.180);
-\draw[->] ([xshift=-0.1em]de_pos.180) -- ([xshift=0.1em]de_add.0);
-\draw[->,rounded corners=2pt] ([yshift=0.1em]en_ffn.90) -- ([yshift=2em]en_ffn.90) -- ([xshift=4em,yshift=2em]en_ffn.90) -- ([xshift=-1.5em]de_ca.west) -- ([xshift=-0.1em]de_ca.west);
+\draw[->,thick] (de_input.90) -- ([yshift=-0.1em]de_add.-90);
+\draw[->,thick] ([yshift=0.1em]de_add.90) -- ([yshift=-0.1em]de_sa.-90);
+\draw[->,thick] ([yshift=0.1em]de_sa.90) -- ([yshift=-0.1em]de_add1.-90);
+\draw[->,thick] ([yshift=0.1em]de_add1.90) -- ([yshift=-0.1em]de_ca.-90);
+\draw[->,thick] ([yshift=0.1em]de_ca.90) -- ([yshift=-0.1em]de_add2.-90);
+\draw[->,thick] ([yshift=0.1em]de_add2.90) -- ([yshift=-0.1em]de_ffn.-90);
+\draw[->,thick] ([yshift=0.1em]de_ffn.90) -- ([yshift=-0.1em]de_add3.-90);
+\draw[->,thick] ([yshift=0.1em]de_add3.90) -- ([yshift=-0.1em]sf.-90);
+\draw[->,thick] ([yshift=0.1em]sf.90) -- ([yshift=1.0em]sf.90);
+\draw[->,thick] ([xshift=0.1em]en_pos.0) -- ([xshift=-0.1em]en_add.180);
+\draw[->,thick] ([xshift=-0.1em]de_pos.180) -- ([xshift=0.1em]de_add.0);
+\draw[->,rounded corners=2pt,thick] ([yshift=-0.6em]de_sa.south)--([yshift=-0.6em,xshift=4.0em]de_sa.south)--([xshift=0.43em]de_add1.east)--(de_add1.east);
+\draw[->,rounded corners=2pt,thick] ([yshift=-0.6em]de_ca.south)--([yshift=-0.6em,xshift=4.0em]de_ca.south)--([xshift=0.43em]de_add2.east)--(de_add2.east);
+\draw[->,rounded corners=2pt,thick] ([yshift=-0.6em]de_ffn.south)--([yshift=-0.6em,xshift=4.0em]de_ffn.south)--([xshift=0.43em]de_add3.east)--(de_add3.east);
+
+\draw[->,rounded corners=2pt,thick] ([yshift=0.1em]en_add2.90) -- ([yshift=1.5em]en_add2.90) -- ([xshift=5.0em,yshift=1.5em]en_add2.90) -- ([xshift=-1.5em]de_ca.west) -- ([xshift=-0.1em]de_ca.west);
+
+
 \begin{pgfonlayer}{background}
-\node[draw=ugreen,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick][fit=(en_sa)(en_ffn)](box1){};
-\node[draw=red,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick][fit=(de_sa)(de_ca)(de_ffn)](box2){};
+\node[draw=ugreen,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick,xshift=-0.2em,yshift=-0.2em][fit=(en_add1)(en_add2)(en_sa)(en_ffn)](box1){};
+\node[draw=red,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick,xshift=0.2em,yshift=-0.2em][fit=(de_sa)(de_ca)(de_ffn)(de_add3)](box2){};
 \end{pgfonlayer}

 \node[anchor=east,font=\scriptsize,text=ugreen] at ([xshift=-0.1em]box1.west){$N \times$};

--- a/Chapter17/Figures/figure-speech-translation-model-based-on-CTC.tex
+++ b/Chapter17/Figures/figure-speech-translation-model-based-on-CTC.tex
 \begin{tikzpicture}
-	\tikzstyle{layer}=[draw,rounded corners=2pt,font=\scriptsize,align=center,minimum width=5em]
+	\tikzstyle{layer}=[draw,rounded corners=2pt,font=\scriptsize,align=center,minimum width=7.1em]
 	\tikzstyle{word}=[font=\scriptsize]
-	
+%%%%encoder
 \node[layer,fill=red!20] (en_sa) at (0,0){Multi-Head \\ Attention};
-\node[layer,anchor=south,fill=green!20] (en_ffn) at ([yshift=1.4em]en_sa.north){Feed Forward \\ Network};
-\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=north] (en_add) at ([yshift=-1.4em]en_sa.south){};
-\draw[] (en_add.90) -- (en_add.-90);
-\draw[] (en_add.0) -- (en_add.180);
-\node[layer,anchor=north,fill=yellow!20] (en_cnn) at ([yshift=-1.4em]en_add.south){CNN};
+\node[anchor=south,layer,fill=yellow!20](en_add1) at ([yshift=1.0em]en_sa.north) {Add \& LayerNorm};
+\node[layer,anchor=south,fill=green!20] (en_ffn) at ([yshift=1.0em]en_add1.north){Feed Forward \\ Network};
+\node[anchor=south,layer,fill=yellow!20](en_add2) at ([yshift=1.0em]en_ffn.north) {Add \& LayerNorm};
+\node[layer,anchor=south,fill=blue!20] (en_sf) at ([yshift=2.4em]en_add2.north){Softmax};
+\node[layer,anchor=south,fill=orange!20] (en_output) at ([yshift=1.0em]en_sf.north){CTC Output};
+\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=north,thick] (en_add) at ([yshift=-1.4em]en_sa.south){};
+\draw[thick] (en_add.90) -- (en_add.-90);
+\draw[thick] (en_add.0) -- (en_add.180);
+\node[layer,anchor=north,fill=yellow!20] (en_cnn) at ([yshift=-1.0em]en_add.south){CNN};
+\node[anchor=east,font=\scriptsize,align=center] (en_pos) at ([xshift=-2em]en_add.west){Positional \\ Encoding};
+\node[anchor=north,font=\scriptsize,align=center] (en_input) at ([yshift=-1em]en_cnn.south){Source-language speech features\\(FBank/MFCC)};

+\draw[->,thick] (en_input.90) -- ([yshift=-0.1em]en_cnn.-90);
+\draw[->,thick] ([yshift=0.1em]en_cnn.90) -- ([yshift=-0.1em]en_add.-90);
+\draw[->,thick] ([yshift=0.1em]en_add.90) -- ([yshift=-0.1em]en_sa.-90);
+\draw[->,thick] ([yshift=0.1em]en_sa.90) -- ([yshift=-0.1em]en_add1.-90);
+\draw[->,thick] ([yshift=0.1em]en_add1.90) -- ([yshift=-0.1em]en_ffn.-90);
+\draw[->,thick] ([yshift=0.1em]en_ffn.90) --([yshift=-0.1em]en_add2.-90);
+\draw[->,thick] ([yshift=0.1em]en_add2.90) -- ([yshift=-0.1em]en_sf.-90);
+\draw[->,thick] ([yshift=0.1em]en_sf.90) -- ([yshift=-0.1em]en_output.-90);
+\draw[->,rounded corners=2pt,thick] ([yshift=-0.6em]en_sa.south)--([yshift=-0.6em,xshift=-4.0em]en_sa.south)--([xshift=-0.43em]en_add1.west)--(en_add1.west);
+\draw[->,rounded corners=2pt,thick] ([yshift=-0.6em]en_ffn.south)--([yshift=-0.6em,xshift=-4.0em]en_ffn.south)--([xshift=-0.43em]en_add2.west)--(en_add2.west);

-\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=west] (de_add) at ([xshift=7em]en_add.east){};
-\draw[] (de_add.90) -- (de_add.-90);
-\draw[] (de_add.0) -- (de_add.180);
+%%%%decoder
+\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=west,thick] (de_add) at ([xshift=9em]en_add.east){};
+\draw[thick] (de_add.90) -- (de_add.-90);
+\draw[thick] (de_add.0) -- (de_add.180);
 \node[layer,anchor=south,fill=red!20] (de_sa) at ([yshift=1.4em]de_add.north){Masked \\Multi-Head\\Attention};
-\node[layer,anchor=south,fill=red!20] (de_ca) at ([yshift=1.4em]de_sa.north){Multi-Head \\ Attention};
-\node[layer,anchor=south,fill=green!20] (de_ffn) at ([yshift=1.4em]de_ca.north){Feed Forward \\ Network};
+\node[anchor=south,layer,fill=yellow!20](de_add1) at ([yshift=1.0em]de_sa.north) {Add \& LayerNorm};
+\node[layer,anchor=south,fill=red!20] (de_ca) at ([yshift=1.0em]de_add1.north){Multi-Head \\ Attention};
+\node[anchor=south,layer,fill=yellow!20](de_add2) at ([yshift=1.0em]de_ca.north) {Add \& LayerNorm};
+\node[layer,anchor=south,fill=green!20] (de_ffn) at ([yshift=1.0em]de_add2.north){Feed Forward \\ Network};
+\node[anchor=south,layer,fill=yellow!20](de_add3) at ([yshift=1.0em]de_ffn.north) {Add \& LayerNorm};
+\node[layer,anchor=south,fill=blue!20] (sf) at ([yshift=1.2em]de_add3.north){Softmax};
+\node[anchor=north,font=\scriptsize,align=center] (de_input) at ([yshift=-1.1em]de_add.south){Encoded\\target-language text};

-\node[layer,anchor=south,fill=blue!20] (en_sf) at ([yshift=3em]en_ffn.north){Softmax};
-\node[layer,anchor=south,fill=blue!20] (sf) at ([yshift=2em]de_ffn.north){Softmax};
-\node[layer,anchor=south,fill=orange!20] (en_output) at ([yshift=1.4em]en_sf.north){CTC Output};
-%\node[layer,anchor=south,fill=orange!20] (output) at ([yshift=1.4em]sf.north){ST Output};
+\node[anchor=west,font=\scriptsize,align=center] (de_pos) at ([xshift=2em]de_add.east){Positional \\ Encoding};

-\node[anchor=north,font=\scriptsize,align=center] (en_input) at ([yshift=-1em]en_cnn.south){语音特征\\(FBank/MFCC)};
-\node[anchor=north,font=\scriptsize,align=center] (de_input) at ([yshift=-1em]de_add.south){标注文本\\编码表示};
+\draw[->,thick] (de_input.90) -- ([yshift=-0.1em]de_add.-90);
+\draw[->,thick] ([yshift=0.1em]de_add.90) -- ([yshift=-0.1em]de_sa.-90);
+\draw[->,thick] ([yshift=0.1em]de_sa.90) -- ([yshift=-0.1em]de_add1.-90);
+\draw[->,thick] ([yshift=0.1em]de_add1.90) -- ([yshift=-0.1em]de_ca.-90);
+\draw[->,thick] ([yshift=0.1em]de_ca.90) -- ([yshift=-0.1em]de_add2.-90);
+\draw[->,thick] ([yshift=0.1em]de_add2.90) -- ([yshift=-0.1em]de_ffn.-90);
+\draw[->,thick] ([yshift=0.1em]de_ffn.90) -- ([yshift=-0.1em]de_add3.-90);
+\draw[->,thick] ([yshift=0.1em]de_add3.90) -- ([yshift=-0.1em]sf.-90);
+\draw[->,thick] ([yshift=0.1em]sf.90) -- ([yshift=1.0em]sf.90);
+\draw[->,thick] ([xshift=0.1em]en_pos.0) -- ([xshift=-0.1em]en_add.180);
+\draw[->,thick] ([xshift=-0.1em]de_pos.180) -- ([xshift=0.1em]de_add.0);
+\draw[->,rounded corners=2pt,thick] ([yshift=-0.6em]de_sa.south)--([yshift=-0.6em,xshift=4.0em]de_sa.south)--([xshift=0.43em]de_add1.east)--(de_add1.east);
+\draw[->,rounded corners=2pt,thick] ([yshift=-0.6em]de_ca.south)--([yshift=-0.6em,xshift=4.0em]de_ca.south)--([xshift=0.43em]de_add2.east)--(de_add2.east);
+\draw[->,rounded corners=2pt,thick] ([yshift=-0.6em]de_ffn.south)--([yshift=-0.6em,xshift=4.0em]de_ffn.south)--([xshift=0.43em]de_add3.east)--(de_add3.east);
+
+\draw[->,rounded corners=2pt,thick] ([yshift=0.1em]en_add2.90) -- ([yshift=1.5em]en_add2.90) -- ([xshift=5.0em,yshift=1.5em]en_add2.90) -- ([xshift=-1.5em]de_ca.west) -- ([xshift=-0.1em]de_ca.west);

-\node[anchor=east,font=\scriptsize,align=center] (en_pos) at ([xshift=-2em]en_add.west){位置编码};
-\node[anchor=west,font=\scriptsize,align=center] (de_pos) at ([xshift=2em]de_add.east){位置编码};

-\draw[->] (en_input.90) -- ([yshift=-0.1em]en_cnn.-90);
-\draw[->] ([yshift=0.1em]en_cnn.90) -- ([yshift=-0.1em]en_add.-90);
-\draw[->] ([yshift=0.1em]en_add.90) -- ([yshift=-0.1em]en_sa.-90);
-\draw[->] ([yshift=0.1em]en_sa.90) -- ([yshift=-0.1em]en_ffn.-90);
-\draw[->] (de_input.90) -- ([yshift=-0.1em]de_add.-90);
-\draw[->] ([yshift=0.1em]de_add.90) -- ([yshift=-0.1em]de_sa.-90);
-\draw[->] ([yshift=0.1em]de_sa.90) -- ([yshift=-0.1em]de_ca.-90);
-\draw[->] ([yshift=0.1em]de_ca.90) -- ([yshift=-0.1em]de_ffn.-90);
-\draw[->] ([yshift=0.1em]en_ffn.90) -- ([yshift=-0.1em]en_sf.-90);
-\draw[->] ([yshift=0.1em]en_sf.90) -- ([yshift=-0.1em]en_output.-90);
-\draw[->] ([yshift=0.1em]de_ffn.90) -- ([yshift=-0.1em]sf.-90);
-\draw[->] ([yshift=0.1em]sf.90) -- ([yshift=1.5em]sf.90);
-\draw[->] ([xshift=0.1em]en_pos.0) -- ([xshift=-0.1em]en_add.180);
-\draw[->] ([xshift=-0.1em]de_pos.180) -- ([xshift=0.1em]de_add.0);
-\draw[->,rounded corners=2pt] ([yshift=2em]en_ffn.90) -- ([xshift=4em,yshift=2em]en_ffn.90) -- ([xshift=-1.5em]de_ca.west) -- ([xshift=-0.1em]de_ca.west);
 \begin{pgfonlayer}{background}
-\node[draw=ugreen,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick][fit=(en_sa)(en_ffn)]{};
-\node[draw=red,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick][fit=(de_sa)(de_ca)(de_ffn)]{};
+\node[draw=ugreen,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick,xshift=-0.2em,yshift=-0.2em][fit=(en_add1)(en_add2)(en_sa)(en_ffn)](box1){};
+\node[draw=red,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick,xshift=0.2em,yshift=-0.2em][fit=(de_sa)(de_ca)(de_ffn)(de_add3)](box2){};
 \end{pgfonlayer}

 \node[anchor=east,font=\scriptsize,text=ugreen] at ([xshift=-0.1em]box1.west){$N \times$};
 \node[anchor=west,font=\scriptsize,text=red] at ([xshift=0.1em]box2.east){$\times N$};
 \node[anchor=east,font=\scriptsize] at ([xshift=-0.1em]en_cnn.west){$2 \times$};
-\node[anchor=east,font=\scriptsize,align=center,text=ugreen] at ([xshift=-0.1em,yshift=3em]box1.west){ST\\ 编码器};
-\node[anchor=west,font=\scriptsize,align=center,text=red] at ([xshift=0.1em,yshift=5em]box2.east){ST\\解码器};
+\node[anchor=east,font=\scriptsize,align=center,text=ugreen] at ([xshift=-0.1em,yshift=3em]box1.west){ST \\ Encoder};
+\node[anchor=west,font=\scriptsize,align=center,text=red] at ([xshift=0.1em,yshift=5em]box2.east){ST \\ Decoder};
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/chapter17.tex
+++ b/Chapter17/chapter17.tex
@@ -54,7 +54,7 @@
 %----------------------------------------------------------------------------------------
 \section{Speech Translation}

-\parinterval 语音，是人类交流中最常用的一种信息载体。从日常聊天、出国旅游，到国际会议、跨国合作，对于语音进行翻译的需求不断增加。甚至在有些场景下，用语音进行交互要比用文本进行交互频繁得多。因此，{\small\bfnew{语音翻译}}\index{语音翻译}（Speech Translation）\index{Speech Translation}也成为了语音处理和机器翻译相结合的重要产物。根据目标语言的载体类型，可以将语音翻译分为{\small\bfnew{语音到文本翻译}}\index{语音到文本翻译}（Speech-to-Text Translation）\index{Speech-to-Text Translation}和{\small\bfnew{语音到语音翻译}}\index{语音到语音翻译}（Speech-to-Speech Translation）\index{Speech-to-Speech Translation}；基于翻译的实时性，还可以分为{\small\bfnew{实时语音翻译}}\index{实时语音翻译}（即同声传译，Simultaneous Translation）\index{Simultaneous Translation}和{\small\bfnew{离线语音翻译}}（Offline Speech Translation）\index{离线语音翻译}\index{Offline Speech Translation}。本节主要关注离线语音到文本翻译方法（简称为语音翻译），分别从音频处理、级联语音翻译和端到端语音翻译几个角度开展讨论。
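+\parinterval Speech is one of the most commonly used carriers of information in human communication. From everyday conversation and travel abroad to international conferences and cross-border cooperation, the demand for speech translation keeps growing. In some scenarios, people interact by speech far more often than by text. {\small\bfnew{Speech translation}}\index{Speech Translation} has therefore become an important product of combining speech processing with machine translation. According to the modality of the target language, speech translation can be divided into {\small\bfnew{speech-to-text translation}}\index{Speech-to-Text Translation} and {\small\bfnew{speech-to-speech translation}}\index{Speech-to-Speech Translation}; according to whether translation happens in real time, it can also be divided into {\small\bfnew{simultaneous translation}}\index{Simultaneous Translation} (real-time speech translation, i.e., simultaneous interpretation) and {\small\bfnew{offline speech translation}}\index{Offline Speech Translation}. This section focuses on offline speech-to-text translation (referred to simply as speech translation) and discusses it from three angles: audio processing, cascaded speech translation, and end-to-end speech translation.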

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -75,7 +75,7 @@

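 \parinterval As the description above shows, audio is in fact represented as a very long sequence of sample points, which makes it difficult to process audio directly with existing deep learning techniques. Moreover, the raw audio signal may contain considerable noise, ambient sound, or redundant information that can interfere with the model. Audio is therefore usually processed to extract acoustic features: the long sequence of sample points is converted into a shorter sequence of feature vectors, which is then passed to downstream systems. Although some work models acoustics and trains directly on the raw sample sequence without feature extraction\upcite{DBLP:conf/interspeech/SainathWSWV15}, the mainstream approach is still to model on acoustic features\upcite{DBLP:conf/icassp/MohamedHP12}.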

-\parinterval 声学特征提取的第一步是预处理。其流程主要是对音频进行预加重、分帧和加窗。预加重用来提升音频信号中的高频部分，目的是使频谱更加平滑。分帧（原理如图\ref{fig:17-3}所示）是基于短时平稳假设，即根据生物学特征，语音信号是一个缓慢变化的过程，10ms$\thicksim$30ms的信号片段是相对平稳的。基于这个假设，一般将每25ms作为一帧来提取特征，这个时间称为{\small\bfnew{帧长}}\index{帧长}（Frame Length）\index{Frame Length}。同时，为了保证不同帧之间的信号平滑性，使每两个相邻帧之间存在一定的重合部分。一般每隔10ms取一帧，这个时长称为{\small\bfnew{帧移}}\index{帧移}（Frame Shift）\index{Frame Shift}。为了缓解分帧带来的频谱泄漏，对每帧的信号进行加窗处理使其幅度在两段渐变到0，一般采用的是{\small\bfnew{汉明窗}}\index{汉明窗}（Hamming）\index{Hamming}\upcite{洪青阳2020语音识别原理与应用}。
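+\parinterval The first step of acoustic feature extraction is preprocessing, which mainly consists of pre-emphasis, framing, and windowing. Pre-emphasis boosts the high-frequency part of the audio signal to offset the attenuation of high frequencies in speech, making the spectrum flatter. Framing (illustrated in Figure \ref{fig:17-3}) relies on the short-time stationarity assumption: because of the physiology of speech production, a speech signal changes slowly, so a segment of 10ms$\thicksim$30ms is relatively stationary. Based on this assumption, features are usually extracted from frames of 25ms each; this duration is called the {\small\bfnew{frame length}}\index{Frame Length}. To keep the signal smooth across frames, adjacent frames overlap. A new frame is usually taken every 10ms, and this interval is called the {\small\bfnew{frame shift}}\index{Frame Shift}. To reduce the spectral leakage caused by framing, each frame is multiplied by a window function so that its amplitude tapers toward 0 at both ends; the {\small\bfnew{Hamming window}}\index{Hamming} is the usual choice\upcite{洪青阳2020语音识别原理与应用}.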
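+
+\parinterval The preprocessing steps above can be sketched with a few formulas. This is only an illustration: the symbols $x$, $y$, $\alpha$, $w$, $N$, $D$ and $T$ are introduced here and do not appear in the original text, and the sketch assumes a typical pre-emphasis coefficient $\alpha \approx 0.97$, a frame of $N$ samples, and an utterance lasting $D \ge 25$ milliseconds:
+\begin{eqnarray}
+y(n) & = & x(n) - \alpha \cdot x(n-1) \nonumber \\
+w(n) & = & 0.54 - 0.46 \cos \Big( \frac{2 \pi n}{N-1} \Big), \quad 0 \le n \le N-1 \nonumber \\
+T & = & \Big\lfloor \frac{D - 25}{10} \Big\rfloor + 1 \nonumber
+\end{eqnarray}
+
+\noindent Here $y(n)$ is the pre-emphasized signal, $w(n)$ is the Hamming window applied to each frame, and $T$ is the number of frames obtained with a 25ms frame length and a 10ms frame shift. Under these assumptions, one second of audio ($D = 1000$) gives $T = 98$ frames, and the Hamming window falls to $0.08$ rather than exactly $0$ at the frame boundaries.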
 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
 \centering
@@ -85,9 +85,7 @@
 \end{figure}
 %----------------------------------------------------------------------------------------------------

-\parinterval 经过了上述的预处理操作，可以得到音频对应的帧序列，之后通过不同的操作来提取不同类型的声学特征。常用的声学特征包括{\small\bfnew{Mel频率倒谱系数}}\index{Mel频率倒谱系数}（Mel-frequency Cepstral Coefficient，MFCC）\index{Mel-Frequency Cepstral Coefficient}、{\small\bfnew{感知线性预测系数}}\index{感知线性预测系数}（Perceptual Lienar Predictive，PLP）\index{Perceptual Lienar Predictive}、{\small\bfnew{滤波器组}}\index{滤波器组}（Filter-bank，Fbank）\index{Filter-bank}等。MFCC、PLP和Fbank特征都需要对预处理后的音频做{\small\bfnew{短时傅里叶变换}}\index{短时傅里叶变换}（Short-time Fourier Tranform，STFT）\index{Short-time Fourier Tranform}，得到具有规律的线性分辨率。之后再经过特定的操作，得到各种声学特征。不同声学特征的特点是不同的，MFCC去相关性较好，PLP抗噪性强，FBank可以保留更多的语音原始特征。在语音翻译中，比较常用的声学特征为FBank或MFCC\upcite{洪青阳2020语音识别原理与应用}。
-
-\parinterval 实际上，提取到的声学特征可以类比于计算机视觉中的像素特征，或者自然语言处理中的词嵌入表示。不同之处在于，声学特征更加复杂多变，可能存在着较多的噪声和冗余信息。此外，相比对应的文字序列，音频提取到的特征序列长度要大十倍以上。比如，人类正常交流中每秒钟一般可以说2-3个字，而每秒钟的语音可以提取得到100帧的特征序列。巨大的长度比差异也为声学特征建模带来了挑战。
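+\parinterval The preprocessing above yields the frame sequence of the audio, from which different types of acoustic features are extracted by different operations. In speech translation, the commonly used acoustic features are {\small\bfnew{filter-bank}}\index{Filter-bank} (Fbank) features and {\small\bfnew{Mel-frequency cepstral coefficients}}\index{Mel-Frequency Cepstral Coefficient} (MFCC)\upcite{洪青阳2020语音识别原理与应用}. The extracted acoustic features are analogous to pixel features in computer vision or word embeddings in natural language processing. The difference is that acoustic features are more complex and variable, and may contain more noise and redundant information. In addition, the feature sequence extracted from audio is more than ten times longer than the corresponding text. For example, people typically speak 2-3 characters per second in normal conversation, whereas one second of speech yields a feature sequence of about 100 frames. This large length gap also makes acoustic feature modeling challenging.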

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -109,15 +107,9 @@

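 \parinterval Acoustic feature extraction was described in the previous section, and text translation can directly use the statistical or neural machine translation methods introduced in this book. The following therefore gives a brief introduction to speech recognition models, so that readers have a complete picture of cascaded speech translation systems. Some of these concepts also appear in the end-to-end speech translation introduced later.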

-%----------------------------------------------------------------------------------------
-%    NEW SUBSUB-SECTION
-%----------------------------------------------------------------------------------------
-
-\subsubsection{1. 语音识别方法}
-
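 \parinterval Traditional speech recognition models resemble statistical machine translation: they combine an acoustic model, a language model, and a pronunciation lexicon, which makes the system rather complex\upcite{DBLP:journals/ftsig/GalesY07,DBLP:journals/taslp/MohamedDH12,DBLP:journals/spm/X12a}. In recent years, with the development of neural networks, end-to-end neural speech recognition models have attracted growing attention and have greatly simplified the training pipeline\upcite{DBLP:conf/nips/ChorowskiBSCB15,DBLP:conf/icassp/ChanJLV16}. Current end-to-end speech recognition models are mainly based on the sequence-to-sequence architecture: the encoder extracts higher-level features from the input acoustic features, and the decoder recognizes the corresponding text from those features. The end-to-end speech translation models introduced later use a very similar architecture. In this sense, the end-to-end methods used for speech recognition and speech translation are consistent with neural machine translation.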

-\parinterval 语音识别目前广泛使用基于Transformer的模型结构（见{\chaptertwelve}），如图\ref{fig:17-5}所示。可以看出，相比文本翻译，模型结构上唯一的区别在于编码器的输入为声学特征，以及编码器底层会使用额外的卷积层来减小输入序列的长度，从而降低长序列带来的显存占用以及建模难度。由于语音对应的特征序列过长，在计算注意力模型的时候，会占用大量的内存/显存，并增加训练时间。因此，通常会先对语音特征做一个下采样，缩小语音的序列长度。目前一个常用的做法，是在输入的语音特征上进行两层步长为2的卷积操作，从而将输入序列的长度缩小为之前的1/4。 通过大量的语音-标注平行数据对模型进行训练，可以得到高质量的语音识别模型。
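+\parinterval Speech recognition now widely uses Transformer-based model architectures (see {\chaptertwelve}), as shown in Figure \ref{fig:17-5}. Compared with text translation, the only architectural differences are that the encoder takes acoustic features as input and that additional convolutional layers at the bottom of the encoder shorten the input sequence. The reason is that the feature sequence of speech is very long, so computing attention over it consumes a large amount of memory (host or GPU) and increases training time. A common practice is therefore to apply two convolutional layers with stride 2 to the speech features, which shrinks the input sequence to 1/4 of its original length. Training the model on a large amount of parallel speech-transcript data yields a high-quality speech recognition model.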
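+
+\parinterval The length reduction achieved by the two convolutional layers can be sketched as follows. This is only an illustration under hyper-parameters that the text does not specify: each layer is assumed to use kernel size $k=3$, padding $p=1$ and stride $s=2$, and $T$ denotes the number of input frames (all of these symbols are introduced here):
+\begin{eqnarray}
+T' & = & \Big\lfloor \frac{T + 2p - k}{s} \Big\rfloor + 1 \ = \ \Big\lceil \frac{T}{2} \Big\rceil \nonumber \\
+T'' & = & \Big\lceil \frac{T'}{2} \Big\rceil \ \approx \ \frac{T}{4} \nonumber
+\end{eqnarray}
+
+\noindent For example, $T=100$ frames (about one second of speech) become $T'=50$ and then $T''=25$ positions. Because self-attention compares every pair of positions, its cost grows roughly with the square of the sequence length, so under these assumptions the downsampling reduces that cost by a factor of about 16.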

 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
@@ -128,26 +120,7 @@
 \end{figure}
 %----------------------------------------------------------------------------------------------------

-%----------------------------------------------------------------------------------------
-%    NEW SUBSUB-SECTION
-%----------------------------------------------------------------------------------------
-
-\subsubsection{2. 语音识别结果的表示}
-
-\parinterval 级联语音翻译模型利用翻译模型将语音识别结果翻译为目标语言文本，但存在的一个问题是语音识别模型只输出One-best，其中可能存在一些识别错误，这些错误在翻译过程中会被放大，也就是错误传播问题。传统级联语音模型的一个主要方向是丰富语音识别模型的预测结果，为翻译模型提供更多的信息，具体做法是在语音识别模型中，声学模型解码得到词格来取代One-best 识别结果。词格是一种有向无环图，包含单个起点和终点，图中的每条边记录了每个词和对应的转移概率，如图\ref{fig:17-6}所示。
-
-%----------------------------------------------------------------------------------------------------
-\begin{figure}[htp]
-\centering
-\input{./Chapter17/Figures/figure-word-lattice.tex}
-\caption{词格示例}
-\label{fig:17-6}
-\end{figure}
-%----------------------------------------------------------------------------------------------------
-
-\parinterval 可以看出，词格可以保存多条搜索路径，路径中保存了输入序列的时间信息以及解码过程。翻译模型基于词格进行翻译，可以降低语音识别模型带来的误差\upcite{DBLP:conf/acl/ZhangGCF19,DBLP:conf/acl/SperberNPW19}。但在端到端语音识别模型中，一般基于束搜索方法进行解码，且解码序列的长度与输入序列并不匹配，相比传统声学模型解码丢失了语音的时间信息，因此这种基于词格的方法主要集中在传统语音识别系统上。
-
-\parinterval 为了降低错误传播问题带来的影响，一种思路是通过一个后处理模型修正识别结果中的错误，再送给文本翻译模型进行翻译。也可以进一步对文本做{\small\bfnew{顺滑}}\index{顺滑}（Disfluency Detection\index{Disfluency Detection}）处理，使得送给翻译系统的文本更加干净、流畅，比如除去一些导致停顿的语气词。这一做法在工业界得到了广泛应用，但由于每个模型只能串行地计算，也会带来额外的计算代价以及运算时间。另外一种思路是训练更加健壮的文本翻译模型，使其可以处理输入中存在的噪声或误差\upcite{DBLP:conf/acl/LiuTMCZ18}。 
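+\parinterval To reduce the impact of speech recognition errors on downstream systems, a word lattice is often used in place of the one-best recognition result. Another approach is to correct errors in the recognition result with a post-processing model before passing it to the text translation model. The text can additionally go through {\small\bfnew{disfluency detection}}\index{Disfluency Detection}, which makes the input to the translation system cleaner and more fluent, for example by removing filler words that cause pauses. This practice is widely used in industry, but because the models can only run one after another, it adds computational cost and latency. A further approach is to train a more robust text translation model that can handle noise or errors in its input\upcite{DBLP:conf/acl/LiuTMCZ18}.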

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -162,7 +135,7 @@
    \vspace{0.5em}
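    \item {\small\bfnew{Error propagation}}. A serious problem with cascaded models is that errors in the text produced by the speech recognition model are likely to be amplified during translation, causing large deviations in the final translation. For example, if recognition drops the sentence-final particle ``吗'', the translation model will translate a question as a statement.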
    \vspace{0.5em}
-    \item {\small\bfnew{翻译效率问题}}。由于需要语音识别模型和文本标注模型只能串行地计算，翻译效率相对较低，而实际很多场景中都需要达到低延时的翻译。
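+    \item {\small\bfnew{Translation efficiency}}. Because the speech recognition model and the text translation model can only run sequentially, translation efficiency is relatively low, while many real-world scenarios require low-latency translation.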
    \vspace{0.5em}
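    \item {\small\bfnew{Loss of paralinguistic information in speech}}. When speech is recognized as text, information such as tone, emotion, and pitch is lost, yet the same sentence can mean different things when spoken in different tones. In practice, moreover, speech recognition output usually contains no punctuation, so an additional post-processing model is needed to restore it, which adds further computational cost.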
    \vspace{0.5em}
@@ -215,7 +188,7 @@

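 \parinterval One approach is multi-task learning, which gives the model more supervision during training. Using multiple tasks to strengthen the main task (machine translation) is also covered in {\chapterfifteen} and {\chaptersixteen} of this book. From this perspective, many problems in machine translation are solved by the same means.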

-\parinterval 语音语言中多任务学习主要借助语音对应的标注信息，也就是源语言文本。{\small\bfnew{连接时序分类}}\index{连接时序分类}（Connectionist Temporal Classification，CTC）\index{Connectionist Temporal Classification}\upcite{DBLP:conf/icml/GravesFGS06}是语音处理中最简单有效的一种多任务学习方法\upcite{DBLP:journals/jstsp/WatanabeHKHH17,DBLP:conf/icassp/KimHW17}，也被广泛应用于文本识别任务中\upcite{DBLP:journals/pami/ShiBY17}。CTC可以将输入序列的每一位置都对应到标注文本中，学习语音和文字之间的软对齐关系。比如，对于下面的音频序列，CTC可以将每个位置分别对应到同一个词。需要注意的是，CTC会额外新增一个词$\epsilon$，类似于一个空白词，表示这个位置没有声音或者没有任何对应的预测结果。然后，将相同且连续的词合并，去除$\epsilon$，就可以得到预测结果，如图\ref{fig:17-8} 所示。
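+\parinterval Multi-task learning in speech translation mainly relies on the transcript of the speech, that is, the source-language text. {\small\bfnew{Connectionist temporal classification}}\index{Connectionist Temporal Classification} (CTC)\upcite{DBLP:conf/icml/GravesFGS06} is one of the simplest and most effective multi-task learning methods in speech processing\upcite{DBLP:journals/jstsp/WatanabeHKHH17,DBLP:conf/icassp/KimHW17}, and it is also widely used in text recognition\upcite{DBLP:journals/pami/ShiBY17}. CTC maps every position of the input sequence to the transcript, learning a soft alignment between speech and text. For the audio sequence below, for example, CTC can map several positions to the same word. Note that CTC adds an extra symbol $\epsilon$, similar to a blank, which indicates that a position contains no sound or has no corresponding prediction. Once the alignment is complete, merging identical consecutive symbols and then removing $\epsilon$ gives the prediction, as shown in Figure \ref{fig:17-8}.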
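+
+\parinterval The merging step and the resulting training objective can be sketched in the standard CTC formulation. The notation is introduced here for illustration and is not taken from the original text: $x$ is the sequence of $T$ acoustic feature frames, $y$ is the transcript, $\pi = \pi_1 ... \pi_T$ is a frame-level alignment over the vocabulary plus $\epsilon$, and $\mathcal{B}$ is the mapping that merges identical consecutive symbols and then removes $\epsilon$. The alignment in the first line is a made-up example and need not match the one in the figure:
+\begin{eqnarray}
+\mathcal{B}(\textrm{h}\ \textrm{h}\ \epsilon\ \textrm{e}\ \textrm{l}\ \textrm{l}\ \epsilon\ \textrm{l}\ \textrm{o}) & = & \textrm{h}\ \textrm{e}\ \textrm{l}\ \textrm{l}\ \textrm{o} \nonumber \\
+\textrm{P}(y|x) & = & \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} \textrm{P}(\pi_t|x) \nonumber \\
+L_{\textrm{CTC}} & = & - \log \textrm{P}(y|x) \nonumber
+\end{eqnarray}
+
+\noindent The $\epsilon$ between the two runs of ``l'' is what keeps the double letter from being merged into one. The probability of a transcript sums over all alignments that $\mathcal{B}$ maps to it, so no frame-level annotation is needed for training.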

 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
@@ -226,12 +199,12 @@
 \end{figure}
 %----------------------------------------------------------------------------------------------------

-\parinterval CTC的一些特性使其可以很好的完成输入输出之间的对齐，例如
+\parinterval Several properties of CTC allow it to align inputs and outputs well, for example:

 %----------------------------------------------------------------------------------------------------
 \begin{itemize}
    \vspace{0.5em}
-    \item {\small\bfnew{输入和输出之间的对齐是单调的}}。也就是后面的输入只会预测与前面的序列相同或后面的输出内容。比如对于图\ref{fig:17-8}中的例子，如果输入的位置t已经预测了字符l，那么t之后的位置不会再预测前面的字符h和e。
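+    \item {\small\bfnew{The alignment between input and output is monotonic}}. That is, a later input position can only be aligned to the same output as the preceding position or to a later output. In the example of Figure \ref{fig:17-8}, once input position $t$ has been aligned to the character ``l'', the positions after $t$ will not be aligned to the earlier characters ``h'' and ``e''.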
    \vspace{0.5em}
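    \item {\small\bfnew{The relation between input and output is many-to-one}}. That is, several inputs are mapped to the same output. This is natural for speech: each input position covers only a very short span of speech features, so it takes several inputs to correspond to one output character.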
    \vspace{0.5em}
@@ -249,7 +222,7 @@
 \end{figure}
 %----------------------------------------------------------------------------------------------------

-\parinterval 另外一种多任务学习的思想是通过两个解码器，分别预测语音对应的源语言句子和目标语言句子，具体有图\ref{fig:17-10}展示的三种方式\upcite{DBLP:conf/naacl/AnastasopoulosC18,DBLP:conf/asru/BaharBN19}。图\ref{fig:17-10}(a)中采用单编码器-双解码器的方式，两个解码器根据编码器的表示，分别预测源语言句子和目标语言句子，从而使编码器训练地更加充分。这种做法的好处在于源语言文的文本生成任务成可以辅助翻译过程，相当于为源语言语音提供了额外的“模态”信息。图\ref{fig:17-10}(b)则通过使用两个级联的解码器，先利用第一个解码器生成源语言句子，然后再利用第一个解码器的表示，通过第二个解码器生成目标语言句子。这种方法通过增加一个中间输出，降低了模型的训练难度，但同时也会带来额外的解码耗时，因为两个解码器需要串行地进行生成。图\ref{fig:17-10}(c) 中模型更进一步，第二个编码器联合编码器和第一个解码器的表示进行生成，更充分地利用了已有信息。
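+\parinterval Another multi-task learning idea is to use two decoders that predict the source-language sentence and the target-language sentence of the speech respectively. Figure \ref{fig:17-10} shows three ways to do this\upcite{DBLP:conf/naacl/AnastasopoulosC18,DBLP:conf/asru/BaharBN19}. Figure \ref{fig:17-10}(a) uses one encoder and two decoders: both decoders predict from the encoder representation, one the source-language sentence and the other the target-language sentence, so that the encoder is trained more thoroughly. The benefit is that the source-language text generation task can assist translation, in effect providing extra ``modality'' information for the source-language speech. Figure \ref{fig:17-10}(b) uses two cascaded decoders: the first decoder generates the source-language sentence, and the second decoder then generates the target-language sentence from the representation of the first. By adding an intermediate output, this method makes the model easier to train, but it also increases decoding time because the two decoders must generate sequentially. The model in Figure \ref{fig:17-10}(c) goes a step further: the second decoder generates from the representations of both the encoder and the first decoder, making fuller use of the available information.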
 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
 \centering
@@ -283,7 +256,7 @@

 \section{Image Translation}

-\parinterval 在人类所接受的信息中，视觉信息的比重往往不亚于语音和文本信息，甚至更多。视觉信息通常以图像的形式存在，近几年，结合图像的多模态机器翻译受到了广泛的关注。多模态机器翻译（图\ref{fig:17-11} (a)）简单来说就是结合源语言和其他模态（例如图像等）的信息生成目标语言的过程。这种结合图像的机器翻译还是一种狭义上的“翻译”，它本质上还是从源语言到目标语言或者说从文本到文本的翻译。那么从图像到文本（图\ref{fig:17-11}(b)）的转换，即给定图像生成与图像内容相关的描述，也可以被称为广义上的“翻译”。例如，{\small\bfnew{图片描述生成}}\index{图片描述生成}（Image Captioning）\index{Image Captioning}就是一种典型的图像到文本的翻译。当然，这种广义上的翻译形式不仅仅包括图像到文本的转换，还可以包括从图像到图像的转换（图\ref{fig:17-11}(c)），甚至是从文本到图像的转换（图\ref{fig:17-11}(d)）等等。这里将这些与图像相关的翻译任务统称为图像翻译。
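+\parinterval Among the information that humans receive, visual information often accounts for no less than speech and text, and sometimes more. Visual information usually takes the form of images, and in recent years multimodal machine translation that incorporates images has received wide attention. Put simply, multimodal machine translation (Figure \ref{fig:17-11}(a)) generates the target language from the source language combined with information from other modalities, such as images. Machine translation with images is still ``translation'' in the narrow sense: in essence it remains translation from source language to target language, that is, from text to text. In fact, conversion from image to text (Figure \ref{fig:17-11}(b)), that is, generating a description related to the content of a given image, can also be called ``translation'' in a broad sense. For example, {\small\bfnew{image captioning}}\index{Image Captioning} is a typical image-to-text translation. This broad sense of translation covers not only image-to-text conversion but also image-to-image conversion (Figure \ref{fig:17-11}(c)) and even text-to-image conversion (Figure \ref{fig:17-11}(d)). Here all these image-related translation tasks are collectively called image translation.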

 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
@@ -301,7 +274,7 @@
 \subsection{Image-Augmented Text Translation}
 \label{sec:image-augmented-translation}

-\parinterval 在文本翻译中引入图像信息是最典型的多模态机器翻译任务。虽然多模态机器翻译还是一种从源语言文本到目标语言文本的转换，但是在转换的过程中，融入了其他模态的信息减少了歧义的产生。例如前文提到的通过与源语言相关的图像信息，将“A medium sized  child jumps off of a dusty bank”中“bank”翻译为“河岸”而不是“银行”，因为图像中出现了河岸，因此“bank”的歧义大大降低。换句话说，对于同一图像或者视觉场景的描述，源语言和目标语言描述的信息是一致的，只不过，体现在不同语言上会有表达方法上的差异。那么，图像就会存在一些源语言和目标语言的隐含对齐“约束”，而这种“约束”可以捕捉语言中不易表达的隐含信息。
+\parinterval 在文本翻译中引入图像信息是最典型的多模态机器翻译任务。虽然多模态机器翻译还是一种从源语言文本到目标语言文本的转换，但是在转换的过程中，融入了其他模态的信息减少了歧义的产生。例如前文提到的通过与源语言相关的图像信息，将“A girl jumps off a bank .”中“bank”翻译为“河岸”而不是“银行”，因为图像中出现了河岸，因此“bank”的歧义大大降低。换句话说，对于同一图像或者视觉场景的描述，源语言和目标语言描述的信息是一致的，只不过，体现在不同语言上会有表达方法上的差异。那么，图像就会存在一些源语言和目标语言的隐含对齐“约束”，而这种“约束”可以捕捉语言中不易表达的隐含信息。

 \parinterval 如何融入视觉信息，更好地理解多模态上下文语义是多模态机器翻译研究的重点\upcite{DBLP:conf/wmt/SpeciaFSE16,DBLP:conf/wmt/CaglayanABGBBMH17,DBLP:conf/wmt/LibovickyHTBP16}，主要方向包括基于特征融合的方法\upcite{DBLP:conf/emnlp/CalixtoL17,DBLP:journals/corr/abs-1712-03449,DBLP:conf/wmt/HelclLV18}、基于联合模型的方法\upcite{DBLP:conf/ijcnlp/ElliottK17,DBLP:conf/acl/YinMSZYZL20}。下面是具体介绍。

@@ -311,7 +284,7 @@

 \subsubsection{1. 基于特征融合的方法}

-\parinterval 早期，通常将图像信息作为输入句子的一部分\upcite{DBLP:conf/emnlp/CalixtoL17,DBLP:conf/wmt/HuangLSOD16}，或者用其对编码器、解码器的状态进行初始化\upcite{DBLP:conf/emnlp/CalixtoL17,Elliott2015MultilingualID,DBLP:conf/wmt/MadhyasthaWS17}。如图\ref{fig:17-12}所示，对图像特征的提取通常是基于卷积神经网络，有关卷积神经网络的内容，可以参考{\chaptereleven}内容。通过卷积神经网络得到全局视觉特征，在进行维度变换后，将其作为源语言输入的一部分或者初始化状态引入到模型当中。但是，这种图像信息的引入方式有以下两个缺点：
+\parinterval 早期，通常将图像信息作为输入句子的一部分\upcite{DBLP:conf/emnlp/CalixtoL17,DBLP:conf/wmt/HuangLSOD16}，或者用其对编码器、解码器的状态进行初始化\upcite{DBLP:conf/emnlp/CalixtoL17,Elliott2015MultilingualID,DBLP:conf/wmt/MadhyasthaWS17}。如图\ref{fig:17-12}所示，图中$y_{<}$表示当前时刻之前的单词序列，对图像特征的提取通常是基于卷积神经网络，有关卷积神经网络的内容，可以参考{\chaptereleven}内容。通过卷积神经网络得到全局视觉特征，在进行维度变换后，将其作为源语言输入的一部分或者初始化状态引入到模型当中。但是，这种图像信息的引入方式有以下两个缺点：

 \begin{itemize}
    \vspace{0.5em}
@@ -364,9 +337,9 @@

 \parinterval 基于联合模型的方法通常是把翻译任务与其他视觉任务结合，进行联合训练。这种方法也可以被看做是一种多任务学习，只不过这里仅关注翻译和视觉任务。一种常见的方法是共享模型的部分参数来学习不同任务之间相似的部分，并通过特定的模块来学习每个任务特有的部分。

-\parinterval 如图\ref{fig:17-14}所示，可以将多模态机器翻译任务分解为两个子任务：机器翻译和图片生成\upcite{DBLP:conf/ijcnlp/ElliottK17}。其中机器翻译作为主任务，图片生成作为子任务。这里的图片生成指的是从一个图片描述生成对应图片，对于图片生成任务在后面还会有描述。通过单个编码器对源语言数据进行建模，然后通过两个解码器（翻译解码器和图像解码器）来学习翻译任务和图像生成任务。顶层任务学习每个任务的独立特征，底层共享参数层能够学习到更丰富的文本表示。
+\parinterval 如图\ref{fig:17-14}所示，图中$y_{<}$表示当前时刻之前的单词序列，可以将多模态机器翻译任务分解为两个子任务：机器翻译和图片生成\upcite{DBLP:conf/ijcnlp/ElliottK17}。其中机器翻译作为主任务，图片生成作为子任务。这里的图片生成指的是从一个图片描述生成对应图片，对于图片生成任务在后面还会有描述。通过单个编码器对源语言数据进行建模，然后通过两个解码器（翻译解码器和图像解码器）来学习翻译任务和图像生成任务。顶层任务学习每个任务的独立特征，底层共享参数层能够学习到更丰富的文本表示。

-\parinterval 另外在视觉问答领域有研究表明，在多模态任务中，不宜引入过多层的注意力机制，因为过深的模型会导致多模态模型的过拟合\upcite{DBLP:conf/nips/LuYBP16}。这一方面是由于深模型本身对数据的拟合能力，另一方面也是由于多模态任务的数据普遍较小，容易造成复杂模型的过拟合。从另一角度来说，利用多任务学习的方式，提高模型的泛化能力，也是一种有效防止过拟合现象的方式。类似的思想，也大量使用在多模态自然语言处理中，例如图像描述生成、视觉问答等\upcite{DBLP:conf/iccv/AntolALMBZP15}。
+\parinterval 另外在视觉问答领域有研究表明，在多模态任务中，不宜引入过多层的注意力机制，因为过深的模型会导致多模态模型的过拟合\upcite{DBLP:conf/nips/LuYBP16}。这一方面是由于深模型本身对数据的拟合能力，另一方面也是由于多模态任务的数据普遍较小，容易造成复杂模型的过拟合。从另一角度来说，利用多任务学习的方式，提高模型的泛化能力，也是一种有效防止过拟合现象的方式。类似的思想，也大量使用在多模态自然语言处理任务中，例如图像描述生成、视觉问答等\upcite{DBLP:conf/iccv/AntolALMBZP15}。

 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
@@ -383,7 +356,7 @@

 \subsection{图像到文本的翻译}

-\parinterval 图像到文本的转换也可以看作是广义上的翻译，简单来说，就是把图像作为了源语言的唯一输入，而输出是文本。其中，图像描述生成是最典型的图像到文本的翻译任务\upcite{DBLP:conf/ijcai/BernardiCEEEIKM17}。虽然，这部分内容并不是本书的重点，不过为了保证多模态翻译内容的完整性，这里对相关技术进行简要介绍。图像描述有时也被称为看图说话、图像字幕生成，它在图像检索、智能导盲、人机交互等领域有着广泛的应用场景。
+\parinterval 图像到文本的转换也可以看作是广义上的翻译，简单来说，就是把图像作为唯一的输入，而输出是文本。其中，图像描述生成是最典型的图像到文本的翻译任务\upcite{DBLP:conf/ijcai/BernardiCEEEIKM17}。虽然，这部分内容并不是本书的重点，不过为了保证多模态翻译内容的完整性，这里对相关技术进行简要介绍。图像描述有时也被称为看图说话、图像字幕生成，它在图像检索、智能导盲、人机交互等领域有着广泛的应用场景。

 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
@@ -402,7 +375,7 @@

 \subsubsection{1. 基础框架}

-\parinterval 在编码器-解码器框架中，编码器将输入的图像转换为一种新的“表示”形式，这种“表示”包含了输入图像的所有信息。之后解码器把这种“表示”转换为自然语言描述。比如，可以通过卷积神经网络提取图像特征到一个向量表示。然后，利用长短时记忆网络（LSTM）解码生成文字描述，这个过程与机器翻译的解码过程类似。这种建模方式存在与\ref{sec:image-augmented-translation}描述一样的问题：生成的描述单词不一定需要所有的图像信息，将全局的图像信息送入模型中，可能会引入噪音。这时也可以使用注意力机制对其进行缓解\upcite{DBLP:conf/icml/XuBKCCSZB15}。
+\parinterval 在编码器-解码器框架中，编码器将输入的图像转换为一种新的“表示”形式，这种“表示”包含了输入图像的所有信息。之后解码器把这种“表示”转换为自然语言描述。比如，可以通过卷积神经网络提取图像特征为一个向量表示。然后，利用长短时记忆网络（LSTM）解码生成文字描述，这个过程与机器翻译的解码过程类似。这种建模方式存在与\ref{sec:image-augmented-translation}描述一样的问题：生成的描述单词不一定需要所有的图像信息，将全局的图像信息送入模型中，可能会引入噪音。这时可以使用注意力机制来缓解该问题\upcite{DBLP:conf/icml/XuBKCCSZB15}。

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -433,7 +406,7 @@

 \parinterval 由于解码器输出的是语言文字序列，因此需要考虑语言的特点对其进行改进。例如，解码过程中，“the”、“on”、“at”这种介词或者冠词与图像的相关性较低\upcite{DBLP:conf/cvpr/LuXPS17}。因此，可以通过门控单元，控制视觉信号作用于文字生成的程度。另外，在解码过程中，生成的每个单词所对应的图像区域可能是不同的。因此也可以设计更为有效的注意力机制来捕捉解码器端对不同图像局部信息的关注程度\upcite{DBLP:conf/cvpr/00010BT0GZ18}。

-\parinterval 除了更好地使生成文本与图像特征进行相互作用以外，还有一些改进方法。例如，用卷积神经网络或者Transformer代替解码器所使用的循环神经网络\upcite{DBLP:conf/cvpr/AnejaDS18}。或者使用更深层的神经网络学习动词或者名词等视觉中不易表现出来的单词\upcite{DBLP:journals/mta/FangWCT18}，其思想与深层神经机器翻译模型有相通之处（见{\chapterfifteen}）。
+\parinterval 除了更好地使生成文本与图像特征进行相互作用以外，还有一些改进方法。例如，用卷积神经网络或者Transformer代替解码器所使用的循环神经网络\upcite{DBLP:conf/cvpr/AnejaDS18}。或者使用更深层的神经网络学习动词或者形容词等视觉中不易表现出来的单词\upcite{DBLP:journals/mta/FangWCT18}，其思想与深层神经机器翻译模型有相通之处（见{\chapterfifteen}）。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION

--- a/Chapter2/Figures/figure-word-frequency-distribution.tex
+++ b/Chapter2/Figures/figure-word-frequency-distribution.tex
 \definecolor{ublue}{rgb}{0.152,0.250,0.545}
 \begin{tikzpicture}
 \begin{axis}[
-  width=11cm,
+  width=13cm,
  height=5.5cm,
  xlabel={WikiText-103上的词表},
  ylabel={词汇出现总次数},

--- a/Chapter2/chapter2.tex
+++ b/Chapter2/chapter2.tex
@@ -862,9 +862,6 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\

 \parinterval 当任务对单词序列长度没有限制时，上述两种方法枚举出的单词序列也是无穷无尽的。因此这两种枚举策略并不具备完备性而且会导致枚举过程无法停止。由于日常生活中通常不会见到特别长的句子，因此可以通过限制单词序列的最大长度来避免这个问题。一旦单词序列的最大长度被确定，以上两种枚举策略就可以在一定时间内枚举出所有可能的单词序列，因而一定可以找到最优的单词序列，即具备最优性。

-\parinterval 此时上述生成策略虽然可以满足完备性和最优性，但其仍然算不上是优秀的生成策略，因为这两种算法在时间复杂度和空间复杂度上的表现很差，如表\ref{tab:2-4}所示。其中$|V|$为词表大小，$m$ 为序列长度。值得注意的是，在之前的遍历过程中，除了在序列开头一定会挑选<sos>之外，其他位置每次可挑选的单词并不只有词表中的单词，还有结束符号<eos>，因此实际上生成过程中每个位置的单词候选数量为$|V|+1$。
-
-\vspace{0.5em}
 %------------------------------------------------------
 \begin{table}[htp]{
 \begin{center}
@@ -881,6 +878,8 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\
 }\end{table}
 %------------------------------------------------------

+\parinterval 此时上述生成策略虽然可以满足完备性和最优性，但其仍然算不上是优秀的生成策略，因为这两种算法在时间复杂度和空间复杂度上的表现很差，如表\ref{tab:2-4}所示。其中$|V|$为词表大小，$m$ 为序列长度。值得注意的是，在之前的遍历过程中，除了在序列开头一定会挑选<sos>之外，其他位置每次可挑选的单词并不只有词表中的单词，还有结束符号<eos>，因此实际上生成过程中每个位置的单词候选数量为$|V|+1$。
+
 \parinterval 那么是否有比枚举策略更高效的方法呢？答案是肯定的。一种直观的方法是将搜索的过程表示成树型结构，称为解空间树。它包含了搜索过程中可生成的全部序列。该树的根节点恒为<sos>，代表序列均从<sos> 开始。该树结构中非叶子节点的兄弟节点有$|V|+1$个，由词表和结束符号<eos>构成。从图\ref{fig:2-13}可以看到，对于一个最大长度为4的序列的搜索过程，生成某个单词序列的过程实际上就是访问解空间树中从根节点<sos> 开始一直到叶子节点<eos>结束的某条路径，而这条路径上的节点按顺序组成了一段独特的单词序列。此时对所有可能单词序列的枚举就变成了对解空间树的遍历。并且枚举的过程与语言模型打分的过程也是一致的，每枚举一个词$i$也就是在图\ref{fig:2-13}选择$w_i$一列的一个节点，语言模型就可以为当前的树节点$w_i$给出一个分值，即$\funp{P}(w_i | w_1 w_2 \ldots w_{i-1})$。对于$n$-gram语言模型，这个分值可以表示为$\funp{P}(w_i | w_1 w_2 \ldots w_{i-1})=\funp{P}(w_i | w_{i-n+1} \ldots w_{i-1})$。

 %----------------------------------------------

--- a/Chapter7/chapter7.tex
+++ b/Chapter7/chapter7.tex
@@ -124,14 +124,14 @@

 \subsection{机器翻译中的短语}

-\parinterval 基于短语的机器翻译的基本假设是：双语句子的生成可以用短语之间的对应关系进行表示。图\ref{fig:7-9}展示了一个基于短语的翻译实例。可以看到，这里的翻译单元是连续的词串。比如，“进口”的译文“The imports have”就包含了三个单词，而“下降/了”也是一个包含两个单词的源语言片段。
+\parinterval 基于短语的机器翻译的基本假设是：双语句子的生成可以用短语之间的对应关系进行表示。图\ref{fig:7-6}展示了一个基于短语的翻译实例。可以看到，这里的翻译单元是连续的词串。比如，“进口”的译文“The imports have”就包含了三个单词，而“下降/了”也是一个包含两个单词的源语言片段。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-example-of-zh2en-translation-base-phrase}
 \caption{基于短语的汉英翻译实例}
-\label{fig:7-9}
+\label{fig:7-6}
 \end{figure}
 %-------------------------------------------

@@ -204,7 +204,7 @@ p_4 &=& \text{任务}\nonumber

 \parinterval 基于短语的翻译推导定义了一种从源语言短语序列到目标语言短语序列的对应，其中源语言短语序列是源语言句子的一种切分，同样的，目标语言短语序列是目标语言句子的一种切分。翻译推导提供了一种描述翻译过程的手段：对于一个源语言句子，可以找到从它出发的翻译推导，推导中短语的目标语部分就构成了译文。也就是，每个源语言句子$\seq{s}$上的一个推导$d$都蕴含着一个目标语句子$\seq{t}$。

-\parinterval 图\ref{fig:7-10}给出了一个由三个双语短语$\{(\bar{s}_{\bar{a}_1},\bar{t}_1),(\bar{s}_{\bar{a}_2},\bar{t}_2),(\bar{s}_{\bar{a}_3},\bar{t}_3)\}$ 构成的汉英互译句对，其中短语对齐信息为$\bar{a}_1 = 1$，$\bar{a}_2 = 2$，$\bar{a}_3 = 3$。这里，可以把这三个短语对的组合看作是翻译推导，形式化表示为如下公式：
+\parinterval 图\ref{fig:7-7}给出了一个由三个双语短语$\{(\bar{s}_{\bar{a}_1},\bar{t}_1),(\bar{s}_{\bar{a}_2},\bar{t}_2),(\bar{s}_{\bar{a}_3},\bar{t}_3)\}$ 构成的汉英互译句对，其中短语对齐信息为$\bar{a}_1 = 1$，$\bar{a}_2 = 2$，$\bar{a}_3 = 3$。这里，可以把这三个短语对的组合看作是翻译推导，形式化表示为如下公式：

 \begin{eqnarray}
 d & = & {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \circ {(\bar{s}_{\bar{a}_3},\bar{t}_3)}
@@ -218,7 +218,7 @@ d & = & {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)
 \centering
 \input{./Chapter7/Figures/figure-derivation-consist-of-bilingual-phrase}
 \caption{三个双语短语$\{(\bar{s}_{\bar{a}_1},\bar{t}_1),(\bar{s}_{\bar{a}_2},\bar{t}_2),(\bar{s}_{\bar{a}_3},\bar{t}_3)\}$构成的翻译推导}
-\label{fig:7-10}
+\label{fig:7-7}
 \end{figure}
 %-------------------------------------------

@@ -366,14 +366,14 @@ d & = & {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)

 在基于短语的翻译模型中，通常包含三类特征：短语翻译特征、调序特征、语言模型相关的特征。这些特征都需要从训练数据中学习。

-\parinterval 图\ref{fig:7-11}展示了一个基于短语的机器翻译模型的搭建流程。其中的训练数据包括双语平行语料和目标语言单语语料。首先，需要从双语平行数据中学习短语的翻译，并形成一个短语翻译表；然后，再从双语平行数据中学习调序模型；最后，从目标语单语数据中学习语言模型。短语翻译表、调序模型、语言模型都会作为特征被送入判别式模型，由解码器完成对新句子的翻译。而这些特征的权重可以在额外的开发集上进行调优。关于短语抽取、调序模型和翻译特征的学习，会在本章的\ref{section-7.3}-\ref{section-7.6}节进行介绍。
+\parinterval 图\ref{fig:7-8}展示了一个基于短语的机器翻译模型的搭建流程。其中的训练数据包括双语平行语料和目标语言单语语料。首先，需要从双语平行数据中学习短语的翻译，并形成一个短语翻译表；然后，再从双语平行数据中学习调序模型；最后，从目标语单语数据中学习语言模型。短语翻译表、调序模型、语言模型都会作为特征被送入判别式模型，由解码器完成对新句子的翻译。而这些特征的权重可以在额外的开发集上进行调优。关于短语抽取、调序模型和翻译特征的学习，会在本章的\ref{section-7.3}-\ref{section-7.6}节进行介绍。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-process-of-machine-translation-base-phrase}
 \caption{基于短语的机器翻译的系统流程}
-\label{fig:7-11}
+\label{fig:7-8}
 \end{figure}
 %-------------------------------------------

@@ -384,14 +384,14 @@ d & = & {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)
 \sectionnewpage
 \section{短语抽取}\label{section-7.3}

-\parinterval 在基于短语的模型中，学习短语翻译是重要的步骤之一。获得短语翻译的方法有很多种，最常用的方法是从双语平行语料中进行{\small\bfnew{短语抽取}}\index{短语抽取}（Phrase Extraction）\index{Phrase Extraction}。前面已经介绍过短语的概念，句子中任意的连续子串都被称为短语。例如在图\ref{fig:7-12}中，用点阵的形式来表示双语之间的对应关系，那么图中任意一个矩形框都可以构成一个双语短语（或短语对），例如“什么/都/没”对应“learned nothing ？”。
+\parinterval 在基于短语的模型中，学习短语翻译是重要的步骤之一。获得短语翻译的方法有很多种，最常用的方法是从双语平行语料中进行{\small\bfnew{短语抽取}}\index{短语抽取}（Phrase Extraction）\index{Phrase Extraction}。前面已经介绍过短语的概念，句子中任意的连续子串都被称为短语。例如在图\ref{fig:7-9}中，用点阵的形式来表示双语之间的对应关系，那么图中任意一个矩形框都可以构成一个双语短语（或短语对），例如“什么/都/没”对应“learned nothing ？”。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-unlimited-phrase-extraction}
 \caption{无限制的短语抽取}
-\label{fig:7-12}
+\label{fig:7-9}
 \end{figure}
 %-------------------------------------------

@@ -403,7 +403,7 @@ d & = & {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)

 \subsection{与词对齐一致的短语}

-\parinterval 图\ref{fig:7-13}中大蓝色方块代表词对齐。通过词对齐信息，可以很容易地获得双语短语“天气 $\leftrightarrow$ The weather”。这里称其为与词对齐一致（兼容）的双语短语。具体定义如下：
+\parinterval 图\ref{fig:7-10}中大蓝色方块代表词对齐。通过词对齐信息，可以很容易地获得双语短语“天气 $\leftrightarrow$ The weather”。这里称其为与词对齐一致（兼容）的双语短语。具体定义如下：

 %-------------------------------------------
 \vspace{0.5em}
@@ -420,29 +420,29 @@ d & = & {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)
 \centering
 \input{./Chapter7/Figures/figure-phrase-extraction-consistent-with-word-alignment}
 \caption{与词对齐一致的短语抽取}
-\label{fig:7-13}
+\label{fig:7-10}
 \end{figure}
 %-------------------------------------------

-\parinterval 如图\ref{fig:7-14}所示，左边的例子中的$t_1$和$t_2$严格地对应到$s_1$、$s_2$、$s_3$，所以短语是与词对齐相一致的；中间例子中的$t_2$对应到短语$s_1$和$s_2$的外面，所以短语是与词对齐不一致的；类似的，右边的例子中短语与词对齐也是相一致的。
+\parinterval 如图\ref{fig:7-11}所示，左边的例子中的$t_1$和$t_2$严格地对应到$s_1$、$s_2$、$s_3$，所以短语是与词对齐相一致的；中间例子中的$t_2$对应到短语$s_1$和$s_2$的外面，所以短语是与词对齐不一致的；类似的，右边的例子中短语与词对齐也是相一致的。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-consistence-of-word-alignment}
 \caption{词对齐一致性示例}
-\label{fig:7-14}
+\label{fig:7-11}
 \end{figure}
 %-------------------------------------------

-\parinterval 图\ref{fig:7-15}展示了与词对齐一致的短语抽取过程，首先判断抽取得到的双语短语是否与词对齐保持一致，若一致，则抽取出来。在实际抽取过程中，通常需要对短语的最大长度进行限制，以免抽取过多的无用短语。比如，在实际系统中，最大短语长度一般是5-7个词。
+\parinterval 图\ref{fig:7-12}展示了与词对齐一致的短语抽取过程，首先判断抽取得到的双语短语是否与词对齐保持一致，若一致，则抽取出来。在实际抽取过程中，通常需要对短语的最大长度进行限制，以免抽取过多的无用短语。比如，在实际系统中，最大短语长度一般是5-7个词。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-phrase-extraction-consistent-with-word-alignment-1}
 \caption{与词对齐一致的短语抽取}
-\label{fig:7-15}
+\label{fig:7-12}
 \end{figure}
 %-------------------------------------------

@@ -454,14 +454,14 @@ d & = & {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)

 \parinterval 如何获得词对齐呢？{\chapterfive}和{\chaptersix}介绍的IBM模型本身就是一个词对齐模型，因此一种常用的方法是直接使用IBM模型生成词对齐。IBM模型约定每个源语言单词必须对应、也只能对应到一个目标语单词。因此，IBM 模型得到的词对齐结果是不对称的。正常情况下词对齐可以是一个源语言单词对应多个目标语言单词，或者多对一，甚至多对多的情况。为了获得对称的词对齐，一种简单的方法是，分别进行正向翻译和反向翻译的词对齐，然后利用启发性方法生成对称的词对齐，例如，双向词对齐取交集、并集等。

-\parinterval 如图\ref{fig:7-16}中，左边两个图就是正向和反向两种词对齐的结果。右边的图是融合双向词对齐的结果，取交集是蓝色的方框，取并集是红色的方框。当然，还可以设计更多的启发性规则生成词对齐\upcite{koehn2000estimating}。
+\parinterval 如图\ref{fig:7-13}中，左边两个图就是正向和反向两种词对齐的结果。右边的图是融合双向词对齐的结果，取交集是蓝色的方框，取并集是红色的方框。当然，还可以设计更多的启发性规则生成词对齐\upcite{koehn2000estimating}。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-get-word-alignment}
 \caption{词对齐的获取}
-\label{fig:7-16}
+\label{fig:7-13}
 \end{figure}
 %-------------------------------------------

@@ -489,25 +489,25 @@ d & = & {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)

 \parinterval 它表达的意思是短语$\bar{s}$和$\bar{t}$存在词汇级的对应关系，其中$a(j,i)=1$表示双语句对$(\seq{s},\seq{t})$中单词$s_j$和单词$t_i$对齐，$\sigma $表示词汇翻译概率，用来度量两个单词之间翻译的可能性大小（见{\chapterfive}），作为两个词之间对应的强度。

-\parinterval 下面来看一个具体的例子，如图\ref{fig:7-17}所示。对于一个双语短语，将它们的词对齐关系代入到公式\eqref{eq:7-14}就会得到短语的词汇翻译概率。对于词汇翻译概率，可以使用IBM 模型中的单词翻译表，也可以通过统计获得\upcite{koehn2002learning}。如果一个单词的词对齐为空，则用$N$表示它翻译为空的概率。和短语翻译概率一样，可以使用双向的词汇化翻译概率来评价双语短语的好坏。
-
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-example-of-vocabulary-translation-probability}
 \caption{词汇翻译概率实例}
-\label{fig:7-17}
+\label{fig:7-14}
 \end{figure}
 %-------------------------------------------

-\parinterval 经过上面的介绍，可以从双语平行语料中把双语短语抽取出来，同时得到相应的翻译概率（即特征），组成{\small\bfnew{短语表}}\index{短语表}（Phrase Table）\index{Phrase Table}。图\ref{fig:7-18}展示了一个真实短语表的片段。其中包括源语言短语和目标语言短语，用|||进行分割。每个双语对应的得分，包括正向和反向的词汇翻译概率以及短语翻译概率，还包括词对齐信息（0-0、1-1）等其他信息。
+\parinterval 来看一个具体的例子，如图\ref{fig:7-14}所示。对于一个双语短语，将它们的词对齐关系代入到公式\eqref{eq:7-14}就会得到短语的词汇翻译概率。对于词汇翻译概率，可以使用IBM 模型中的单词翻译表，也可以通过统计获得\upcite{koehn2002learning}。如果一个单词的词对齐为空，则用$N$表示它翻译为空的概率。和短语翻译概率一样，可以使用双向的词汇化翻译概率来评价双语短语的好坏。
+
+\parinterval 经过上面的介绍，可以从双语平行语料中把双语短语抽取出来，同时得到相应的翻译概率（即特征），组成{\small\bfnew{短语表}}\index{短语表}（Phrase Table）\index{Phrase Table}。图\ref{fig:7-15}展示了一个真实短语表的片段。其中包括源语言短语和目标语言短语，用|||进行分隔。每个双语短语对应的得分，包括正向和反向的词汇翻译概率以及短语翻译概率，还包括词对齐信息（0-0、1-1）等其他信息。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-example-of-phrase-table}
 \caption{短语表实例}
-\label{fig:7-18}
+\label{fig:7-15}
 \end{figure}
 %-------------------------------------------

@@ -519,14 +519,14 @@ d & = & {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)

 \parinterval 尽管已经知道了如何将一个源语言短语翻译成目标语言短语，但是想要获得一个高质量的译文，仅有互译的双语短语是远远不够的。

-\parinterval 如图\ref{fig:7-19}所示，按照从左到右的顺序对一个句子“在/桌子/上/的/苹果”进行翻译，得到的译文“on the table the apple”的语序是不对的。虽然可以使用$n$-gram语言模型对语序进行建模，但是此处仍然需要用更加准确的方式描述目标语短语间的次序。一般，把这个问题称为短语调序，或者简称{\small\bfnew{调序}}\index{调序}（Reordering）\index{Reordering}。通常，基于短语的调序模型会作为判别式模型的特征参与到翻译过程中来。接下来，会介绍3 种不同的调序方法，分别是基于距离的调序、基于方向的调序（MSD模型）以及基于分类的调序。
+\parinterval 如图\ref{fig:7-16}所示，按照从左到右的顺序对一个句子“在/桌子/上/的/苹果”进行翻译，得到的译文“on the table the apple”的语序是不对的。虽然可以使用$n$-gram语言模型对语序进行建模，但是此处仍然需要用更加准确的方式描述目标语短语间的次序。一般，把这个问题称为短语调序，或者简称{\small\bfnew{调序}}\index{调序}（Reordering）\index{Reordering}。通常，基于短语的调序模型会作为判别式模型的特征参与到翻译过程中来。接下来，会介绍3 种不同的调序方法，分别是基于距离的调序、基于方向的调序（MSD模型）以及基于分类的调序。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-reorder-base-phrase-translation}
 \caption{基于短语翻译的调序}
-\label{fig:7-19}
+\label{fig:7-16}
 \end{figure}
 %-------------------------------------------

@@ -546,14 +546,14 @@ dr & = & {\rm{start}}_i-{\rm{end}}_{i-1}-1
 \label{eq:7-15}
 \end{eqnarray}

-\parinterval 在图\ref{fig:7-20}的例子中，“the apple”所对应的调序距离为4，“on the table”所对应的调序距离为$-5$。显然，如果两个源语短语按顺序翻译，则${\rm{start}}_i = {\rm{end}}_{i-1} + 1$，这时调序距离为0。
+\parinterval 在图\ref{fig:7-17}的例子中，“the apple”所对应的调序距离为4，“on the table”所对应的调序距离为$-5$。显然，如果两个源语短语按顺序翻译，则${\rm{start}}_i = {\rm{end}}_{i-1} + 1$，这时调序距离为0。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-reorder-base-distance}
 \caption{基于距离的调序}
-\label{fig:7-20}
+\label{fig:7-17}
 \end{figure}
 %-------------------------------------------

@@ -567,14 +567,14 @@ dr & = & {\rm{start}}_i-{\rm{end}}_{i-1}-1

 \parinterval 基于方向的调序模型是另一种常用的调序模型。该模型是一种典型的词汇化调序模型，因此调序的结果会根据不同短语有所不同。简单来说，在两个短语目标语言端连续的情况下，该模型会判断两个双语短语在源语言端的调序情况，包含三种调序类型：顺序的单调翻译（M）、与前一个短语交换位置（S）、非连续翻译（D）。因此，这个模型也被称作MSD调序模型，也是Moses等经典的机器翻译系统所采用的调序模型\upcite{Koehn2007Moses}。

-\parinterval 图\ref{fig:7-21}展示了这三种调序类型，当两个短语对在源语言和目标语言中都是按顺序排列时，它们就是单调的（如：从左边数前两个短语）；如果对应的短语顺序在目标语中是反过来的，属于交换调序（如：从左边数第三和第四个短语）；如果两个短语之间还有其他的短语，就是非连续调序（如：从右边数的前两个短语）。
+\parinterval 图\ref{fig:7-18}展示了这三种调序类型，当两个短语对在源语言和目标语言中都是按顺序排列时，它们就是单调的（如：从左边数前两个短语）；如果对应的短语顺序在目标语中是反过来的，属于交换调序（如：从左边数第三和第四个短语）；如果两个短语之间还有其他的短语，就是非连续调序（如：从右边数的前两个短语）。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-three-types-of-reorder-method-in-msd}
 \caption{词汇化调序模型的三种调序类型}
-\label{fig:7-21}
+\label{fig:7-18}
 \end{figure}
 %-------------------------------------------

@@ -586,14 +586,14 @@ dr & = & {\rm{start}}_i-{\rm{end}}_{i-1}-1

 \noindent 其中，$o_i$表示（目标语言）第$i$个短语的调序方向，$\seq{o}=\{o_i\}$表示短语序列的调序方向，$K$表示短语的数量。短语之间的调序概率是由双语短语以及短语对齐决定的，$o$表示调序的种类，可以取M、S、D 中的任意一种。而整个句子调序的好坏就是把相邻的短语之间的调序概率相乘（对应取log后的加法）。这样，公式\eqref{eq:7-16}把调序的好坏定义为新的特征，对于M、S、D总共就有三个特征。除了当前短语和前一个短语的调序特征，还可以定义当前短语和后一个短语的调序特征，即将上述公式中的$a_{i-1}$换成$a_{i+1}$。 于是，又可以得到三个特征。因此在MSD调序中总共可以有6个特征。

-\parinterval 具体实现时，通常使用词对齐对两个短语间的调序关系进行判断。图\ref{fig:7-22}展示了这个过程。先判断短语的左上角和右上角是否存在词对齐，再根据其位置对调序类型进行划分。每个短语对应的调序概率都可以用相对频次估计进行计算。而MSD调序模型也相当于在短语表中的每个双语短语后添加6个特征。不过，调序模型一般并不会和短语表一起存储，因此在系统中通常会看到两个独立的模型文件，分别保存短语表和调序模型。
+\parinterval 具体实现时，通常使用词对齐对两个短语间的调序关系进行判断。图\ref{fig:7-19}展示了这个过程。先判断短语的左上角和右上角是否存在词对齐，再根据其位置对调序类型进行划分。每个短语对应的调序概率都可以用相对频次估计进行计算。而MSD调序模型也相当于在短语表中的每个双语短语后添加6个特征。不过，调序模型一般并不会和短语表一起存储，因此在系统中通常会看到两个独立的模型文件，分别保存短语表和调序模型。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-judge-type-of-reorder-method}
 \caption{调序类型的判断}
-\label{fig:7-22}
+\label{fig:7-19}
 \end{figure}
 %-------------------------------------------

@@ -628,7 +628,6 @@ dr & = & {\rm{start}}_i-{\rm{end}}_{i-1}-1
 \item 短语翻译概率（取对数），包含正向翻译概率$\textrm{log}(\funp{P}(\bar{t}|\bar{s}))$和反向翻译概率$\textrm{log}(\funp{P}(\bar{s}$\\$|\bar{t}))$，它们是基于短语的模型中最主要的特征。
 \vspace{0.5em}
 \item 词汇化翻译概率（取对数），同样包含正向词汇化翻译概率$\textrm{log}(\funp{P}_{\textrm{lex}}(\bar{t}|\bar{s}))$和反向词汇化翻译概率$\textrm{log}(\funp{P}_{\textrm{lex}}(\bar{s}|\bar{t}))$，它们用来描述双语短语中单词间对应的好坏。
-\vspace{0.5em}
 \item $n$-gram语言模型，用来度量译文的流畅程度，可以通过大规模目标端单语数据得到。
 \vspace{0.5em}
 \item 译文长度，避免模型倾向于短译文，同时让系统自动学习对译文长度的偏好。
@@ -677,29 +676,29 @@ dr & = & {\rm{start}}_i-{\rm{end}}_{i-1}-1

 \parinterval 需要注意的是， BLEU本身是一个不可微分函数。因此，无法使用梯度下降等方法对式\eqref{eq:7-19}进行求解。那么如何能快速得到最优解？这里会使用一种特殊的优化方法，称作{\small\bfnew{线搜索}}\index{线搜索}（Line Search）\index{Line Search}，它是Powell搜索的一种形式\upcite{powell1964an}。这种方法也构成了最小错误率训练的核心。

-\parinterval 首先，重新看一下特征权重的搜索空间。按照前面的介绍，如果要进行暴力搜索，需要把特征权重的取值按小的间隔进行划分。这样，所有特征权重的取值可以用图\ref{fig:7-23}的网格来表示。
+\parinterval 首先，重新看一下特征权重的搜索空间。按照前面的介绍，如果要进行暴力搜索，需要把特征权重的取值按小的间隔进行划分。这样，所有特征权重的取值可以用图\ref{fig:7-20}的网格来表示。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-search-space-representation-of-feature-weight}
 \caption{特征权重的搜索空间表示}
-\label{fig:7-23}
+\label{fig:7-20}
 \end{figure}
 %-------------------------------------------

-\parinterval 其中横坐标为所有的$M$个特征函数，纵坐标为权重可能的取值。假设每个特征都有$V$种取值，那么遍历所有特征权重取值的组合有$V^M$种。每组$\lambda = \{\lambda_i\}$的取值实际上就是一个贯穿所有特征权重的折线，如图\ref{fig:7-23}中间蓝线所展示的路径。当然，可以通过枚举得到很多这样的折线（图\ref{fig:7-23}右）。假设计算BLEU的时间开销为$B$，那么遍历所有的路径的时间复杂度为$O(V^M \cdot B)$，由于$V$可能很大，而且$B$往往也无法忽略，因此这种计算方式的时间成本是极高的。如果考虑对每一组特征权重都需要重新解码得到$n$-best译文，那么基于这种简单枚举的方法是无法使用的。
+\parinterval 其中横坐标为所有的$M$个特征函数，纵坐标为权重可能的取值。假设每个特征都有$V$种取值，那么遍历所有特征权重取值的组合有$V^M$种。每组$\lambda = \{\lambda_i\}$的取值实际上就是一个贯穿所有特征权重的折线，如图\ref{fig:7-20}中间蓝线所展示的路径。当然，可以通过枚举得到很多这样的折线（图\ref{fig:7-20}右）。假设计算BLEU的时间开销为$B$，那么遍历所有的路径的时间复杂度为$O(V^M \cdot B)$，由于$V$可能很大，而且$B$往往也无法忽略，因此这种计算方式的时间成本是极高的。如果考虑对每一组特征权重都需要重新解码得到$n$-best译文，那么基于这种简单枚举的方法是无法使用的。

 \parinterval 对全搜索的一种改进是使用局部搜索。循环处理每个特征，每一次只调整一个特征权重的值，找到使BLEU达到最大的权重。反复执行该过程，直到模型达到稳定状态（例如BLEU不再提升）。

-\parinterval 图\ref{fig:7-24}左侧展示了这种方法。其中蓝色部分为固定的权重，相应的虚线部分为当前权重所有可能的取值，这样搜索一个特征权重的时间复杂度为$O(V \cdot B)$。而整个算法的时间复杂度为$O(L \cdot V \cdot B)$，其中$L$为循环访问特征的总次数。这种方法也被称作{\small\bfnew{格搜索}}\index{格搜索}（Grid Search）\index{Grid Search}。
+\parinterval 图\ref{fig:7-21}左侧展示了这种方法。其中蓝色部分为固定的权重，相应的虚线部分为当前权重所有可能的取值，这样搜索一个特征权重的时间复杂度为$O(V \cdot B)$。而整个算法的时间复杂度为$O(L \cdot V \cdot B)$，其中$L$为循环访问特征的总次数。这种方法也被称作{\small\bfnew{格搜索}}\index{格搜索}（Grid Search）\index{Grid Search}。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-grid-search}
 \caption{格搜索（左侧：所有点都访问（蓝色）；右侧：避开无效点（绿色））}
-\label{fig:7-24}
+\label{fig:7-21}
 \end{figure}
 %-------------------------------------------

@@ -715,14 +714,14 @@ dr & = & {\rm{start}}_i-{\rm{end}}_{i-1}-1
 \label{eq:7-20}
 \end{eqnarray}

-\parinterval 这里，$a = h_i(d)$是直线的斜率，$b = \sum_{k \neq i}^{M} \lambda_k \cdot h_k (d)$是截距。有了关于权重$\lambda_i$的直线表示，可以将$d_1$和$d_2$分别画成两条直线，如图\ref{fig:7-25}所示。在两条直线交叉点的左侧，$d_2$是最优的翻译结果；在交叉点右侧，$d_1$是最优的翻译结果。也就是说，只需知道交叉点左侧和右侧谁的BLEU 值高，$\lambda_i$的最优值就应该落在相应的范围，比如，这个例子中交叉点右侧（即$d_2$）所对应的BLEU值更高，因此最优特征权重$\hat{\lambda}_i$应该在交叉点右侧（$\lambda_x \sim \lambda_i$任意取值都可以）。
+\parinterval 这里，$a = h_i(d)$是直线的斜率，$b = \sum_{k \neq i}^{M} \lambda_k \cdot h_k (d)$是截距。有了关于权重$\lambda_i$的直线表示，可以将$d_1$和$d_2$分别画成两条直线，如图\ref{fig:7-22}所示。在两条直线交叉点的左侧，$d_2$是最优的翻译结果；在交叉点右侧，$d_1$是最优的翻译结果。也就是说，只需知道交叉点左侧和右侧谁的BLEU 值高，$\lambda_i$的最优值就应该落在相应的范围，比如，这个例子中交叉点右侧（即$d_2$）所对应的BLEU值更高，因此最优特征权重$\hat{\lambda}_i$应该在交叉点右侧（$\lambda_x \sim \lambda_i$任意取值都可以）。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-function-image-about-weight-and-Bleu}
 \caption{推导得分关于权重的函数（左）以及对应的BLEU值变化（右）}
-\label{fig:7-25}
+\label{fig:7-22}
 \end{figure}
 %-------------------------------------------

@@ -761,7 +760,7 @@ dr & = & {\rm{start}}_i-{\rm{end}}_{i-1}-1

 \parinterval 然而想要找到得分最高的翻译推导并不是一件简单的事情。对于每一句源语言句子，可能的翻译结果是指数级的。由于机器翻译解码是一个NP完全问题\upcite{knight1999decoding}，简单的暴力搜索显然不现实。因此，在机器翻译中会使用特殊的解码策略来确保搜索的效率。本节将介绍基于栈的自左向右解码方法。它是基于短语的模型中的经典解码方法，非常适于处理语言生成的各种任务。

-\parinterval 首先，看一下翻译一个句子的基本流程。如图\ref{fig:7-26}所示，首先需要得到译文句子的第一个单词。在基于短语的模型中，可以从源语言端找出生成句首译文的短语，之后把译文放到目标语言端，例如，源语言的“有”对应的译文是“There is”。这个过程可以重复执行，直到生成完整句子的译文。但是，有两点需要注意：
+\parinterval 首先，看一下翻译一个句子的基本流程。如图\ref{fig:7-23}所示，首先需要得到译文句子的第一个单词。在基于短语的模型中，可以从源语言端找出生成句首译文的短语，之后把译文放到目标语言端，例如，源语言的“有”对应的译文是“There is”。这个过程可以重复执行，直到生成完整句子的译文。但是，有两点需要注意：

 \begin{itemize}
 \vspace{0.5em}
@@ -776,7 +775,7 @@ dr & = & {\rm{start}}_i-{\rm{end}}_{i-1}-1
 \centering
 \input{./Chapter7/Figures/figure-basic-process-of-translation}
 \caption{按目标语言短语自左向右生成的翻译实例}
-\label{fig:7-26}
+\label{fig:7-23}
 \end{figure}
 %-------------------------------------------

@@ -788,14 +787,14 @@ dr & = & {\rm{start}}_i-{\rm{end}}_{i-1}-1

 \subsection{翻译候选匹配}

-\parinterval 在解码时，首先要知道每个源语言短语可能的译文都是什么。对于一个源语言短语，每个可能的译文也被称作翻译候选。实现翻译候选的匹配很简单。只需要遍历输入的源语言句子中所有可能的短语，之后在短语表中找到相应的翻译即可。比如，图\ref{fig:7-27}展示了句子“桌子/上/有/一个/苹果”的翻译候选匹配结果。可以看到，不同的短语会对应若干翻译候选。这些翻译候选会保存在所对应的范围（被称为跨度）中。这里，跨度$[a,b]$表示从第$a+1$个词开始到第$b$个词为止所表示的词串。比如，“upon the table” 是短语“桌子/上/有”的翻译候选，即对应源语言跨度[0,3]。
+\parinterval 在解码时，首先要知道每个源语言短语可能的译文都是什么。对于一个源语言短语，每个可能的译文也被称作翻译候选。实现翻译候选的匹配很简单。只需要遍历输入的源语言句子中所有可能的短语，之后在短语表中找到相应的翻译即可。比如，图\ref{fig:7-24}展示了句子“桌子/上/有/一个/苹果”的翻译候选匹配结果。可以看到，不同的短语会对应若干翻译候选。这些翻译候选会保存在所对应的范围（被称为跨度）中。这里，跨度$[a,b]$表示从第$a+1$个词开始到第$b$个词为止所表示的词串。比如，“upon the table” 是短语“桌子/上/有”的翻译候选，即对应源语言跨度[0,3]。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-translation-option}
 \caption{一个句子匹配的短语翻译候选}
-\label{fig:7-27}
+\label{fig:7-24}
 \end{figure}
 %-------------------------------------------

@@ -807,14 +806,14 @@ dr & = & {\rm{start}}_i-{\rm{end}}_{i-1}-1

 \parinterval 接下来，需要使用这些翻译候选生成完整的译文。在机器翻译中，一个很重要的概念是{\small\bfnew{翻译假设}}\index{翻译假设}（Translation Hypothesis）\index{Translation Hypothesis}。 它可以被当作是一个局部译文所对应的短语翻译推导。在解码开始时，只有一个空假设，也就是任何译文单词都没有被生成出来。接着，可以挑选翻译选项来扩展当前的翻译假设。

-\parinterval 图\ref{fig:7-28}展示了翻译假设扩展的过程。在翻译假设扩展时，需要保证新加入的翻译候选放置在旧翻译假设译文的右侧，也就是要确保翻译自左向右的连续性。而且，同一个翻译假设可以使用不同的翻译候选进行扩展。例如，扩展第一个翻译假设时，可以选择“桌子”的翻译候选“table”；也可以选择“有”的翻译候选“There is”。扩展完之后需要记录输入句子中已翻译的短语，同时计算当前所有翻译假设的模型得分。这个过程相当于生成了一个图的结构，每个节点代表了一个翻译假设。当翻译假设覆盖了输入句子所有的短语，不能被继续扩展时，就生成了一个完整的翻译假设（译文）。最后需要找到得分最高的完整翻译假设，它对应了搜索图中的最优路径。
+\parinterval 图\ref{fig:7-25}展示了翻译假设扩展的过程。在翻译假设扩展时，需要保证新加入的翻译候选放置在旧翻译假设译文的右侧，也就是要确保翻译自左向右的连续性。而且，同一个翻译假设可以使用不同的翻译候选进行扩展。例如，扩展第一个翻译假设时，可以选择“桌子”的翻译候选“table”；也可以选择“有”的翻译候选“There is”。扩展完之后需要记录输入句子中已翻译的短语，同时计算当前所有翻译假设的模型得分。这个过程相当于生成了一个图的结构，每个节点代表了一个翻译假设。当翻译假设覆盖了输入句子所有的短语，不能被继续扩展时，就生成了一个完整的翻译假设（译文）。最后需要找到得分最高的完整翻译假设，它对应了搜索图中的最优路径。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-translation-hypothesis-extension}
 \caption{翻译假设扩展}
-\label{fig:7-28}
+\label{fig:7-25}
 \end{figure}
 %-------------------------------------------

@@ -836,18 +835,18 @@ dr & = & {\rm{start}}_i-{\rm{end}}_{i-1}-1
 \vspace{0.5em}
 \end{itemize}

-\parinterval 对翻译假设进行重新组合又被称作{\small\bfnew{假设重组}}\index{假设重组}（Hypothesis Recombination）\index{Hypothesis Recombination}。其核心思想是，把代表同一个译文的不同翻译假设融合为一个翻译假设。如图\ref{fig:7-29}所示，对于给定的输入短语“一个\ \ 苹果”，系统可能将两个单词“一个”、“苹果”分别翻译成“an”和“apple”，也可能将这两个单词作为一个短语直接翻译成“an apple”。虽然这两个翻译假设得到的译文相同，并且覆盖了相同的源语言短语，但是却是两个不同的翻译假设，模型给它们的打分也是不一样的。这时，可以舍弃两个翻译假设中分数较低的那个，因为分数较低的翻译假设永远不可能成为最优路径的一部分。这也就相当于把两个翻译假设重组为一个假设。
+\parinterval 对翻译假设进行重新组合又被称作{\small\bfnew{假设重组}}\index{假设重组}（Hypothesis Recombination）\index{Hypothesis Recombination}。其核心思想是，把代表同一个译文的不同翻译假设融合为一个翻译假设。如图\ref{fig:7-26}所示，对于给定的输入短语“一个\ \ 苹果”，系统可能将两个单词“一个”、“苹果”分别翻译成“an”和“apple”，也可能将这两个单词作为一个短语直接翻译成“an apple”。虽然这两个翻译假设得到的译文相同，并且覆盖了相同的源语言短语，但是却是两个不同的翻译假设，模型给它们的打分也是不一样的。这时，可以舍弃两个翻译假设中分数较低的那个，因为分数较低的翻译假设永远不可能成为最优路径的一部分。这也就相当于把两个翻译假设重组为一个假设。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-example-of-hypothesis-recombination}
 \caption{假设重组示例}
-\label{fig:7-29}
+\label{fig:7-26}
 \end{figure}
 %-------------------------------------------

-\parinterval 即使翻译假设对应的译文不同也可以进行假设重组。图\ref{fig:7-29}的下半部分给出了一个这样的实例。在两个翻译假设中，第一个单词分别被翻译成了“it”和“he”，紧接着它们后面的部分都被翻译成了“is not”。这两个翻译假设是非常相似的，因为它们译文的最后两个单词是相同的，而且翻译假设都覆盖了相同的源语言部分。这时，也可以对这两个翻译假设进行假设重组：如果得分较低的翻译假设和得分较高的翻译假设都使用相同的翻译候选进行扩展，且两个翻译假设都覆盖相同的源语言单词，分数低的翻译假设可以被剪枝掉。此外，还有两点需要注意：
+\parinterval 即使翻译假设对应的译文不同也可以进行假设重组。图\ref{fig:7-26}的下半部分给出了一个这样的实例。在两个翻译假设中，第一个单词分别被翻译成了“it”和“he”，紧接着它们后面的部分都被翻译成了“is not”。这两个翻译假设是非常相似的，因为它们译文的最后两个单词是相同的，而且翻译假设都覆盖了相同的源语言部分。这时，也可以对这两个翻译假设进行假设重组：如果得分较低的翻译假设和得分较高的翻译假设都使用相同的翻译候选进行扩展，且两个翻译假设都覆盖相同的源语言单词，分数低的翻译假设可以被剪枝掉。此外，还有两点需要注意：

 \begin{itemize}
 \vspace{0.5em}
@@ -883,14 +882,14 @@ dr & = & {\rm{start}}_i-{\rm{end}}_{i-1}-1

 \parinterval 比如，第一个堆栈包含了覆盖一个源语言单词的翻译假设，第二个堆栈包含了覆盖两个源语言单词的翻译假设，以此类推。利用覆盖源语言单词数进行栈的划分的原因在于：翻译相同数量的单词所对应的翻译假设一般是“可比的”，因此在同一个栈里对它们进行剪枝带来的风险较小。

-\parinterval 在基于栈的解码中，每次都会从所有的栈中弹出一个翻译假设，并选择一个或者若干个翻译假设进行扩展，之后把新得到的翻译假设重新压入解码栈中。这个过程不断执行，并可以配合束剪枝、假设重组等技术。最后在覆盖所有源语言单词的栈中得到整个句子的译文。图\ref{fig:7-30}展示了一个简单的栈解码过程。第一个栈（0号栈）用来存放空翻译假设。之后通过假设扩展，不断将翻译假设填入对应的栈中。
+\parinterval 在基于栈的解码中，每次都会从所有的栈中弹出一个翻译假设，并选择一个或者若干个翻译假设进行扩展，之后把新得到的翻译假设重新压入解码栈中。这个过程不断执行，并可以配合束剪枝、假设重组等技术。最后在覆盖所有源语言单词的栈中得到整个句子的译文。图\ref{fig:7-27}展示了一个简单的栈解码过程。第一个栈（0号栈）用来存放空翻译假设。之后通过假设扩展，不断将翻译假设填入对应的栈中。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-example-of-stack-decode}
 \caption{栈解码示例}
-\label{fig:7-30}
+\label{fig:7-27}
 \end{figure}
 %-------------------------------------------


--- a/Chapter8/Figures/figure-combination-of-translation-with-different-rules.tex
+++ b/Chapter8/Figures/figure-combination-of-translation-with-different-rules.tex
@@ -5,7 +5,7 @@
 \begin{scope}%[scale=0.2]

 \node[anchor=north] (q1) at (0,0) {\scriptsize\sffamily\bfseries{输入字符串：}};
-\node[anchor=west] (q2) at ([xshift=0em,yshift=-2em]q1.west) {\footnotesize{进口$\quad$和$\quad$出口$\quad$大幅度$\quad$下降$\quad$了}};
+\node[anchor=west] (q2) at ([xshift=0em,yshift=-1.5em]q1.west) {\footnotesize{进口$\quad$和$\quad$出口$\quad$大幅度$\quad$下降$\quad$了}};


 \node[anchor=north,fill=blue!20,minimum height=4em,minimum width=1em] (f1) at ([xshift=2.7em,yshift=-0.7em]q2.south) {};

--- a/Chapter8/Figures/figure-hierarchical-phrase-rule-match-generate.tex
+++ b/Chapter8/Figures/figure-hierarchical-phrase-rule-match-generate.tex
@@ -5,7 +5,7 @@
 \begin{scope}%[scale=0.2]

 \node[anchor=north] (q1) at (0,0) {\scriptsize\sffamily\bfseries{输入字符串：}};
-\node[anchor=west] (q2) at ([xshift=0em,yshift=-2em]q1.west) {\footnotesize{进口$\quad$和$\quad$出口$\quad$大幅度$\quad$下降$\quad$了}};
+\node[anchor=west] (q2) at ([xshift=0em,yshift=-1.5em]q1.west) {\footnotesize{进口$\quad$和$\quad$出口$\quad$大幅度$\quad$下降$\quad$了}};

 \node[anchor=north,fill=blue!20,minimum height=1em,minimum width=1em] (f1) at ([xshift=-4.2em,yshift=-0.8em]q2.south) {};


--- a/Chapter8/Figures/figure-one-best-node-alignment-and-alignment-matrix.tex
+++ b/Chapter8/Figures/figure-one-best-node-alignment-and-alignment-matrix.tex
@@ -107,7 +107,7 @@
 \vspace{-1em}
 \footnotesize{(a)节点对齐矩阵（1-best vs Matrix）}
 \end{center}
-
+\vspace{-3em}
 \begin{center}
 \begin{tabular}[t]{C{0.48\linewidth} C{0.48\linewidth} }


--- a/Chapter8/chapter8.tex
+++ b/Chapter8/chapter8.tex
@@ -301,9 +301,6 @@ d & = & {r_1} \circ {r_2} \circ {r_3} \circ {r_4}
 %----------------------------------------------------------------------------------------

 \subsubsection{4. 处理流程}
-
-\parinterval 层次短语系统的流程如图\ref{fig:8-6}所示。其核心是从双语数据中学习同步翻译文法，并进行翻译特征的学习，形成翻译模型（即规则+特征）。同时，要从目标语言数据中学习语言模型。最终，把翻译模型和语言模型一起送入解码器，在特征权重调优后，完成对新输入句子的翻译。
-
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
@@ -313,6 +310,8 @@ d & = & {r_1} \circ {r_2} \circ {r_3} \circ {r_4}
 \end{figure}
 %-------------------------------------------

+\parinterval 层次短语系统的流程如图\ref{fig:8-6}所示。其核心是从双语数据中学习同步翻译文法，并进行翻译特征的学习，形成翻译模型（即规则+特征）。同时，要从目标语言数据中学习语言模型。最终，把翻译模型和语言模型一起送入解码器，在特征权重调优后，完成对新输入句子的翻译。
+
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------
@@ -543,34 +542,38 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q
 \subsection{立方剪枝}

 \parinterval 相比于基于短语的模型，基于层次短语的模型引入了“变量”的概念。这样，可以根据变量周围的上下文信息对变量进行调序。变量的内容由其所对应的跨度上的翻译假设进行填充。图\ref{fig:8-11}展示了一个层次短语规则匹配词串的实例。可以看到，规则匹配词串之后，变量$\seq{X}$的位置对应了一个跨度。这个跨度上所有标记为X的局部推导都可以作为变量的内容。
-
+\vspace{-0.5em}
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter8/Figures/figure-hierarchical-phrase-rule-match-generate}
+\setlength{\abovecaptionskip}{-0.5em}
+\setlength{\belowcaptionskip}{-0.5em}
 \caption{Hierarchical phrase rule matching and translation generation}
 \label{fig:8-11}
 \end{figure}
 %-------------------------------------------

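 \parinterval The real situation is more complicated. A single rule source side may correspond to several different target sides. For example, the following rules share exactly the same source side but differ in their translations: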
+\vspace{-0.5em}
 \begin{eqnarray}
 \funp{X} & \to & \langle\ \funp{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \funp{X}_1\ \textrm{have}\ \textrm{drastically}\ \textrm{fallen}\ \rangle \nonumber \\
 \funp{X} & \to & \langle\ \funp{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \funp{X}_1\ \textrm{have}\ \textrm{fallen}\ \textrm{drastically}\ \rangle \nonumber \\
 \funp{X} & \to & \langle\ \funp{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \funp{X}_1\ \textrm{has}\ \textrm{drastically}\ \textrm{fallen}\ \rangle \nonumber
 \end{eqnarray}
-
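-\parinterval In other words, when the rule source side ``$\funp{X}_1$\ \ 大幅度\ \ 下降\ \ 了'' is matched, there are three target sides to choose from. The variable $\funp{X}_1$ in turn has many different partial translations. Each pairing of a rule target side with a variable translation yields one partial translation. Figure \ref{fig:8-12} shows an example of this situation.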
-
+\vspace{-2em}
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter8/Figures/figure-combination-of-translation-with-different-rules}
+\setlength{\abovecaptionskip}{-0.5em}
 \caption{Combinations of different rule target sides and variable translations}
 \label{fig:8-12}
 \end{figure}
 %-------------------------------------------

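+\parinterval In other words, when the rule source side ``$\funp{X}_1$\ \ 大幅度\ \ 下降\ \ 了'' is matched, there are three target sides to choose from. The variable $\funp{X}_1$ in turn has many different partial translations. Each pairing of a rule target side with a variable translation yields one partial translation. Figure \ref{fig:8-12} shows an example of this situation.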
+
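 \parinterval Suppose $n$ rules share the same source side and each variable in a rule can be replaced by $m$ candidates. For rules with a single variable there are $nm$ different combinations; for rules with two variables the number is $n{m}^2$. Because a large number of rule matches are performed during translation, decoding would be very slow if all $n{m}^2$ combinations were considered for every matched source side.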

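 \parinterval In hierarchical phrase-based systems the search space is pruned further. In short, instead of enumerating all $n{m}^2$ combinations, only a subset of them is considered. This method is known as {\small\bfnew{cube pruning}}\index{Cube Pruning}. The ``cube'' refers to the three dimensions along which translations are combined: the rule target sides, the translation candidates of the first variable, and the translation candidates of the second variable. Cube pruning assumes that the candidates along each dimension are sorted, for example by phrase translation probability. Every combination then has a coordinate: $(i,j,k)$ denotes the combination of the $i$-th rule target side, the $j$-th candidate of the first variable, and the $k$-th candidate of the second variable, so each combination can be viewed as a point in a three-dimensional space. Cube pruning starts from the hypothesis $(0,0,0)$ and puts it into a priority queue. At each step the best hypothesis is popped from the priority queue and its coordinate is incremented by 1 along each of the three dimensions: if $(i,j,k)$ is popped, the three new hypotheses $(i+1,j,k)$, $(i,j+1,k)$ and $(i,j,k+1)$ are generated. Their model scores are then computed and they are pushed onto the priority queue. This process repeats until a stopping condition is met, for example when the number of expansions reaches a limit.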
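The expansion loop described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the book's implementation: the function name `cube_pruning`, the purely additive log-score, and the `seen` set that keeps the same coordinate from being pushed twice are all assumptions made for the example.

```python
import heapq

def cube_pruning(rule_scores, x1_scores, x2_scores, max_pops):
    """Pop at most max_pops hypotheses; return them as (score, (i, j, k)).

    Each input list holds log-scores sorted best first. The combined
    score is a plain sum, which is an assumption made for this sketch.
    """
    dims = (len(rule_scores), len(x1_scores), len(x2_scores))
    if 0 in dims or max_pops <= 0:
        return []

    def score(coord):
        i, j, k = coord
        return rule_scores[i] + x1_scores[j] + x2_scores[k]

    start = (0, 0, 0)
    heap = [(-score(start), start)]  # heapq is a min-heap, so scores are negated
    seen = {start}                   # assumption: never push a coordinate twice
    popped = []
    while heap and len(popped) < max_pops:
        neg_score, coord = heapq.heappop(heap)
        popped.append((-neg_score, coord))
        # Expand along each of the three dimensions by adding 1.
        for axis in range(3):
            nxt = list(coord)
            nxt[axis] += 1
            nxt = tuple(nxt)
            if nxt[axis] < dims[axis] and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (-score(nxt), nxt))
    return popped

# Hypothetical toy scores: 3 rule target sides, 2 candidates per variable.
rules = [-1.0, -3.0, -6.0]
x1 = [-1.0, -2.0]
x2 = [-2.0, -5.0]
top4 = cube_pruning(rules, x1, x2, max_pops=4)
```

With these toy scores only 4 of the $3 \times 2 \times 2 = 12$ combinations are popped. In a real decoder the score also includes non-local features such as the language model, so the popped order is only approximately best-first.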
@@ -579,6 +582,7 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q
 \begin{figure}[htp]
 \centering
 \input{./Chapter8/Figures/figure-execution-of-cube-pruning}
+\setlength{\abovecaptionskip}{-0.5em}
 \caption{Execution of cube pruning (rows are rules; columns are the candidates that can replace the variable)}
 \label{fig:8-13}
 \end{figure}
@@ -678,6 +682,16 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q
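 & this kind of translation can be viewed as a transformation from syntax tree to syntax tree \\
 \rule{0pt}{15pt}Syntax-based & Uses linguistic syntax \\
 \rule{0pt}{15pt}Tree-based & (Source side) uses a tree structure (usually a syntax tree) \\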
+\end{tabular}
+}
+\end{center}
+}\end{table}
+\begin{table}[htp]{
+\begin{center}
+{
+\begin{tabular}{p{6.5em} | l}
+Term & Description \\
+\hline
 \rule{0pt}{15pt}String-based & (Source side) uses word strings; for example, the decoder of a string-to-tree system\\
 & is usually string-based \\
 \rule{0pt}{15pt}基于森林 &（源语言）使用句法森林，这里森林只是对多个句法树的一 \\
@@ -686,26 +700,11 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q
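 \rule{0pt}{15pt}Non-lexical rule & Rules containing no terminals \\
 \rule{0pt}{15pt}Soft syntactic constraint & Does not force derivations to match the linguistic syntax tree; syntactic information is usually used as\\
 & features \\
-\rule{0pt}{15pt}Hard syntactic constraint & Requires derivations to conform to the linguistic syntax tree; derivations that do not are filtered out
+\rule{0pt}{15pt}Hard syntactic constraint & Requires derivations to conform to the linguistic syntax tree; derivations that do not are filtered out \\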
 \end{tabular}
 }
 \end{center}
 }\end{table}
-%\begin{table}[htp]{
-%\begin{center}
-%{
-%\begin{tabular}{p{6.5em} | l}
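-%Term & Description \\
-%\hline
-%\rule{0pt}{15pt}Lexicalized rule & Rules containing terminals \\
-%\rule{0pt}{15pt}Non-lexical rule & Rules containing no terminals \\
-%\rule{0pt}{15pt}Soft syntactic constraint & Does not force derivations to match the linguistic syntax tree; syntactic information is usually used as\\
-%& features \\
-%\rule{0pt}{15pt}Hard syntactic constraint & Requires derivations to conform to the linguistic syntax tree; derivations that do not are filtered out \\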
-%\end{tabular}
-%}
-%\end{center}
-%}\end{table}
 %----------------------------------------------

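 \parinterval Syntax-based translation models fall into two classes: formally syntax-based models and linguistically syntax-based models (Figure \ref{fig:8-17}). Typical formally syntax-based models include models based on inversion transduction grammar\upcite{wu1997stochastic} and hierarchical phrase-based models\upcite{chiang2007hierarchical}. Linguistically syntax-based models include tree-to-string models\upcite{liu2006tree,huang2006statistical}, string-to-tree models\upcite{galley2006scalable,galley2004s}, and tree-to-tree models\upcite{eisner2003learning,zhang2008tree}, among others.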
@@ -884,7 +883,9 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 & \xrightarrow[r_8]{\textrm{VP}^{[3]} \Leftrightarrow \textrm{VP}^{[3]}} & \langle\ \textrm{IP(NN(进口)}\ \textrm{VP(AD}^{[4]}\ \textrm{VP(VV}^{[5]}\ \textrm{AS}^{[6]}))), \nonumber \\
 &                 & \ \ \textrm{S(NP(DT(the) NNS(imports))}\ \textrm{VP(VBP}^{[6]}\ \textrm{ADVP(RB}^{[4]}\ \textrm{VBN}^{[5]})))\ \rangle \hspace{4.5em} \nonumber \\
 & \xrightarrow[r_3]{\textrm{AD}^{[4]} \Leftrightarrow \textrm{RB}^{[4]}} & \langle\ \textrm{IP(NN(进口)}\ \textrm{VP(AD(大幅度)}\ \textrm{VP(VV}^{[5]}\ \textrm{AS}^{[6]}))), \nonumber \\
-&                 & \ \ \textrm{S(NP(DT(the) NNS(imports))}\ \textrm{VP(VBP}^{[6]}\ \textrm{ADVP(RB(drastically)}\  \textrm{VBN}^{[5]})))\ \rangle \nonumber \\
+&                 & \ \ \textrm{S(NP(DT(the) NNS(imports))}\ \textrm{VP(VBP}^{[6]}\ \textrm{ADVP(RB(drastically)}\  \textrm{VBN}^{[5]})))\ \rangle \nonumber 
+\end{eqnarray}
+\begin{eqnarray}
 & \xrightarrow[r_4]{\textrm{VV}^{[5]} \Leftrightarrow \textrm{VBN}^{[5]}} & \langle\ \textrm{IP(NN(进口)}\ \textrm{VP(AD(大幅度)}\ \textrm{VP(VV(减少)}\ \textrm{AS}^{[6]}))), \hspace{10em} \nonumber \\
 &                 & \ \ \textrm{S(NP(DT(the) NNS(imports))}\ \textrm{VP(VBP}^{[6]}\ \nonumber \\
 &                 & \ \ \textrm{ADVP(RB(drastically)}\ \textrm{VBN(fallen)})))\ \rangle \nonumber \\
@@ -1097,7 +1098,6 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
 %----------------------------------------------------------------------------------------
-
 \subsubsection{2. Handling Null Alignments}

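 \parinterval Null alignment is common in translation. For example, some function words often have no counterpart in the other language and are therefore left untranslated; this is called null alignment. In Figure \ref{fig:8-27}, the target-language word ``was'' is a null-aligned word. Allowing null alignments makes translation considerably more flexible. For tree-to-string rule extraction in particular, null alignments need to be taken into account so that more linguistic phenomena can be covered.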
@@ -1258,27 +1258,30 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \subsubsection{1. Rule Extraction Based on Node Alignment}

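 \parinterval However, the GHKM method relies too heavily on word alignments. What tree-to-tree translation really needs is the correspondence between tree structures (nodes), not between words. Especially when syntax-tree constraints are imposed on both sides, word alignment errors can cause fairly serious rule extraction errors. Figure \ref{fig:8-34} gives an example in which the Chinese word ``了'' is wrongly aligned to the English word ``the'', so that many high-quality rules cannot be extracted.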
-
+\vspace{-0.5em}
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter8/Figures/figure-tree-to-tree-rule-extraction-base-word-alignment}
+%\setlength{\abovecaptionskip}{-0.1em}
 \caption{Tree-to-tree rule extraction based on word alignment}
 \label{fig:8-34}
 \end{figure}
 %-------------------------------------------
-
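-\parinterval From another angle, word alignment merely helps the model find the correspondence between nodes of the two languages' syntax trees. If the node correspondence can be obtained directly, word alignment errors can be avoided. That is, node alignment can be used directly for tree-to-tree rule extraction. First, an external node alignment tool is used to obtain the alignment between the nodes of the two syntax trees. Then each aligned node is treated as the root of a tree fragment, and rules are extracted. Figure \ref{fig:8-35} shows the result of rule extraction based on node alignment.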
-
+\vspace{-2em}
 %----------------------------------------------
 \begin{figure}[htb]
 \centering
 \input{./Chapter8/Figures/figure-tree-to-tree-rule-extraction-base-node-alignment}
+%\setlength{\abovecaptionskip}{-0.1em}
 \caption{Tree-to-tree rule extraction based on node alignment}
 \label{fig:8-35}
 \end{figure}
 %-------------------------------------------

+\vspace{-1.0em}
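+\parinterval From another angle, word alignment merely helps the model find the correspondence between nodes of the two languages' syntax trees. If the node correspondence can be obtained directly, word alignment errors can be avoided. That is, node alignment can be used directly for tree-to-tree rule extraction. First, an external node alignment tool is used to obtain the alignment between the nodes of the two syntax trees. Then each aligned node is treated as the root of a tree fragment, and rules are extracted. Figure \ref{fig:8-35} shows the result of rule extraction based on node alignment.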
+
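 \parinterval As can be seen, node alignment avoids the impact of word alignment errors. However, it requires developing additional tools. Many methods are available for reference, for example methods based on heuristic rules\upcite{DBLP:conf/coling/GrovesHW04}, methods based on classification models\upcite{DBLP:conf/coling/SunZT10}, and unsupervised methods\upcite{xiao2013unsupervised}.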

 %----------------------------------------------------------------------------------------
@@ -1288,16 +1291,18 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \subsubsection{2. Rule Extraction Based on Alignment Matrices}

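 \parinterval Like word alignment, node alignment also contains errors, which inevitably lead to rule extraction errors. Since a single alignment contains errors, can the system be shown more diverse alignments so that correct rules are more likely to be extracted? The answer is yes. In fact, phrase-based models already have methods that extract rules from multiple word alignments (such as $n$-best word alignments)\upcite{liu2009weighted}, which can improve phrase recall to some extent. Multiple node alignments can likewise be used in tree-to-tree rule extraction. However, simply using multiple alignments makes the running cost grow linearly, and even $n$-best alignments are not guaranteed to cover the correct alignment. An alternative is to use an alignment matrix for ``soft'' rule extraction.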
-
+\vspace{-1em}
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter8/Figures/figure-one-best-node-alignment-and-alignment-matrix}
+\setlength{\abovecaptionskip}{-0.5em}
 \caption{Tree-to-tree rule extraction using 1-best node alignment and a probabilistic node alignment matrix\upcite{xiao2013unsupervised}}
 \label{fig:8-36}
 \end{figure}
 %-------------------------------------------

+\vspace{-0.5em}
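 \parinterval An alignment matrix is a data structure that describes the strength of correspondence between the nodes of two syntax trees. Each cell of the matrix holds a number between 0 and 1. During rule extraction, every pair of nodes can be regarded as aligned, which makes it possible to extract many rules that $n$-best alignments cannot cover. Figure \ref{fig:8-36} shows an example of rule extraction with alignment matrices, where Matrix 1 represents the standard 1-best node alignment and Matrix 2 represents a probabilistic alignment matrix. Matrix 2 allows a more diverse set of rules to be extracted. It is also worth noting that the alignment-matrix approach applies equally to the extraction of phrase and hierarchical phrase rules. For how alignment matrices are generated, see the related papers\upcite{xiao2013unsupervised,liu2009weighted,sun2010exploring,DBLP:conf/coling/SunZT10}.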

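 \parinterval In addition, syntax-based rule extraction usually imposes some restrictions on rules so that the number of rules does not grow beyond what the system can handle. For example, one can limit the depth of tree fragments, the number of variables, and the number of rule compositions. These limits usually have to be designed and tuned for the specific task.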
@@ -1392,16 +1397,18 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \textrm{VP} \rightarrow \textrm{VV}\ \textrm{NP} \nonumber \\
 \textrm{NP} \rightarrow \textrm{NN}\ \textrm{NP} \nonumber
 \end{eqnarray}
-
+\vspace{-3.0em}
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter8/Figures/figure-example-of-hyper-graph}
+\setlength{\abovecaptionskip}{-0.1em}
 \caption{An example of a hypergraph}
 \label{fig:8-37}
 \end{figure}
 %-------------------------------------------

+\vspace{-1.0em}
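 \parinterval For the rule ``$\textrm{VP} \rightarrow \textrm{VV}\ \textrm{NP}$'', the head of the hyperedge points to VP, and the tail of the hyperedge represents the two variables VV and NP on the right-hand side of the rule. The rule ``$\textrm{NP} \rightarrow \textrm{NN}\ \textrm{NP}$'' can be interpreted in the same way.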

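 \parinterval It is easy to see that a hypergraph is a very compact data structure for representing multiple derivations, because different derivations can share nodes. If the blue and red parts of Figure \ref{fig:8-37} are viewed as two derivations, they share the same node NN[1,2], where NN is the syntactic label and [1,2] is the span. Simply enumerating all derivations of a sentence is practically impossible, yet a hypergraph can represent an exponential number of derivations efficiently. Moreover, operations on hypergraphs are often viewed as an algebraic system based on semirings, and many parsing and machine translation problems have been found to be, in essence, instances of {\small\bfnew{Semi-ring Parsing}}\index{Semi-ring Parsing}. Owing to space limits, semirings and related structures are not discussed further here. Interested readers can consult the literature\upcite{goodman1999semiring,eisner2002parameter}.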
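To make the node-sharing idea concrete, here is a minimal Python sketch of a hypergraph whose hyperedges have one head and a tuple of tail nodes. It is an illustration under assumptions rather than the structure in Figure 8-37: the class name `Hypergraph`, the method names, the second hyperedge ``NP -> NN NN'' and every span other than NN[1,2] are hypothetical. Counting derivations with memoization amounts to running the hypergraph under a counting semiring, and it assumes the hypergraph is acyclic.

```python
from collections import defaultdict
from functools import lru_cache

class Hypergraph:
    """Nodes are (label, start, end) tuples; each hyperedge has one head."""

    def __init__(self):
        # head node -> list of tails, each tail being a tuple of nodes
        self.incoming = defaultdict(list)

    def add_edge(self, head, tail):
        self.incoming[head].append(tuple(tail))

    def count_derivations(self, node):
        """Number of distinct derivations rooted at node (acyclic graphs only)."""
        @lru_cache(maxsize=None)
        def count(n):
            edges = self.incoming.get(n)
            if not edges:          # leaf node: exactly one (empty) derivation
                return 1
            total = 0
            for tail in edges:     # sum over alternative hyperedges
                product = 1
                for child in tail: # multiply over the children of one hyperedge
                    product *= count(child)
                total += product
            return total
        return count(node)

# Hypothetical toy example: NN[1,2] is shared by two hyperedges into NP[1,3].
g = Hypergraph()
VV, NN12, NN23, NP23 = ("VV", 0, 1), ("NN", 1, 2), ("NN", 2, 3), ("NP", 2, 3)
NP13, VP03 = ("NP", 1, 3), ("VP", 0, 3)
g.add_edge(NP13, [NN12, NP23])  # NP -> NN NP (rule from the text)
g.add_edge(NP13, [NN12, NN23])  # NP -> NN NN (hypothetical alternative)
g.add_edge(VP03, [VV, NP13])    # VP -> VV NP (rule from the text)
n_derivations = g.count_derivations(VP03)
```

The shared node NN[1,2] is stored once although it takes part in both derivations of VP[0,3], which is why the number of nodes can stay small while the number of derivations grows multiplicatively.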
@@ -1439,6 +1446,7 @@ d_1 & = & {d'} \circ {r_5}
 \begin{figure}[htp]
 \centering
 \input{./Chapter8/Figures/figure-hyper-graph-representation-of-machine-translation-derivation}
+\setlength{\abovecaptionskip}{-0.5em}
 \caption{Hypergraph representation of machine translation derivations}
 \label{fig:8-39}
 \end{figure}
@@ -1492,6 +1500,7 @@ d_1 & = & {d'} \circ {r_5}
 \begin{figure}[htp]
 \centering
 \input{./Chapter8/Figures/figure-role-of-syntax-tree-in-different-decoding-methods}
+\setlength{\abovecaptionskip}{-0.5em}
 \caption{The role of the syntax tree in different decoding methods}
 \label{fig:8-40}
 \end{figure}
@@ -1520,6 +1529,7 @@ d_1 & = & {d'} \circ {r_5}
 \begin{figure}[htp]
 \centering
 \input{./Chapter8/Figures/figure-content-of-chart-in-tree-based-decoding}
+\setlength{\abovecaptionskip}{-0.5em}
 \caption{Contents of the chart in tree-based decoding}
 \label{fig:8-41}
 \end{figure}
@@ -1531,6 +1541,7 @@ d_1 & = & {d'} \circ {r_5}
 \begin{figure}[htp]
 \centering
 \input{./Chapter8/Figures/figure-rule-matching-base-tree}
+\setlength{\abovecaptionskip}{-0.5em}
 \caption{Tree-based rule matching}
 \label{fig:8-42}
 \end{figure}