Merge branch 'caorunzhe' into 'master'

Caorunzhe: see merge request !105

Commit a2313fde authored May 09, 2020 by 曹润柘
--- a/Book/Chapter1/chapter1.tex
+++ b/Book/Chapter1/chapter1.tex
@@ -45,7 +45,7 @@
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
-\input{./Chapter1/Figures/figure-Required-parts-of-MT}
+\input{./Chapter1/Figures/figure-required-parts-of-mt}
    \caption{机器翻译系统的组成}
    \label{fig:1-2}
 \end{figure}
@@ -220,7 +220,7 @@
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
-\input{./Chapter1/Figures/figure-Example-RBMT}
+\input{./Chapter1/Figures/figure-example-rbmt}
 \setlength{\belowcaptionskip}{-1.5em}
    \caption{基于规则的机器翻译的示例图（左：规则库；右：规则匹配结果）}
    \label{fig:1-8}
@@ -290,7 +290,7 @@
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
-\input{./Chapter1/Figures/figure-Example-SMT}
+\input{./Chapter1/Figures/figure-example-smt}
    \caption{统计机器翻译的示例图（左：语料资源；中：翻译模型与语言模型；右：翻译假设与翻译引擎）}
    \label{fig:1-11}
 \end{figure}
@@ -311,7 +311,7 @@
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
-\input{./Chapter1/Figures/figure-Example-NMT}
+\input{./Chapter1/Figures/figure-example-nmt}
    \caption{神经机器翻译的示例图（左：编码器-解码器网络；右：编码器示例网络）}
    \label{fig:1-12}
 \end{figure}
@@ -727,7 +727,7 @@ His house is on the south bank of the river.
 \parinterval 《机器学习》\cite{周志华2016机器学习}由南京大学教授周志华教授所著，作为机器学习领域入门教材，该书尽可能地涵盖了机器学习基础知识的各个方面，试图尽可能少地使用数学知识介绍机器学习方法与思想。
-\parinterval 《统计学习方法》（{\red 参考文献}）由李航博士所著，该书对机器学习的有监督和无监督等方法进行了全面而系统的介绍。可以作为梳理机器学习的知识体系，同时了解相关基础概念的参考读物。
+\parinterval 《统计学习方法》\cite{李航2012统计学习方法}由李航博士所著，该书对机器学习的有监督和无监督等方法进行了全面而系统的介绍。可以作为梳理机器学习的知识体系，同时了解相关基础概念的参考读物。
 \parinterval 《神经网络与深度学习》\cite{邱锡鹏2020神经网络与深度学习}由复旦大学邱锡鹏教授所著，全面的介绍了神经网络和深度学习的基本概念和常用技术，同时涉及了许多深度学习的前沿方法。该书适合初学者阅读，同时又不失为一本面向专业人士的参考书。

--- a/Book/Chapter2/chapter2.tex
+++ b/Book/Chapter2/chapter2.tex
@@ -22,7 +22,7 @@
 \parinterval 语言分析部分将以汉语为例介绍词法和句法分析的基本概念。它们都是自然语言处理中的经典问题，而且在机器翻译中也会经常被使用。同样，本章会介绍这两个任务的定义和求解问题的思路。
-\parinterval 语言建模是机器翻译中最常用的一种技术，它主要用于句子的生成和流畅度评价。本章会以传统统计语言模型为例，对语言建模的相关概念进行介绍。但是，这里并不深入探讨语言模型技术，在后面的章节中还会有单独的内容对神经网络语言模型等前沿技术进行讨论。
+\parinterval 语言建模是机器翻译中最常用的一种技术，它主要用于句子的生成和流畅度评价。本章会以传统统计语言模型为例，对语言建模的相关概念进行介绍。但是，这里并不深入探讨语言模型技术，在后面的章节中还会单独对神经网络语言模型等前沿技术进行讨论。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -30,13 +30,13 @@
 \section{问题概述 }
-\parinterval 很多时候机器翻译系统被看作是孤立的``黑盒''系统（图 \ref {fig:2-1} (a)）。可以将一段文本作为输入送入机器翻译系统，之后得到翻译好的译文输出。但是真实的机器翻译系统要复杂的多。因为系统看到的输入和输出的实际上只是一些符号串，这些符号并没有任何其他意义，因此需要进一步对这些符号串进行处理才能更好的使用它们，比如，需要定义翻译中最基本的单元是什么？符号串是否还有结构信息？如何用数学工具刻画这些基本单元和结构？
+\parinterval 很多时候机器翻译系统被看作是孤立的``黑盒''系统（图 \ref {fig:2-1} (a)）。将一段文本作为输入送入机器翻译系统之后，系统输出翻译好的译文。但是真实的机器翻译系统非常复杂，因为系统看到的输入和输出的实际上只是一些符号串，这些符号并没有任何意义，因此需要进一步对这些符号串进行处理才能更好的使用它们，比如，需要定义翻译中最基本的单元是什么？符号串是否还有结构信息？如何用数学工具刻画这些基本单元和结构？
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
- 	\subfigure[机器翻译系统被看作一个黑盒] {\input{./Chapter2/Figures/figure-MT-system-as-a-black-box}  }
+ 	\subfigure[机器翻译系统被看作一个黑盒] {\input{./Chapter2/Figures/figure-mt-system-as-a-black-box}  }
- 	\subfigure[机器翻系统 = 前/后处理 + 翻译引擎] {\input{./Chapter2/Figures/figure-MT=language-analysis+translation-engine}}
+ 	\subfigure[机器翻系统 = 前/后处理 + 翻译引擎] {\input{./Chapter2/Figures/figure-mt=language-analysis+translation-engine}}
 	\caption{机器翻译系统的结构}
    \label{fig:2-1}
 \end{figure}
@@ -63,7 +63,7 @@
 \parinterval 类似的，机器翻译输出的结果也可以包含同样的信息。甚至系统输出英文译文之后，还有一个额外的步骤来把部分英文单词的大小写恢复出来，比如，上例中句首单词Cats的首字母要大写。
-\parinterval 一般来说，在送入机器翻译系统前需要对文字序列进行处理和加工，这个过程被称为{\small\sffamily\bfseries{预处理}}\index{预处理}（Pre-processing）\index{Pre-processing}。同理，在机器翻译模型输出译文后的处理作被称作{\small\sffamily\bfseries{后处理}}\index{后处理}（Post-processing）\index{Post-processing}。这两个过程对机器翻译性能影响很大，比如，在神经机器翻译里，不同的分词策略可能会造成翻译性能的天差地别。
+\parinterval 一般来说，在送入机器翻译系统前需要对文字序列进行处理和加工，这个过程被称为{\small\sffamily\bfseries{预处理}}\index{预处理}（Pre-processing）\index{Pre-processing}。同理，在机器翻译模型输出译文后进行的处理被称作{\small\sffamily\bfseries{后处理}}\index{后处理}（Post-processing）\index{Post-processing}。这两个过程对机器翻译性能影响很大，比如，在神经机器翻译里，不同的分词策略可能会造成翻译性能的天差地别。
 \parinterval 值得注意的是，有些观点认为，不论是分词还是句法分析，对于机器翻译来说并不要求符合人的认知和语言学约束。换句话说，机器翻译所使用的``单词''和``结构''本身并不是为了符合人类的解释，它们更直接目的是为了进行翻译。从系统开发的角度，有时候即使进行一些与人类的语言习惯有差别的处理，仍然会带来性能的提升，比如在神经机器翻译中，在传统分词的基础上进一步使用双字节编码（Byte Pair Encoding，BPE）子词切分会使得机器翻译性能大幅提高。当然，自然语言处理中语言学信息的使用一直是学界关注的焦点。甚至关于语言学结构对机器翻译是否有作用这个问题也有争论。但是不能否认的是，无论是语言学的知识，还是计算机自己学习到的知识，对机器翻译都是有价值的。在后续章节会看到，这两种类型的知识对机器翻译帮助很大 \footnote[1]{笔者并不认同语言学结构对机器翻译的帮助有限，相反机器翻译需要更多的人类先验知识的指导。当然，这个问题不是这里讨论的重点。} 。
@@ -125,7 +125,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
- \input{./Chapter2/Figures/figure-Probability-density-function&Distribution-function}
+ \input{./Chapter2/Figures/figure-probability-density-function&distribution-function}
 \caption{一个概率密度函数(左)与其对应的分布函数(右)}
 \label{fig:2-3}
 \end{figure}
@@ -221,7 +221,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \parinterval 根据图\ref {fig:2-5} 易知$E$只和$C$有关，所以$\textrm{P}(E \mid A,B,C,D)=\textrm{P}(E \mid C)$；$D$不依赖于其他事件，所以$\textrm{P}(D \mid A,B,C)=\textrm{P}(D)$；$C$只和$B$、$D$有关，所以$\textrm{P}(C \mid A,B)=\textrm{P}(C \mid B)$；$B$不依赖于其他事件，所以$\textrm{P}(B \mid  A)=\textrm{P}(B)$。最终化简可得：
 \begin{eqnarray}
-\textrm{P}(A,B,C,D,E)=\textrm{P}(E \mid C) \cdot \textrm{P}(D) \cdot \textrm{P}(C \mid B) \cdot \textrm{P}(B)
+\textrm{P}(A,B,C,D,E)=\textrm{P}(E \mid C) \cdot \textrm{P}(D) \cdot \textrm{P}(C \mid B) \cdot \textrm{P}(B)\cdot \textrm{P}(A \mid B)
 \label{eq:2-8}
 \end{eqnarray}
@@ -310,7 +310,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
-\input{./Chapter2/Figures/figure-Self-information-function}
+\input{./Chapter2/Figures/figure-self-information-function}
 \caption{自信息函数$\textrm{I}(x)$关于$\textrm{P}(x)$的曲线}
 \label{fig:2-6}
 \end{figure}
@@ -360,7 +360,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \label{eq:2-16}
 \end{eqnarray}
-\parinterval 结合相对熵公式可知，交叉熵是KL距离公式中的右半部分。因此，求关于Q的交叉熵的最小值等价于求KL距离的最小值。从实践的角度来说，交叉熵与KL距离的目的相同：都是用来描述两个分布的差异，由于交叉熵计算上更加直观方便，因此在机器翻译中被广泛应用。
+\parinterval 结合相对熵公式可知，交叉熵是KL距离公式中的右半部分。因此，当概率分布P$(x)$固定时，求关于Q的交叉熵的最小值等价于求KL距离的最小值。从实践的角度来说，交叉熵与KL距离的目的相同：都是用来描述两个分布的差异，由于交叉熵计算上更加直观方便，因此在机器翻译中被广泛应用。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
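Note on the hunk above: the condition added in this change (the distribution P(x) being fixed) can be checked with a short derivation. This is a sketch only. It assumes the usual definitions of entropy, cross-entropy, and KL distance, which are not visible in this hunk, so the symbols H(P), H(P,Q), and D_KL are assumed notation rather than the chapter's confirmed notation.

```latex
\begin{eqnarray}
\textrm{D}_{\textrm{KL}}(\textrm{P} \parallel \textrm{Q})
  & = & \sum_{x} \textrm{P}(x) \log \frac{\textrm{P}(x)}{\textrm{Q}(x)} \nonumber \\
  & = & \underbrace{\sum_{x} \textrm{P}(x) \log \textrm{P}(x)}_{-\textrm{H}(\textrm{P})}
        \; \underbrace{- \sum_{x} \textrm{P}(x) \log \textrm{Q}(x)}_{\textrm{H}(\textrm{P},\textrm{Q})} \nonumber
\end{eqnarray}
```

The first term does not depend on Q. With P fixed it is a constant, so minimizing the cross-entropy H(P,Q) over Q also minimizes the KL distance, which is what the added condition makes explicit.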
@@ -429,7 +429,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
-\input{./Chapter2/Figures/figure-Example-of-word-segmentation-based-on-dictionary}
+\input{./Chapter2/Figures/figure-example-of-word-segmentation-based-on-dictionary}
 \caption{基于词典进行分词的实例}
 \label{fig:2-8}
 \end{figure}
@@ -638,7 +638,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
-\input{./Chapter2/Figures/figure-examples-of-Chinese-word-segmentation-based-on-1-gram-model}
+\input{./Chapter2/Figures/figure-examples-of-chinese-word-segmentation-based-on-1-gram-model}
 \caption{基于1-gram语言模型的中文分词实例}
 \label{fig:2-17}
 \end{figure}

--- a/Book/Chapter3/Chapter3.tex
+++ b/Book/Chapter3/Chapter3.tex
@@ -170,7 +170,7 @@
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
-\input{./Chapter3/Figures/figure-processes-SMT}
+\input{./Chapter3/Figures/figure-processes-smt}
    \caption{简单的统计机器翻译流程}
    \label{fig:3-5}
 \end{figure}
@@ -472,7 +472,7 @@ g(\mathbf{s},\mathbf{t}) \equiv \prod_{j,i \in \widehat{A}}{\textrm{P}(s_j,t_i)}
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
-\input{./Chapter3/Figures/figure-greedy-MT-decoding-pseudo-code}
+\input{./Chapter3/Figures/figure-greedy-mt-decoding-pseudo-code}
    \caption{贪婪的机器翻译解码算法的伪代码}
    \label{fig:3-10}
 \end{figure}
@@ -483,8 +483,8 @@ g(\mathbf{s},\mathbf{t}) \equiv \prod_{j,i \in \widehat{A}}{\textrm{P}(s_j,t_i)}
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
-\subfigure{\input{./Chapter3/Figures/greedy-MT-decoding-process-1}}
+\subfigure{\input{./Chapter3/Figures/greedy-mt-decoding-process-1}}
-\subfigure{\input{./Chapter3/Figures/greedy-MT-decoding-process-3}}
+\subfigure{\input{./Chapter3/Figures/greedy-mt-decoding-process-3}}
 \setlength{\belowcaptionskip}{14.0em}
    \caption{贪婪的机器翻译解码过程实例}
    \label{fig:3-11}

--- a/Book/Chapter4/Figures/grid-search-1.tex
+++ b/Book/Chapter4/Figures/grid-search-1.tex
@@ -29,7 +29,7 @@
 \node [anchor=center,draw,circle,inner sep=1.5pt,blue!30,fill=blue!30] (f11) at ([xshift=0em,yshift=23em]y2.north) {};
 \node[anchor=south] (f12) at ([xshift=5em,yshift=-0.5em]f11.south) {\scriptsize{fixed}};
-\node [anchor=center,draw,circle,inner sep=1.5pt,purple!30,fill=ugreen!50] (f21) at ([xshift=0em,yshift=-4em]f11.north) {};
+\node [anchor=center,draw,circle,inner sep=1.5pt,ugreen!50,fill=ugreen!50] (f21) at ([xshift=0em,yshift=-4em]f11.north) {};
 \node[anchor=south] (f22) at ([xshift=8.5em,yshift=-0.5em]f21.south) {\scriptsize{valid choices}};
 \node [anchor=center,draw,circle,inner sep=1.5pt,red!30,fill=red!30] (f31) at ([xshift=0em,yshift=-4em]f21.north) {};
 \node[anchor=south] (f32) at ([xshift=9.5em,yshift=-0.5em]f31.south) {\scriptsize{invalid choices}};

--- a/Book/Chapter4/Figures/grid-search-2.tex
+++ b/Book/Chapter4/Figures/grid-search-2.tex
@@ -26,7 +26,7 @@
 \node [anchor=center,draw,circle,inner sep=1.5pt,red!30,fill=red!30] (r33) at (2,2) {};
 \node [anchor=center,draw,circle,inner sep=1.5pt,red!30,fill=red!30] (r35) at (2,1) {};
-\node [anchor=center,draw,circle,inner sep=1.5pt,purple!30,fill=purple!30] (r34) at (2,3) {};
+\node [anchor=center,draw,circle,inner sep=1.5pt,ugreen!50,fill=ugreen!50] (r34) at (2,3) {};
 \draw [-,very thick,red!50, dashed] (1,2) -- (2,4) -- (3,2) -- (2,3) -- (1,2) -- (3,2) -- (2,1) -- (1,2) -- (2,0) -- (3,2);
 \draw [-,very thick,blue!50] (0,1) -- (1,2);

--- a/Book/Chapter4/Figures/one-best-node-alignment-and-alignment-matrix.tex
+++ b/Book/Chapter4/Figures/one-best-node-alignment-and-alignment-matrix.tex
@@ -105,7 +105,7 @@
 \end{flushright}
 \begin{center}
 \vspace{-1em}
-(a)节点对齐矩阵（1-best vs. Matrix）
+\footnotesize{(a)节点对齐矩阵（1-best vs. Matrix）}
 \end{center}
 \begin{center}
@@ -147,7 +147,7 @@
 \begin{center}
 \vspace{-2em}
-(b) 抽取得到的树到树翻译规则
+\footnotesize{(b) 抽取得到的树到树翻译规则}
 \end{center}
 \end{center}
--- a/Book/Chapter4/chapter4.tex
+++ b/Book/Chapter4/chapter4.tex
@@ -1653,7 +1653,7 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \subsubsection{树到串翻译规则}
-\parinterval 基于树结构的文法可以很好的表示两个树片段之间的对应关系，即树到树翻译规则。那树到串翻译规则该如何表示呢？实际上，基于树结构的文法也同样适用于树到串模型。比如，如下是一个树片段到串的映射，它可以被看作是树到串规则的一种表示。
+\parinterval 基于树结构的文法可以很好的表示两个树片段之间的对应关系，即树到树翻译规则。那树到串翻译规则该如何表示呢？实际上，基于树结构的文法也同样适用于树到串模型。比如，图\ref{fig:4-49}是一个树片段到串的映射，它可以被看作是树到串规则的一种表示。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -2162,7 +2162,7 @@ d_1 = {d'} \circ {r_5}
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
-\input{./Chapter4/Figures/structure-of-Chart}
+\input{./Chapter4/Figures/structure-of-chart}
 \caption{Chart结构}
 \label{fig:4-65}
 \end{figure}
@@ -2252,7 +2252,7 @@ d_1 = {d'} \circ {r_5}
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
-\input{./Chapter4/Figures/content-of-Chart-in-tree-based-decoding}
+\input{./Chapter4/Figures/content-of-chart-in-tree-based-decoding}
 \caption{基于树的解码中Chart的内容}
 \label{fig:4-68}
 \end{figure}

--- a/Book/Chapter5/chapter5.tex
+++ b/Book/Chapter5/chapter5.tex
--- a/Book/Chapter6/Chapter6.tex
+++ b/Book/Chapter6/Chapter6.tex
@@ -232,7 +232,7 @@ NMT                     & $ 21.7^{\ast}$          & $18.7^{\ast}$           & -1
 \begin{figure}[htp]
 \centering
 \input{./Chapter6/Figures/figure-encoder-decoder-process}
-\caption{ encoder-decoder过程 }
+\caption{ Encoder-Decoder过程 }
 \label{fig:6-5}
 \end{figure}
 %----------------------------------------------
@@ -252,7 +252,7 @@ NMT                     & $ 21.7^{\ast}$          & $18.7^{\ast}$           & -1
 % 图3.6
 \begin{figure}[htp]
    \centering
-    \input{./Chapter6/Figures/figure-Presentation-space}
+    \input{./Chapter6/Figures/figure-presentation-space}
    \caption{统计机器翻译和神经机器翻译的表示空间}
    \label{fig:6-6}
 \end{figure}
@@ -288,7 +288,7 @@ NMT                     & $ 21.7^{\ast}$          & $18.7^{\ast}$           & -1
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-A-working-example-of-neural-machine-translation}
+\input{./Chapter6/Figures/figure-a-working-example-of-neural-machine-translation}
 \caption{神经机器翻译的运行实例}
 \label{fig:6-7}
 \end{figure}
@@ -384,19 +384,19 @@ NMT                     & $ 21.7^{\ast}$          & $18.7^{\ast}$           & -1
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Structure-of-a-recurrent-network-model}
+\input{./Chapter6/Figures/figure-structure-of-a-recurrent-network-model}
-\caption{循环网络模型的结构}
+\caption{循环神经网络处理序列的实例}
 \label{fig:6-9}
 \end{figure}
 %----------------------------------------------
-\parinterval 在神经机器翻译里使用循环神经网络也很简单。我们只需要把源语言句子和目标语言句子分别看作两个序列，之后使用两个循环神经网络分别对其进行建模。这种网络结构如图\ref{fig:6-10}所示。图中，下半部分是编码器，上半部分是解码器。编码器利用循环神经网络对源语言序列逐词进行编码处理，同时利用循环单元的记忆能力，不断累积序列信息，遇到终止符<eos>后便得到了包含源语言句子全部信息的表示结果。解码器利用编码器的输出和起始符<sos>开始逐词的进行解码，即逐词翻译，每得到一个译文单词，便将其作为当前时刻解码端循环单元的输入，这也是一个典型的神经语言模型的序列生成过程。解码器通过循环神经网络不断的累积已经得到的译文的信息，并继续生成下一个单词，直到遇到结束符<eos>，便得到了最终完整的译文。
+\parinterval 在神经机器翻译里使用循环神经网络也很简单。我们只需要把源语言句子和目标语言句子分别看作两个序列，之后使用两个循环神经网络分别对其进行建模。这个过程如图\ref{fig:6-10}所示。图中，下半部分是编码器，上半部分是解码器。编码器利用循环神经网络对源语言序列逐词进行编码处理，同时利用循环单元的记忆能力，不断累积序列信息，遇到终止符<eos>后便得到了包含源语言句子全部信息的表示结果。解码器利用编码器的输出和起始符<sos>开始逐词的进行解码，即逐词翻译，每得到一个译文单词，便将其作为当前时刻解码端循环单元的输入，这也是一个典型的神经语言模型的序列生成过程。解码器通过循环神经网络不断的累积已经得到的译文的信息，并继续生成下一个单词，直到遇到结束符<eos>，便得到了最终完整的译文。
 %----------------------------------------------
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Model-structure-based-on-recurrent-neural-network-translation}
+\input{./Chapter6/Figures/figure-model-structure-based-on-recurrent-neural-network-translation}
 \caption{基于循环神经网络翻译的模型结构}
 \label{fig:6-10}
 \end{figure}
@@ -480,22 +480,22 @@ $\textrm{P}({y_j | \mathbf{s}_{j-1} ,y_{j-1},\mathbf{C}})$由Softmax实现，Sof
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Word-embedding-structure}
+\input{./Chapter6/Figures/figure-word-embedding-structure}
-\caption{词嵌入层结构}
+\caption{词嵌入的生成过程}
 \label{fig:6-12}
 \end{figure}
 %----------------------------------------------
 \parinterval 需要注意的是，在上面这个过程中One-hot表示和词嵌入矩阵并不必须调用矩阵乘法才得到词嵌入结果。只需要获得One-hot向量中1对应的索引，从词嵌入矩阵中取出对应的行即可。这种利用索引``取''结果的方式避免了计算代价较高的矩阵乘法，因此在实际系统中很常用。
-\parinterval 在解码端，需要在每个位置预测输出的单词。在循环神经网络中，每一时刻循环单元的输出向量为$\mathbf{s}_j$，它可以被看作这个时刻的目标语单词的一种表示，但是我们无法根据这个向量得出要生成的目标语单词的概率。而输出层的目的便是通过向量$\mathbf{s}_j$计算词表中每个单词的生成概率，进而选取概率最高的单词作为当前时刻的输出。图\ref{fig:6-13}展示了一个输出层的运行实例。
+\parinterval 在解码端，需要在每个位置预测输出的单词。在循环神经网络中，每一时刻循环单元的输出向量为$\mathbf{s}_j$，它可以被看作这个时刻的目标语单词的一种表示，但是我们无法根据这个向量得出要生成的目标语单词的概率。而输出层的目的便是通过向量$\mathbf{s}_j$计算词表中每个单词的生成概率，进而选取概率最高的单词作为当前时刻的输出。图\ref{fig:6-13}展示了一个输出层进行单词预测的实例。
 %----------------------------------------------
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Output-layer-structur}
+\input{./Chapter6/Figures/figure-output-layer-structur}
-\caption{输出层结构}
+\caption{输出层的预测过程}
 \label{fig:6-13}
 \end{figure}
 %----------------------------------------------
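Note on the hunk above: the unchanged paragraph says that the word embedding can be read out by index instead of multiplying the one-hot vector by the embedding matrix. A minimal sketch of that equivalence follows. The function names and the toy 4-word, 3-dimensional matrix are hypothetical and chosen only for illustration.

```python
def one_hot(index, vocab_size):
    """Return a one-hot list with a 1.0 at `index`."""
    return [1.0 if k == index else 0.0 for k in range(vocab_size)]


def embed_by_matmul(one_hot_vec, embedding):
    """Multiply a one-hot row vector by the embedding matrix (one row per word)."""
    dim = len(embedding[0])
    return [
        sum(one_hot_vec[k] * embedding[k][d] for k in range(len(embedding)))
        for d in range(dim)
    ]


def embed_by_lookup(index, embedding):
    """Take the row of the embedding matrix selected by the word index."""
    return list(embedding[index])


# Hypothetical toy embedding matrix: 4 words, 3 dimensions.
EMBEDDING = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
```

Both functions return the same vector for the same word. The lookup touches a single row, whereas the product visits every entry of the matrix.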
@@ -525,7 +525,7 @@ $\textrm{P}({y_j | \mathbf{s}_{j-1} ,y_{j-1},\mathbf{C}})$由Softmax实现，Sof
 \begin{figure}[htp]
 \centering
 % \includegraphics[scale=0.7]{./Chapter6/Figures/Softmax.png}
-\input{./Chapter6/Figures/figure-Softmax}
+\input{./Chapter6/Figures/figure-softmax}
 \caption{ Softmax函数（一维）所对应的曲线}
 \label{fig:6-14}
 \end{figure}
@@ -680,7 +680,7 @@ $\textrm{P}({y_j | \mathbf{s}_{j-1} ,y_{j-1},\mathbf{C}})$由Softmax实现，Sof
 \begin{figure}[htp]
 \centering
 \input{./Chapter6/Figures/figure-bi-RNN}
-\caption{双向循环神经网络}
+\caption{基于双向循环神经网络的机器翻译模型结构}
 \label{fig:6-18}
 \end{figure}
 %----------------------------------------------
@@ -697,8 +697,8 @@ $\textrm{P}({y_j | \mathbf{s}_{j-1} ,y_{j-1},\mathbf{C}})$由Softmax实现，Sof
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Double-layer-RNN} \hspace{10em}
+\input{./Chapter6/Figures/figure-double-layer-RNN} \hspace{10em}
-\caption{双层循环神经网络}
+\caption{基于双层循环神经网络的机器翻译模型结构}
 \label{fig:6-19}
 \end{figure}
 %----------------------------------------------
@@ -744,7 +744,7 @@ $\textrm{P}({y_j | \mathbf{s}_{j-1} ,y_{j-1},\mathbf{C}})$由Softmax实现，Sof
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Attention-of-source-and-target-words}
+\input{./Chapter6/Figures/figure-attention-of-source-and-target-words}
 \caption{源语词和目标语词的关注度}
 \label{fig:6-21}
 \end{figure}
@@ -758,7 +758,7 @@ $\textrm{P}({y_j | \mathbf{s}_{j-1} ,y_{j-1},\mathbf{C}})$由Softmax实现，Sof
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-encoder-decoder-with-Attention}
+\input{./Chapter6/Figures/figure-encoder-decoder-with-attention}
 \caption{不使用(a)和使用(b)注意力机制的翻译模型对比}
 \label{fig:6-22}
 \end{figure}
@@ -780,7 +780,7 @@ $\textrm{P}({y_j | \mathbf{s}_{j-1} ,y_{j-1},\mathbf{C}})$由Softmax实现，Sof
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Calculation-process-of-context-vector-C}
+\input{./Chapter6/Figures/figure-calculation-process-of-context-vector-C}
 \caption{上下文向量$\mathbf{C}_j$的计算过程}
 \label{fig:6-23}
 \end{figure}
@@ -824,20 +824,20 @@ a (\mathbf{s},\mathbf{h}) =  \left\{ \begin{array}{ll}
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Matrix-Representation-of-Attention-Weights-Between-Chinese-English-Sentence-Pairs}
+\input{./Chapter6/Figures/figure-matrix-representation-of-attention-weights-between-chinese-english-sentence-pairs}
 \caption{一个汉英句对之间的注意力权重{$\alpha_{i,j}$}的矩阵表示}
 \label{fig:6-24}
 \end{figure}
 %----------------------------------------------
 \end{itemize}
-\parinterval 图\ref{fig:6-25}展示了一个上下文向量的计算过程。首先，计算目标语第一个单词``Have''与源语中的所有单词的相关性，即注意力权重，对应图中第一列$\alpha_{i,1}$，则当前时刻所使用的上下文向量$\mathbf{C}_1 = \sum_{i=1}^8 \alpha_{i,1} \mathbf{h}_i$；然后，计算第二个单词``you''的注意力权重对应第二列$\alpha_{i,2}$，其上下文向量$\mathbf{C}_2 = \sum_{i=1}^8 \alpha_{i,2} \mathbf{h}_i$，以此类推，可以得到任意目标语位置$j$的上下文向量$\mathbf{C}_j$。很容易看出，不同目标语单词的上下文向量对应的源语言词的权重$\alpha_{i,j}$是不同的，不同的注意力权重为不同位置赋予了不同重要性，对应了注意力机制的思想。
+\parinterval 图\ref{fig:6-25}展示了一个上下文向量的计算过程实例。首先，计算目标语第一个单词``Have''与源语中的所有单词的相关性，即注意力权重，对应图中第一列$\alpha_{i,1}$，则当前时刻所使用的上下文向量$\mathbf{C}_1 = \sum_{i=1}^8 \alpha_{i,1} \mathbf{h}_i$；然后，计算第二个单词``you''的注意力权重对应第二列$\alpha_{i,2}$，其上下文向量$\mathbf{C}_2 = \sum_{i=1}^8 \alpha_{i,2} \mathbf{h}_i$，以此类推，可以得到任意目标语位置$j$的上下文向量$\mathbf{C}_j$。很容易看出，不同目标语单词的上下文向量对应的源语言词的权重$\alpha_{i,j}$是不同的，不同的注意力权重为不同位置赋予了不同重要性，对应了注意力机制的思想。
 %----------------------------------------------
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Example-of-context-vector-calculation-process}
+\input{./Chapter6/Figures/figure-example-of-context-vector-calculation-process}
 \caption{上下文向量计算过程实例}
 \label{fig:6-25}
 \end{figure}
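Note on the hunk above: the paragraph computes each context vector as a weighted sum of encoder states, C_j = sum_i alpha_{i,j} h_i. The sketch below shows that sum for one target position. It uses a hypothetical toy input with 3 source positions and 2-dimensional states instead of the 8 positions in the book's figure, and the function name is an assumption.

```python
def context_vector(alpha_column, encoder_states):
    """Weighted sum of encoder states for one target position j.

    alpha_column[i] plays the role of alpha_{i,j};
    encoder_states[i] plays the role of h_i.
    """
    if len(alpha_column) != len(encoder_states):
        raise ValueError("need one attention weight per source position")
    dim = len(encoder_states[0])
    return [
        sum(a * h[d] for a, h in zip(alpha_column, encoder_states))
        for d in range(dim)
    ]


# Hypothetical toy values: 3 source positions, 2-dimensional states.
STATES = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
WEIGHTS = [0.5, 0.25, 0.25]  # one column of the attention matrix, sums to 1
```

A different column of weights over the same states gives a different context vector, which is the point the paragraph makes about position-specific attention.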
@@ -872,14 +872,14 @@ a (\mathbf{s},\mathbf{h}) =  \left\{ \begin{array}{ll}
 \parinterval 那么，如何理解这个过程？注意力机制的本质又是什么呢？换一个角度来看，实际上，目标语位置$j$本质上是一个查询，我们希望从源语言端找到与之最匹配的源语言位置，并返回相应的表示结果。为了描述这个问题，可以建立一个查询系统。假设有一个库，里面包含若干个$\mathrm{key}$-$\mathrm{value}$单元，其中$\mathrm{key}$代表这个单元的索引关键字，$\mathrm{value}$代表这个单元的值。比如，对于学生信息系统，$\mathrm{key}$可以是学号，$\mathrm{value}$可以是学生的身高。当输入一个查询$\mathrm{query}$，我们希望这个系统返回与之最匹配的结果。也就是，希望找到匹配的$\mathrm{key}$，并输出其对应的$\mathrm{value}$。比如，当查询某个学生的身高信息时，可以输入学生的学号，之后在库中查询与这个学号相匹配的记录，并把这个记录中的$\mathrm{value}$（即身高）作为结果返回。
-\parinterval 图\ref{fig:6-26}(a)展示了一个这样的查询系统。里面包含四个$\mathrm{key}$-$\mathrm{value}$单元，当输入查询$\mathrm{query}$，就把$\mathrm{query}$与这四个$\mathrm{key}$逐个进行匹配，如果完全匹配就返回相应的$\mathrm{value}$。在图中的例子中，$\mathrm{query}$和$\mathrm{key}_3$是完全匹配的（因为都是横纹），因此系统返回第三个单元的值，即$\mathrm{value}_3$。当然，如果库中没有与$\mathrm{query}$匹配的$\mathrm{key}$，则返回一个空结果。
+\parinterval 图\ref{fig:6-26}展示了一个这样的查询系统。里面包含四个$\mathrm{key}$-$\mathrm{value}$单元，当输入查询$\mathrm{query}$，就把$\mathrm{query}$与这四个$\mathrm{key}$逐个进行匹配，如果完全匹配就返回相应的$\mathrm{value}$。在图中的例子中，$\mathrm{query}$和$\mathrm{key}_3$是完全匹配的（因为都是横纹），因此系统返回第三个单元的值，即$\mathrm{value}_3$。当然，如果库中没有与$\mathrm{query}$匹配的$\mathrm{key}$，则返回一个空结果。
 %----------------------------------------------
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Query-model-corresponding-to-traditional-query-model-vs-attention-mechanism}
+\input{./Chapter6/Figures/figure-query-model-corresponding-to-traditional-query-model-vs-attention-mechanism}
-\caption{传统查询模型(a)和注意力机制所对应的查询模型(b)}
+\caption{传统查询模型}
 \label{fig:6-26}
 \end{figure}
 %----------------------------------------------
@@ -898,7 +898,7 @@ a (\mathbf{s},\mathbf{h}) =  \left\{ \begin{array}{ll}
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Query-model-corresponding-to-attention-mechanism}
+\input{./Chapter6/Figures/figure-query-model-corresponding-to-attention-mechanism}
 \caption{注意力机制所对应的查询模型}
 \label{fig:6-27}
 \end{figure}
@@ -993,7 +993,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 \noindent 其中$\gamma$是手工设定的梯度大小阈值， $\| \cdot \|_2$是L2范数，$\mathbf{w}'$表示梯度裁剪后的参数。这个公式的含义在于只要梯度大小超过阈值，就按照阈值与当前梯度大小的比例进行放缩。
 %%%%%%%%%%%%%%%%%%
 \subsubsection{学习率策略}
-\parinterval 在公式\ref{eq:6-30}中， $\alpha$决定了每次参数更新时更新的步幅大小，称之为{\small\bfnew{学习率}}\index{学习率}（Learning Rate）\index{Learning Rate}。学习率作为基于梯度方法中的重要超参数，它决定目标函数能否收敛到较好的局部最优点以及收敛的速度。合理的学习率能够使模型快速、稳定地达到较好的状态。但是，如果学习率太小，收敛过程会很慢；而学习率太大，则模型的状态可能会出现震荡，很难达到稳定，甚至使模型无法收敛。图\ref{fig:6-28} 对比了不同学习率对损失函数的影响。
+\parinterval 在公式\ref{eq:6-30}中， $\alpha$决定了每次参数更新时更新的步幅大小，称之为{\small\bfnew{学习率}}\index{学习率}（Learning Rate）\index{Learning Rate}。学习率作为基于梯度方法中的重要超参数，它决定目标函数能否收敛到较好的局部最优点以及收敛的速度。合理的学习率能够使模型快速、稳定地达到较好的状态。但是，如果学习率太小，收敛过程会很慢；而学习率太大，则模型的状态可能会出现震荡，很难达到稳定，甚至使模型无法收敛。图\ref{fig:6-28} 对比了不同学习率对优化过程的影响。
 %----------------------------------------------
 % 图3.10
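Note on the hunk above: the first context line describes gradient clipping in words: once the L2 norm exceeds the threshold gamma, the vector is rescaled by the ratio of the threshold to the current norm. The clipping formula itself lies outside this hunk, so the sketch below is written from that verbal description only. The function name is an assumption.

```python
import math


def clip_by_norm(grad, gamma):
    """Rescale `grad` so that its L2 norm does not exceed `gamma`.

    Vectors whose norm is already within the threshold are returned unchanged.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= gamma:
        return list(grad)
    scale = gamma / norm  # ratio of threshold to current norm
    return [g * scale for g in grad]
```

For example, a vector with norm 5 and a threshold of 1 is scaled by 0.2, which keeps its direction and shortens it to the threshold.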
@@ -1012,7 +1012,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Relationship-between-learning-rate-and-number-of-updates}
+\input{./Chapter6/Figures/figure-relationship-between-learning-rate-and-number-of-updates}
 \caption{学习率与更新次数的变化关系}
 \label{fig:6-29}
 \end{figure}
@@ -1026,7 +1026,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 \end{eqnarray}
 %-------
-\noindent 另一方面，当模型训练逐渐接近收敛的时候，使用太大学习率会很容易让模型在局部最优解附近震荡，从而错过局部极小，因此需要通过减小学习率来调整更新的步长，以此来不断的逼近局部最优，这一阶段也称为学习率的衰减阶段。学习率衰减的方法有很多，比如指数衰减，余弦衰减等，图\ref{fig:6-29}展示的是{\small\bfnew{分段常数衰减}}\index{分段常数衰减}（Piecewise Constant Decay）\index{Piecewise Constant Decay}，即每经过$m$次更新，学习率衰减为原来的$\beta_m$（$\beta_m<1$）倍，其中$m$和$\beta_m$为经验设置的超参。
+\noindent 另一方面，当模型训练逐渐接近收敛的时候，使用太大学习率会很容易让模型在局部最优解附近震荡，从而错过局部极小，因此需要通过减小学习率来调整更新的步长，以此来不断的逼近局部最优，这一阶段也称为学习率的衰减阶段。学习率衰减的方法有很多，比如指数衰减，余弦衰减等，图\ref{fig:6-29}右侧展示的是{\small\bfnew{分段常数衰减}}\index{分段常数衰减}（Piecewise Constant Decay）\index{Piecewise Constant Decay}，即每经过$m$次更新，学习率衰减为原来的$\beta_m$（$\beta_m<1$）倍，其中$m$和$\beta_m$为经验设置的超参。
 %%%%%%%%%%%%%%%%%%
 \subsubsection{并行训练}
 \parinterval 机器翻译是自然语言处理中很``重''的任务。因为数据量巨大而且模型较为复杂，模型训练的时间往往很长。比如，使用一千万句的训练数据，性能优异的系统往往需要几天甚至一周的时间。更大规模的数据会导致训练时间更长。特别是使用多层网络同时增加模型容量时（比如增加隐层宽度）时，神经机器翻译的训练会更加缓慢。对于这个问题，一个思路是从模型训练算法上进行改进。比如前面提到的Adam就是一种高效的训练策略。另一种思路是利用多设备进行加速，也称作分布式训练。
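Note on the hunk above: the changed paragraph describes piecewise constant decay, in which the learning rate is multiplied by beta_m (with beta_m < 1) after every m updates. A minimal sketch of that schedule follows. The function name, the argument names, and the convention that steps are counted from 0 are assumptions.

```python
def piecewise_constant_lr(initial_lr, step, m, beta):
    """Learning rate after `step` updates when it decays by `beta` every `m` updates."""
    if m <= 0:
        raise ValueError("m must be a positive number of updates")
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return initial_lr * beta ** (step // m)
```

The rate stays flat inside each block of m updates and drops at the block boundary, which gives the staircase shape the paragraph refers to on the right of the figure.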
@@ -1054,7 +1054,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Data-parallel-process}
+\input{./Chapter6/Figures/figure-data-parallel-process}
 \caption{数据并行过程}
 \label{fig:6-30}
 \end{figure}
@@ -1112,7 +1112,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Decoding-process-based-on-greedy-method}
+\input{./Chapter6/Figures/figure-decoding-process-based-on-greedy-method}
 \caption{基于贪婪方法的解码过程}
 \label{fig:6-32}
 \end{figure}
@@ -1124,7 +1124,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Decode-the-word-probability-distribution-at-the-first-position}
+\input{./Chapter6/Figures/figure-decode-the-word-probability-distribution-at-the-first-position}
 \caption{解码第一个位置输出的单词概率分布（``Have''的概率最高）}
 \label{fig:6-33}
 \end{figure}
@@ -1147,7 +1147,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Beam-search-process}
+\input{./Chapter6/Figures/figure-beam-search-process}
 \caption{束搜索过程}
 \label{fig:6-34}
 \end{figure}
@@ -1285,7 +1285,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Dependencies-between-words-in-a-recurrent-neural-network}
+\input{./Chapter6/Figures/figure-dependencies-between-words-in-a-recurrent-neural-network}
 \caption{循环神经网络中单词之间的依赖关系}
 \label{fig:6-36}
 \end{figure}
@@ -1297,7 +1297,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Dependencies-between-words-of-Attention}
+\input{./Chapter6/Figures/figure-dependencies-between-words-of-attention}
 \caption{自注意力机制中单词之间的依赖关系}
 \label{fig:6-37}
 \end{figure}
@@ -1309,7 +1309,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Example-of-self-attention-mechanism-calculation}
+\input{./Chapter6/Figures/figure-example-of-self-attention-mechanism-calculation}
 \caption{自注意力计算实例}
 \label{fig:6-38}
 \end{figure}
@@ -1377,13 +1377,13 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{位置编码}
-\parinterval 在使用循环神经网络进行序列的信息提取时，每个时刻的运算都要依赖前一个时刻的输出，具有一定的时序性，这也与语言具有顺序的特点相契合。而采用自注意力机制对源语言和目标语言序列进行处理时，直接对当前位置和序列中的任意位置进行建模，忽略了词之间的顺序关系，例如图\ref{fig:6-41}中两个语义不同的句子，通过自注意力得到的表示$\mathbf{C}$(``机票'')却是相同的。
+\parinterval 在使用循环神经网络进行序列的信息提取时，每个时刻的运算都要依赖前一个时刻的输出，具有一定的时序性，这也与语言具有顺序的特点相契合。而采用自注意力机制对源语言和目标语言序列进行处理时，直接对当前位置和序列中的任意位置进行建模，忽略了词之间的顺序关系，例如图\ref{fig:6-41}中两个语义不同的句子，通过自注意力得到的表示$\tilde{\mathbf{h}}$(``机票'')却是相同的。
 %----------------------------------------------
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Calculation-of-context-vector-C}
+\input{./Chapter6/Figures/figure-calculation-of-context-vector-C}
 \caption{上下文向量$\mathbf{C}$的计算}
 \label{fig:6-41}
 \end{figure}
@@ -1418,7 +1418,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-A-combination-of-position-encoding-and-word-encoding}
+\input{./Chapter6/Figures/figure-a-combination-of-position-encoding-and-word-encoding}
 \caption{位置编码与词编码的组合}
 \label{fig:6-43}
 \end{figure}
@@ -1448,7 +1448,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Position-of-self-attention-mechanism-in-the-model}
+\input{./Chapter6/Figures/figure-position-of-self-attention-mechanism-in-the-model}
 \caption{自注意力机制在模型中的位置}
 \label{fig:6-44}
 \end{figure}
@@ -1479,7 +1479,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Point-product-attention-model}
+\input{./Chapter6/Figures/figure-point-product-attention-model}
 \caption{点乘注意力力模型 }
 \label{fig:6-45}
 \end{figure}
@@ -1511,7 +1511,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Mask-instance-for-future-positions-in-Transformer}
+\input{./Chapter6/Figures/figure-mask-instance-for-future-positions-in-transformer}
 \caption{Transformer中对于未来位置进行的屏蔽的Mask实例}
 \label{fig:6-47}
 \end{figure}
@@ -1535,7 +1535,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Multi-Head-Attention-Model}
+\input{./Chapter6/Figures/figure-multi-head-attention-model}
 \caption{多头注意力模型}
 \label{fig:6-48}
 \end{figure}
@@ -1560,7 +1560,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Residual-network-structure}
+\input{./Chapter6/Figures/figure-residual-network-structure}
 \caption{残差网络结构}
 \label{fig:6-49}
 \end{figure}
@@ -1579,7 +1579,7 @@ x_{l+1} = x_l + \digamma (x_l)
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Position-of-difference-and-layer-regularization-in-the-model}
+\input{./Chapter6/Figures/figure-position-of-difference-and-layer-regularization-in-the-model}
 \caption{残差和层正则化在模型中的位置}
 \label{fig:6-50}
 \end{figure}
@@ -1594,13 +1594,13 @@ x_{l+1} = x_l + \digamma (x_l)
 \noindent 该公式使用均值$\mu$和方差$\sigma$对样本进行平移缩放，将数据规范化为均值为0，方差为1的标准分布。$g$和$b$是可学习的参数。
-\parinterval 在Transformer中经常使用的层正则化操作有两种结构，分别是{\small\bfnew{后正则化}}\index{后正则化}（Post-norm）\index{Post-norm}和{\small\bfnew{前正则化}}\index{前正则化}（Pre-norm）\index{Pre-norm}。后正则化中先进行残差连接再进行层正则化，而前正则化则是在子层输入之前进行层正则化操作。在很多实践中已经发现，前正则化的方式更有利于信息传递，因此适合训练深层的Transformer模型\cite{WangLearning}。
+\parinterval 在Transformer中经常使用的层正则化操作有两种结构，分别是{\small\bfnew{后正则化}}\index{后正则化}（Post-norm）\index{Post-norm}和{\small\bfnew{前正则化}}\index{前正则化}（Pre-norm）\index{Pre-norm}，结构如图\ref{fig:6-51}。后正则化中先进行残差连接再进行层正则化，而前正则化则是在子层输入之前进行层正则化操作。在很多实践中已经发现，前正则化的方式更有利于信息传递，因此适合训练深层的Transformer模型\cite{WangLearning}。
 %----------------------------------------------
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Different-regularization-methods}
+\input{./Chapter6/Figures/figure-different-regularization-methods}
 \caption{不同正则化方式 }
 \label{fig:6-51}
 \end{figure}
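Note on the hunk above: the paragraph contrasts post-norm (residual connection first, then layer normalization) with pre-norm (layer normalization applied to the sub-layer input). The sketch below shows the two orderings on plain Python lists. It assumes scalar g and b, a small eps for numerical stability, and a hypothetical stand-in sub-layer called `double`; none of these come from the book.

```python
import math


def layer_norm(x, g=1.0, b=0.0, eps=1e-6):
    """Normalize x to zero mean and unit variance, then scale by g and shift by b."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [g * (v - mu) / math.sqrt(var + eps) + b for v in x]


def post_norm(x, sublayer):
    """Post-norm: add the residual first, then normalize the sum."""
    return layer_norm([a + f for a, f in zip(x, sublayer(x))])


def pre_norm(x, sublayer):
    """Pre-norm: normalize the sub-layer input, then add the untouched residual."""
    return [a + f for a, f in zip(x, sublayer(layer_norm(x)))]


def double(x):
    """Hypothetical stand-in for an attention or feed-forward sub-layer."""
    return [2.0 * v for v in x]
```

In the pre-norm variant the input x reaches the output without passing through the normalization, which is one way to read the paragraph's remark that this form passes information more easily in deep models.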
@@ -1613,7 +1613,7 @@ x_{l+1} = x_l + \digamma (x_l)
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Position-of-feedforward-neural-network-in-the-model}
+\input{./Chapter6/Figures/figure-position-of-feedforward-neural-network-in-the-model}
 \caption{前馈神经网络在模型中的位置}
 \label{fig:6-52}
 \end{figure}
@@ -1636,7 +1636,7 @@ x_{l+1} = x_l + \digamma (x_l)
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Structure-of-the-network-during-Transformer-training}
+\input{./Chapter6/Figures/figure-structure-of-the-network-during-transformer-training}
 \caption{Transformer训练时网络的结构}
 \label{fig:6-53}
 \end{figure}
@@ -1676,7 +1676,7 @@ lrate = d_{model}^{-0.5} \cdot \textrm{min} (step^{-0.5} , step \cdot warmup\_st
 % 图3.10
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Comparison-of-the-number-of-padding-in-batch}
+\input{./Chapter6/Figures/figure-comparison-of-the-number-of-padding-in-batch}
 \caption{batch中padding数量对比（白色部分为padding）}
 \label{fig:6-55}
 \end{figure}
@@ -1752,7 +1752,7 @@ Transformer Deep(48层) & 30.2            & 43.1            & 194$\times 10^{6}$
 % 图3.6.1
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Generate-summary}
+\input{./Chapter6/Figures/figure-generate-summary}
 \caption{文本自动摘要实例}
 \label{fig:6-57}
 \end{figure}
@@ -1764,7 +1764,7 @@ Transformer Deep(48层) & 30.2            & 43.1            & 194$\times 10^{6}$
 % 图3.6.1
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Example-of-automatic-translation-of-classical-Chinese}
+\input{./Chapter6/Figures/figure-example-of-automatic-translation-of-classical-chinese}
 \caption{文言文自动翻译实例}
 \label{fig:6-58}
 \end{figure}
@@ -1780,7 +1780,7 @@ Transformer Deep(48层) & 30.2            & 43.1            & 194$\times 10^{6}$
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Automatically-generate-instances-of-couplets}
+\input{./Chapter6/Figures/figure-automatically-generate-instances-of-couplets}
 \caption{对联自动生成实例（人工给定上联）}
 \label{fig:6-59}
 \end{figure}
@@ -1796,7 +1796,7 @@ Transformer Deep(48层) & 30.2            & 43.1            & 194$\times 10^{6}$
 \begin{figure}[htp]
 \centering
-\input{./Chapter6/Figures/figure-Automatic-generation-of-ancient-poems-based-on-encoder-decoder-framework}
+\input{./Chapter6/Figures/figure-automatic-generation-of-ancient-poems-based-on-encoder-decoder-framework}
 \caption{基于编码器-解码器框架的古诗自动生成}
 \label{fig:6-60}
 \end{figure}

--- a/Book/Chapter6/Figures/Big learning rate vs Small learning rate.png
+++ b/Book/Chapter6/Figures/Big learning rate vs Small learning rate.png
--- a/Book/Chapter6/Figures/figure-A-working-example-of-neural-machine-translation.tex
+++ b/Book/Chapter6/Figures/figure-A-working-example-of-neural-machine-translation.tex
@@ -4,12 +4,13 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  \begin{tikzpicture}
        \setlength{\base}{0.9cm}
        \tikzstyle{rnnnode} = [rounded corners=1pt,minimum size=0.5\base,draw,inner sep=0pt,outer sep=0pt]
-        \tikzstyle{wordnode} = [font=\tiny]
+        \tikzstyle{wordnode} = [font=\scriptsize]
        % RNN translation model
        \begin{scope}[local bounding box=RNNMT]
@@ -23,8 +24,12 @@
 			\node[wordnode,anchor=east] (init2) at ([xshift=-3.0em]init.west){};
               {
                \node[rnnnode,fill=purple] (repr) at (enc4) {};
-                \node[wordnode] (label) at ([xshift=3.5em]enc4.east) {源语言句子表示};
+                \node[wordnode] (label) at ([yshift=2.5em]enc4.north) {
-                \draw[->,dashed,thick] (label.west) -- (enc4.east);
+                \begin{tabular}{c}
+                源语言句\\子表示
+                \end{tabular}
+                };
+                \draw[->,dashed,thick] (label.south) -- (enc4.north);
            }
            \node[wordnode,below=0pt of eemb1,font=\scriptsize] (encwordin1) {我};
@@ -37,7 +42,7 @@
            % RNN Decoder
            \foreach \x in {1,2,...,4}
-                 \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=south] (demb\x) at ([xshift=0.0em,yshift=3.0em]enc\x.north) {};
+                 \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=south] (demb\x) at ([xshift=9.0em,yshift=-3.5em]enc\x.north) {};
            \foreach \x in {1,2,...,4}
                \node[rnnnode,fill=blue!30!white,anchor=south] (dec\x) at ([yshift=0.5\base]demb\x.north) {};
            \foreach \x in {1,2,...,4}
@@ -86,7 +91,7 @@
                \draw[-latex'] (dec\x.east) to (dec\y.west);
            }
            \coordinate (bridge) at ([yshift=-1.15\base]demb2);
-          \draw[-latex'] (enc4.north) .. controls +(north:0.4\base) and +(east:0.5\base) .. (bridge) .. controls +(west:2.4\base) and +(west:0.5\base) .. (dec1.west);
+          \draw[-latex'] (enc4.east) -- (dec1.west);
        \end{scope}
    \end{tikzpicture}
@@ -124,3 +129,4 @@
--- a/Book/Chapter6/Figures/figure-Calculation-of-context-vector-C.tex
+++ b/Book/Chapter6/Figures/figure-Calculation-of-context-vector-C.tex
@@ -50,7 +50,7 @@
 \vspace{-1.0em}
 \footnotesize{
 \begin{eqnarray}
-\textbf{C}(\textrm{''机票''}) & = & 0.2 \times \textbf{h}(\textrm{``沈阳''}) + 0.3 \times \textbf{h}(\textrm{``到''}) + \nonumber \\
+\tilde{\mathbf{h}} (\textrm{``机票''}) & = & 0.2 \times \textbf{h}(\textrm{``沈阳''}) + 0.3 \times \textbf{h}(\textrm{``到''}) + \nonumber \\
             &   & 0.1 \times \textbf{h}(\textrm{``广州''}) + ... + 0.3 \times \textbf{h}(\textrm{``机票''}) \nonumber
 \end{eqnarray}
 }
\ No newline at end of file
--- a/Book/Chapter6/Figures/figure-Generate-summary.tex
+++ b/Book/Chapter6/Figures/figure-Generate-summary.tex
@@ -16,7 +16,7 @@ Jenson Button was denied his 100th race for McLaren after an ERS prevented him f
 };
 %译文1--------------mt1
 \node[font=\small] (mt1) at ([xshift=0em,yshift=-16.8em]original0.south) {系统生成\quad};
-\node[font=\small] (mt-2) at ([xshift=0em,yshift=-0.5em]mt1.south) {\quad 的摘要：};
+\node[font=\small] (mt-2) at ([xshift=0em,yshift=-0.5em]mt1.south) {的摘要：\quad };
 \node[font=\small] (ts1) at ([xshift=0em,yshift=-3em]original1.south)  {
 \begin{tabular}[t]{l}
 \parbox{32em}{

--- a/Book/Chapter6/Figures/figure-Multi-Head-Attention-Model.tex
+++ b/Book/Chapter6/Figures/figure-Multi-Head-Attention-Model.tex
@@ -4,28 +4,28 @@
 \begin{tikzpicture}
 \begin{scope}
-\node [anchor=west,draw=black!30,inner sep=4pt,fill=ugreen!20!white] (Linear0) at (0,0) {\tiny{Linear}};
+\node [anchor=west,draw=black!30,inner sep=4pt,fill=ugreen!20!white] (Linear0) at (0,0) {\footnotesize{Linear}};
-\node [anchor=south west,draw=black!50,fill=ugreen!20!white,draw,inner sep=4pt] (Linear01) at ([shift={(-0.2em,-0.2em)}]Linear0.south west) {\tiny{Linear}};
+\node [anchor=south west,draw=black!50,fill=ugreen!20!white,draw,inner sep=4pt] (Linear01) at ([shift={(-0.2em,-0.2em)}]Linear0.south west) {\footnotesize{Linear}};
-\node [anchor=south west,fill=ugreen!20!white,draw,inner sep=4pt] (Linear02) at ([shift={(-0.2em,-0.2em)}]Linear01.south west) {\tiny{Linear}};
+\node [anchor=south west,fill=ugreen!20!white,draw,inner sep=4pt] (Linear02) at ([shift={(-0.2em,-0.2em)}]Linear01.south west) {\footnotesize{Linear}};
 \node [anchor=north] (Q) at ([xshift=0em,yshift=-1em]Linear02.south) {\footnotesize{$\mathbf{Q}$}};
-\node [anchor=west,draw=black!30,inner sep=4pt,fill=ugreen!20!white] (Linear1) at ([xshift=1.5em]Linear0.east) {\tiny{Linear}};
+\node [anchor=west,draw=black!30,inner sep=4pt,fill=ugreen!20!white] (Linear1) at ([xshift=1.5em]Linear0.east) {\footnotesize{Linear}};
-\node [anchor=south west,draw=black!50,fill=ugreen!20!white,draw,inner sep=4pt] (Linear11) at ([shift={(-0.2em,-0.2em)}]Linear1.south west) {\tiny{Linear}};
+\node [anchor=south west,draw=black!50,fill=ugreen!20!white,draw,inner sep=4pt] (Linear11) at ([shift={(-0.2em,-0.2em)}]Linear1.south west) {\footnotesize{Linear}};
-\node [anchor=south west,fill=ugreen!20!white,draw,inner sep=4pt] (Linear12) at ([shift={(-0.2em,-0.2em)}]Linear11.south west) {\tiny{Linear}};
+\node [anchor=south west,fill=ugreen!20!white,draw,inner sep=4pt] (Linear12) at ([shift={(-0.2em,-0.2em)}]Linear11.south west) {\footnotesize{Linear}};
 \node [anchor=north] (K) at ([xshift=0em,yshift=-1em]Linear12.south) {\footnotesize{$\mathbf{K}$}};
-\node [anchor=west,draw=black!30,inner sep=4pt,fill=ugreen!20!white] (Linear2) at ([xshift=1.5em]Linear1.east) {\tiny{Linear}};
+\node [anchor=west,draw=black!30,inner sep=4pt,fill=ugreen!20!white] (Linear2) at ([xshift=1.5em]Linear1.east) {\footnotesize{Linear}};
-\node [anchor=south west,draw=black!50,fill=ugreen!20!white,draw,inner sep=4pt] (Linear21) at ([shift={(-0.2em,-0.2em)}]Linear2.south west) {\tiny{Linear}};
+\node [anchor=south west,draw=black!50,fill=ugreen!20!white,draw,inner sep=4pt] (Linear21) at ([shift={(-0.2em,-0.2em)}]Linear2.south west) {\footnotesize{Linear}};
-\node [anchor=south west,fill=ugreen!20!white,draw,inner sep=4pt] (Linear22) at ([shift={(-0.2em,-0.2em)}]Linear21.south west) {\tiny{Linear}};
+\node [anchor=south west,fill=ugreen!20!white,draw,inner sep=4pt] (Linear22) at ([shift={(-0.2em,-0.2em)}]Linear21.south west) {\footnotesize{Linear}};
 \node [anchor=north] (V) at ([xshift=0em,yshift=-1em]Linear22.south) {\footnotesize{$\mathbf{V}$}};
-\node [anchor=south,draw=black!30,minimum width=9em,inner sep=4pt,fill=blue!20!white] (Scale) at ([yshift=1em]Linear1.north) {\tiny{Scaled Dot-Product Attention}};
+\node [anchor=south,draw=black!30,minimum width=12em,minimum height=2em,inner sep=4pt,fill=blue!20!white] (Scale) at ([yshift=1em]Linear1.north) {\footnotesize{Scaled Dot-Product Attention}};
-\node [anchor=south west,draw=black!50,minimum width=9em,fill=blue!20!white,draw,inner sep=4pt] (Scale1) at ([shift={(-0.2em,-0.2em)}]Scale.south west) {\tiny{Scaled Dot-Product Attention}};
+\node [anchor=south west,draw=black!50,minimum width=12em,minimum height=2em,fill=blue!20!white,draw,inner sep=4pt] (Scale1) at ([shift={(-0.2em,-0.2em)}]Scale.south west) {\footnotesize{Scaled Dot-Product Attention}};
-\node [anchor=south west,fill=blue!20!white,draw,minimum width=9em,inner sep=4pt] (Scale2) at ([shift={(-0.2em,-0.2em)}]Scale1.south west) {\tiny{Scaled Dot-Product Attention}};
+\node [anchor=south west,fill=blue!20!white,draw,minimum width=12em,minimum height=2em,inner sep=4pt] (Scale2) at ([shift={(-0.2em,-0.2em)}]Scale1.south west) {\footnotesize{Scaled Dot-Product Attention}};
-\node [anchor=south,draw,minimum width=4em,inner sep=4pt,fill=yellow!30] (Concat) at ([yshift=1em]Scale2.north) {\tiny{Concat}};
+\node [anchor=south,draw,minimum width=4em,inner sep=4pt,fill=yellow!30] (Concat) at ([yshift=1em]Scale2.north) {\footnotesize{Concat}};
-\node [anchor=south,draw,minimum width=4em,inner sep=4pt,fill=ugreen!20!white] (Linear) at ([yshift=1em]Concat.north) {\tiny{Linear}};
+\node [anchor=south,draw,minimum width=4em,inner sep=4pt,fill=ugreen!20!white] (Linear) at ([yshift=1em]Concat.north) {\footnotesize{Linear}};
 \draw [->] ([yshift=0.1em]Q.north) -- ([yshift=-0.1em]Linear02.south);

--- a/Book/Chapter6/Figures/figure-Point-product-attention-model.tex
+++ b/Book/Chapter6/Figures/figure-Point-product-attention-model.tex
@@ -23,11 +23,11 @@
 \draw [->] ([yshift=0.1em]Scale3.north) -- ([yshift=-0.1em]Mask.south);
 \draw [->] ([yshift=0.1em]Mask.north) -- ([yshift=-0.1em]SoftMax.south);
 \draw [->] ([yshift=0.1em]SoftMax.north) -- ([yshift=0.9em]SoftMax.north);
-\draw [->] ([yshift=0.1em]V1.north) -- ([yshift=9.1em]V1.north);
+\draw [->] ([yshift=0.1em]V1.north) -- ([yshift=9.3em]V1.north);
 \draw [->] ([yshift=0.1em]MatMul1.north) -- ([yshift=0.8em]MatMul1.north);
 {
-\node [anchor=east] (line1) at ([xshift=-3em,yshift=1em]MatMul.west) {\scriptsize{自注意力机制的Query}};
+\node [anchor=east] (line1) at ([xshift=-4em,yshift=1em]MatMul.west) {\scriptsize{自注意力机制的Query}};
 \node [anchor=north west] (line2) at ([yshift=0.3em]line1.south west) {\scriptsize{Key和Value均来自同一句子}};
 \node [anchor=north west] (line3) at ([yshift=0.3em]line2.south west) {\scriptsize{编码-解码注意力机制}};
 \node [anchor=north west] (line4) at ([yshift=0.3em]line3.south west) {\scriptsize{与前面讲的一样}};
@@ -60,7 +60,7 @@
 {
 \node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=green!10,drop shadow,draw=ugreen] [fit = (line1) (line2) (line3) (line4)] (box1) {};
 \node [rectangle,inner sep=0.1em,rounded corners=1pt,very thick,dotted,draw=ugreen] [fit = (Q1) (K1) (V1)] (box0) {};
-\draw [->,dotted,very thick,ugreen] ([yshift=-1.5em,xshift=0.8em]box1.east) -- ([yshift=-1.5em,xshift=0.1em]box1.east);
+\draw [->,dotted,very thick,ugreen] ([yshift=-1.5em,xshift=1.2em]box1.east) -- ([yshift=-1.5em,xshift=0.1em]box1.east);
 }
 {
 \node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=blue!20!white,drop shadow,draw=blue] [fit = (line11) (line12) (line13)] (box2) {};
@@ -74,7 +74,7 @@
 {
 \node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=red!10,drop shadow,draw=red] [fit = (line31) (line32) (line33) (line34)] (box4) {};
-\draw [->,dotted,very thick,red] ([yshift=-1.5em,xshift=1.5em]box4.east) -- ([yshift=-1.5em,xshift=0.1em]box4.east);
+\draw [->,dotted,very thick,red] ([yshift=-1.2em,xshift=2.2em]box4.east) -- ([yshift=-1.2em,xshift=0.1em]box4.east);
 }
 {

--- a/Book/Chapter6/Figures/figure-Query-model-corresponding-to-traditional-query-model-vs-attention-mechanism.tex
+++ b/Book/Chapter6/Figures/figure-Query-model-corresponding-to-traditional-query-model-vs-attention-mechanism.tex
@@ -28,46 +28,7 @@
 \draw [->] ([yshift=1pt]query.north) .. controls +(90:2em) and +(90:2em) .. ([yshift=1pt]key3.north) node [pos=0.5,below,yshift=0.2em] {\scriptsize{匹配}};
 \node [anchor=north] (result) at (value3.south) {\scriptsize{ {\red 返回结果} }};
-\node [anchor=north] (result2) at ([xshift=-2em,yshift=-2em]value2.south) {\footnotesize{ { (a)索引的查询过程} }};
 \end{scope}
 \end{tikzpicture}
-\begin{tikzpicture}
-\begin{scope}
-\tikzstyle{rnode} = [draw,minimum width=3em,minimum height=1.2em]
-\node [rnode,anchor=south west,fill=red!20!white] (value1) at (0,0) {\scriptsize{value$_1$}};
-\node [rnode,anchor=south west,fill=red!20!white] (value2) at ([xshift=1em]value1.south east) {\scriptsize{value$_2$}};
-\node [rnode,anchor=south west,fill=red!20!white] (value3) at ([xshift=1em]value2.south east) {\scriptsize{value$_3$}};
-\node [rnode,anchor=south west,fill=red!20!white] (value4) at ([xshift=1em]value3.south east) {\scriptsize{value$_4$}};
-\node [rnode,anchor=south west,pattern=north east lines] (key1) at ([yshift=0.2em]value1.north west) {};
-\node [rnode,anchor=south west,pattern=dots] (key2) at ([yshift=0.2em]value2.north west) {};
-\node [rnode,anchor=south west,pattern=horizontal lines] (key3) at ([yshift=0.2em]value3.north west) {};
-\node [rnode,anchor=south west,pattern=crosshatch dots] (key4) at ([yshift=0.2em]value4.north west) {};
-\node [fill=white,inner sep=1pt] (key1label) at (key1) {\scriptsize{key$_1$}};
-\node [fill=white,inner sep=1pt] (key1label) at (key2) {\scriptsize{key$_2$}};
-\node [fill=white,inner sep=1pt] (key1label) at (key3) {\scriptsize{key$_3$}};
-\node [fill=white,inner sep=1pt] (key1label) at (key4) {\scriptsize{key$_4$}};
-\node [rnode,anchor=east,pattern=vertical lines] (query) at ([xshift=-3em]key1.west) {};
-\node [anchor=east] (querylabel) at ([xshift=-0.2em]query.west) {\scriptsize{query}};
-\draw [->] ([yshift=1pt,xshift=6pt]query.north) .. controls +(90:1em) and +(90:1em) .. ([yshift=1pt]key1.north);
-\draw [->] ([yshift=1pt,xshift=3pt]query.north) .. controls +(90:1.5em) and +(90:1.5em) .. ([yshift=1pt]key2.north);
-\draw [->] ([yshift=1pt]query.north) .. controls +(90:2em) and +(90:2em) .. ([yshift=1pt]key3.north);
-\draw [->] ([yshift=1pt,xshift=-3pt]query.north) .. controls +(90:2.5em) and +(90:2.5em) .. ([yshift=1pt]key4.north);
-\node [anchor=south east] (alpha1) at (key1.north east) {\scriptsize{$\alpha_1$}};
-\node [anchor=south east] (alpha2) at (key2.north east) {\scriptsize{$\alpha_2$}};
-\node [anchor=south east] (alpha3) at (key3.north east) {\scriptsize{$\alpha_3$}};
-\node [anchor=south east] (alpha4) at (key4.north east) {\scriptsize{$\alpha_4$}};
-\node [anchor=north] (result) at ([xshift=-1.5em]value2.south east) {\scriptsize{{\red 返回结果}=$\alpha_1 \cdot \textrm{value}_1 + \alpha_2 \cdot \textrm{value}_2 + \alpha_3 \cdot \textrm{value}_3 + \alpha_4 \cdot \textrm{value}_4$}};
-\node [anchor=north] (result2) at ([xshift=-1em,yshift=-2.5em]value2.south) {\footnotesize{ { (b)注意力机制查询过程} }};
-\end{scope}
-\end{tikzpicture}
\ No newline at end of file
--- a/Book/Chapter6/Figures/figure-convergence&lr.tex
+++ b/Book/Chapter6/Figures/figure-convergence&lr.tex
@@ -2,14 +2,14 @@
 \begin{tikzpicture}
 \begin{axis}[
  name=s1,  
-  width=7cm, height=4cm, 
+  width=7cm, height=4.5cm, 
  xtick={-4,-3,-2,-1,0,1,2,3,4},
  ytick={0,1,...,4},
  xticklabel style={opacity=0},
  yticklabel style={opacity=0},
  xlabel={$w$},
  ylabel={$L(w)$},
-  axis line style={->},
+  axis line style={->,very thick},
  xlabel style={xshift=2.2cm,yshift=1.2cm},
  ylabel style={rotate=-90,xshift=1.5cm,yshift=1.6cm},
  tick align=inside,
@@ -19,7 +19,7 @@
  xmin=-4,
  xmax=4,
  ymin=0,
-  ymax=4]
+  ymax=4.5]
 \addplot [dashed,ublue,thick] {x^2/4};
 \addplot [quiver={u=1,v=x/2,scale arrows = 0.25},domain=-4:-0.3,->,samples=10,red!60,ultra thick] {x^2/4};
 \addplot [draw=ublue,fill=red,mark=*] coordinates{(0,0)};
@@ -29,14 +29,14 @@
  anchor=south, 
  xshift=6cm,
  yshift=0cm,
-  width=7cm, height=4cm, 
+  width=7cm, height=4.5cm, 
  xtick={-4,-3,-2,-1,0,1,2,3,4},
  ytick={0,1,...,4},
  xticklabel style={opacity=0},
  yticklabel style={opacity=0},
  xlabel={$w$},
  ylabel={$L(w)$},
-  axis line style={->},
+  axis line style={->,very thick},
  xlabel style={xshift=2.2cm,yshift=1.2cm},
  ylabel style={rotate=-90,xshift=1.5cm,yshift=1.6cm},
  tick align=inside,
@@ -46,7 +46,7 @@
  xmin=-4,
  xmax=4,
  ymin=0,
-  ymax=4]
+  ymax=4.5]
 \addplot [dashed,ublue,thick] {x^2/4};
 \addplot [quiver={u=-x-(x/abs(x))*(1+x^2-4)^(1/2),v=-0.7},domain=-4:3.6,->,samples=2,red!60,ultra thick] {x^2/4};
 \addplot [quiver={u=-x-(x/abs(x))*(1+x^2-4)^(1/2),v=-0.7},domain=-3.13:2.6,->,samples=2,red!60,ultra thick] {x^2/4};

--- a/Book/Chapter6/Figures/figure-transformer.tex
+++ b/Book/Chapter6/Figures/figure-transformer.tex
@@ -62,5 +62,8 @@
 \node [rectangle,inner sep=0.7em,rounded corners=1pt,very thick,dotted,draw=ugreen!70] [fit = (sa1) (res1) (ffn1) (res2)] (box0) {};
 \node [rectangle,inner sep=0.7em,rounded corners=1pt,very thick,dotted,draw=red!60] [fit = (sa2) (res3) (res5)] (box1) {};
+\node [ugreen] (count) at ([xshift=-1.7em,yshift=-1em]encoder.south) {$6\times$};
+\node [red] (count) at ([xshift=11em,yshift=0em]decoder.south) {$\times 6$};
 \end{scope}
 \end{tikzpicture}
\ No newline at end of file
--- a/Book/Chapter7/Chapter7.tex
+++ b/Book/Chapter7/Chapter7.tex
@@ -90,7 +90,7 @@
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
-\input{./Chapter7/Figures/figure-construction-steps-of-MT-system}
+\input{./Chapter7/Figures/figure-construction-steps-of-mt-system}
 \caption{构建神经机器翻译系统的主要步骤}
 \label{fig:7-2}
 \end{figure}
@@ -364,7 +364,7 @@
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-unk-of-bpe}
-\caption{BPE中<UNK>的生成}
+\caption{BPE中的子词切分过程}
 \label{fig:7-10}
 \end{figure}
 %----------------------------------------------
@@ -417,7 +417,7 @@ y = f(x)
 % 图7.
 \begin{figure}[htp]
 \centering
-\input{./Chapter7/Figures/figure-Underfitting-vs-Overfitting}
+\input{./Chapter7/Figures/figure-underfitting-vs-overfitting}
 \caption{欠拟合 vs 过拟合}
 \label{fig:7-11}
 \end{figure}
@@ -1155,7 +1155,7 @@ b &=& \omega_{\textrm{high}}\cdot |\mathbf{x}|
 \parinterval 有了lattice这样的结构，多模型融合又有了新的思路。首先，可以将多个模型的译文融合为lattice。注意，这个lattice会包含这些模型无法生成的完整译文句子。之后，用一个更强的模型在lattice上搜索最优的结果。这个过程有可能找到一些``新''的译文，即结果可能是从多个模型的结果中重组而来的。lattice上的搜索模型可以基于多模型的融合，也可以使用一个简单的模型，这里需要考虑的是将神经机器翻译模型适应到lattice上进行推断\cite{DBLP:conf/aaai/SuTXJSL17}。其过程基本与原始的模型推断没有区别，只是需要把模型预测的结果附着到lattice中的每条边上，再进行推断。
-\parinterval 图\ref{fig:7-27}对比了不同模型集成方法的区别。从系统开发的角度看，假设选择和模型预测融合的复杂度较低，适合快速原型，而且性能稳定。译文重组需要更多的模块，系统调试的复杂度较高，但是由于看到了更大的搜索空间，因此系统性能提升的潜力较大\footnote{一般来说lattice上的Oracle要比$n$-best译文上的oracle的质量高。}。
+\parinterval 图\ref{fig:7-27}对比了不同模型集成方法的区别。从系统开发的角度看，假设选择和模型预测融合的复杂度较低，适合快速原型，而且性能稳定。译文重组需要更多的模块，系统调试的复杂度较高，但是由于看到了更大的搜索空间，因此系统性能提升的潜力较大\footnote{一般来说lattice上的Oracle要比$n$-best译文上的Oracle的质量高。}。
 %----------------------------------------------
 % 图7.
@@ -1191,7 +1191,7 @@ b &=& \omega_{\textrm{high}}\cdot |\mathbf{x}|
 % 图7.5.1
 \begin{figure}[htp]
 \centering
-\input{./Chapter7/Figures/Post-Norm-vs-Pre-Norm}
+\input{./Chapter7/Figures/figure-post-norm-vs-pre-norm}
 \caption{Post-Norm Transformer vs Pre-Norm Transformer}
 \label{fig:7-28}
 \end{figure}
@@ -1261,7 +1261,7 @@ z_{l}=\textrm{LN}(x_{l+1})
 \end{eqnarray}
 注意，$z_0$表示词嵌入层的输出，$z_l(l>0)$表示Transformer网络中最终的各层输出。
 \vspace{0.5em}
-\item 	定义一个维度为$(L+1)\times(L+1)$的权值矩阵$\mathbf{W}$，矩阵中每一行表示之前各子层对当前层计算的贡献度，其中$L$是编码端（或解码端）的层数。令$\mathbf{W}_{l,i}$代表权值矩阵$\mathbf{W}$第$l$行第$i$列的权重，则层聚合的输出为$z_i$的线性加权和：
+\item 	定义一个维度为$(L+1)\times(L+1)$的权值矩阵$\mathbf{W}$，矩阵中每一行表示之前各层对当前层计算的贡献度，其中$L$是编码端（或解码端）的层数。令$\mathbf{W}_{l,i}$代表权值矩阵$\mathbf{W}$第$l$行第$i$列的权重，则层聚合的输出为$z_i$的线性加权和：
 \begin{eqnarray}
 g_l=\sum_{i=0}^{l}z_i\times \mathbf{W}_{l,i}
 \label{eq:7-21}
@@ -1273,7 +1273,7 @@ $g_l$会作为输入的一部分送入第$l+1$层。其网络的结构图\ref{fi
 % 图7.5.2
 \begin{figure}[htp]
 \centering
-\input{./Chapter7/Figures/dynamic-linear-aggregation-network-structure}
+\input{./Chapter7/Figures/figure-dynamic-linear-aggregation-network-structure}
 \caption{动态线性层聚合网络结构图}
 \label{fig:7-29}
 \end{figure}
@@ -1299,7 +1299,7 @@ $g_l$会作为输入的一部分送入第$l+1$层。其网络的结构图\ref{fi
 % 图7.5.3
 \begin{figure}[htp]
 \centering
-\input{./Chapter7/Figures/progressive-training}
+\input{./Chapter7/Figures/figure-progressive-training}
 \caption{渐进式深层网络训练过程}
 \label{fig:7-30}
 \end{figure}
@@ -1316,7 +1316,7 @@ $g_l$会作为输入的一部分送入第$l+1$层。其网络的结构图\ref{fi
 % 图7.5.4
 \begin{figure}[htp]
 \centering
-\input{./Chapter7/Figures/sparse-connections-between-different-groups}
+\input{./Chapter7/Figures/figure-sparse-connections-between-different-groups}
 \caption{不同组之间的稀疏连接}
 \label{fig:7-31}
 \end{figure}
@@ -1335,7 +1335,7 @@ $g_l$会作为输入的一部分送入第$l+1$层。其网络的结构图\ref{fi
 % 图7.5.5
 \begin{figure}[htp]
 \centering
-\input{./Chapter7/Figures/learning-rate}
+\input{./Chapter7/Figures/figure-learning-rate}
 \caption{学习率重置vs从头训练的学习率曲线}
 \label{fig:7-32}
 \end{figure}
@@ -1411,7 +1411,7 @@ p_l=\frac{l}{2L}\cdot \varphi
 % 图7.5.7
 \begin{figure}[htp]
 \centering
-\input{./Chapter7/Figures/expanded-residual-network}
+\input{./Chapter7/Figures/figure-expanded-residual-network}
 \caption{Layer Dropout中残差网络的展开图}
 \label{fig:7-34}
 \end{figure}
@@ -1633,7 +1633,7 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\textbf{y}} | \textbf{x})
 \begin{figure}[htp]
 \centering
 \input{./Chapter7/Figures/figure-ensemble-knowledge-distillation}
-\caption{Ensemble知识精炼}
+\caption{迭代式知识精炼}
 \label{fig:7-41}
 \end{figure}
 %-------------------------------------------

--- a/Book/Chapter7/Figures/figure-application-process-of-back-translation.tex
+++ b/Book/Chapter7/Figures/figure-application-process-of-back-translation.tex
 \begin{tikzpicture}
 \begin{scope}
-\node [anchor=center] (node1) at (-2.9,1) {\small{训练：}};
+\node [anchor=center] (node1) at (4.9,1) {\small{训练：}};
-\node [anchor=center] (node11) at (-2.5,1) {};
+\node [anchor=center] (node11) at (5.5,1) {};
-\node [anchor=center] (node12) at (-1.7,1) {};
+\node [anchor=center] (node12) at (6.7,1) {};
-\node [anchor=center] (node2) at (-2.9,0.5) {\small{推理：}};
+\node [anchor=center] (node2) at (4.9,0.5) {\small{推理：}};
-\node [anchor=center] (node21) at (-2.5,0.5) {};
+\node [anchor=center] (node21) at (5.5,0.5) {};
-\node [anchor=center] (node22) at (-1.7,0.5) {};
+\node [anchor=center] (node22) at (6.7,0.5) {};
 \node [anchor=west,line width=0.6pt,draw=black,minimum width=5.6em,minimum height=2.2em,fill=blue!20,rounded corners=2pt] (node1-1) at (0,0) {\footnotesize{双语数据}};
 \node [anchor=south,line width=0.6pt,draw=black,minimum width=4.5em,minimum height=2.2em,fill=blue!20,rounded corners=2pt] (node1-2) at ([yshift=-5em]node1-1.south) {\footnotesize{目标语伪数据}};
 \node [anchor=west,line width=0.6pt,draw=black,minimum width=4.5em,minimum height=2.2em,fill=red!20,rounded corners=2pt] (node2-1) at ([xshift=-8.8em,yshift=-2.5em]node1-1.west) {\footnotesize{反向NMT系统}};

--- a/Book/Chapter7/Figures/figure-different-softmax.tex
+++ b/Book/Chapter7/Figures/figure-different-softmax.tex
 \begin{tikzpicture}
-\tikzstyle{layer} = [rectangle,draw,rounded corners=3pt,minimum width=1cm,minimum height=0.5cm];
+\tikzstyle{layer} = [rectangle,draw,rounded corners=3pt,minimum width=1cm,minimum height=0.5cm,line width=1pt];
 \tikzstyle{prob} = [minimum width=0.3cm,rectangle,fill=ugreen!20!white,inner sep=0pt];
 \begin{scope}[local bounding box=STANDARD]
@@ -22,8 +22,8 @@
    \path [fill=blue!20!white,draw=white] (out1.north west) -- (prob1.south west) -- (prob9.south east) -- (out1.north east) -- (out1.north west);
-    \draw [->] (input1) to (net1);
+    \draw [->,line width=1pt] (input1) to (net1);
-    \draw [->] (net1) to (out1);
+    \draw [->,line width=1pt] (net1) to (out1);
    \node [font=\small] (label1) at ([yshift=0.6cm]out1.north) {Softmax};
 \end{scope}
@@ -51,8 +51,8 @@
    \path [fill=blue!20!white,draw=white] (out2.north west) -- (prob1.south west) -- (prob9.south east) -- (out2.north east) -- (out2.north west);
-    \draw [->] (input2) to (net2);
+    \draw [->,line width=1pt] (input2) to (net2);
-    \draw [->] (net2) to (out2);
+    \draw [->,line width=1pt] (net2) to (out2);
    \node [font=\small] (label2) at ([yshift=0.6cm]out2.north) {Softmax};
@@ -60,9 +60,9 @@
    \node [anchor=north,font=\scriptsize] (input3) at ([yshift=-0.5cm]net3.south) {源语};
    \node [anchor=south,layer,align=center,font=\scriptsize,fill=yellow!10!white] (out3) at ([yshift=0.9cm]net3.north) {Candidate\\List};
-    \draw [->] (input3) to (net3);
+    \draw [->,line width=1pt] (input3) to (net3);
-    \draw [->] (net3) to (out3);
+    \draw [->,line width=1pt] (net3) to (out3);
-    \draw [->] (out3) |- (plabel9.east);
+    \draw [->,line width=1pt] (out3) |- (plabel9.east);
 \end{scope}
 \node [anchor=north,font=\scriptsize] () at ([yshift=-0.2em]STANDARD.south) {(a) 标准方法};

--- a/Book/Chapter7/Figures/dynamic-linear-aggregation-network-structure.tex
+++ b/Book/Chapter7/Figures/dynamic-linear-aggregation-network-structure.tex
--- a/Book/Chapter7/Figures/figure-encoder-fin.tex
+++ b/Book/Chapter7/Figures/figure-encoder-fin.tex
@@ -50,8 +50,8 @@
 \node [rectangle,inner sep=1em,fill=black!5,rounded corners=4pt] [fit =(w4) (w6) (w9) (encoder0) ] (box) {};
 \end{pgfonlayer}
-\node [] (left) at ([yshift=-1.5em]box.south) {编码器使用单语数据预训练};
+\node [font=\footnotesize] (left) at ([yshift=-1.5em]box.south) {编码器使用单语数据预训练};
-\node [] (right) at ([xshift=11em]left.east) {在翻译任务上进行微调};
+\node [font=\footnotesize] (right) at ([xshift=11em]left.east) {在翻译任务上进行微调};
 \node[anchor=north] (arrow1) at (3.85,0.1){};

--- a/Book/Chapter7/Figures/figure-example-of-iterative-back-translation.tex
+++ b/Book/Chapter7/Figures/figure-example-of-iterative-back-translation.tex
 \begin{tikzpicture}
 \begin{scope}
-\node [anchor=center] (node1) at (-2.6,1) {\small{训练：}};
+\node [anchor=center] (node1) at (9.6,1) {\small{训练：}};
-\node [anchor=center] (node11) at (-2.2,1) {};
+\node [anchor=center] (node11) at (10.2,1) {};
-\node [anchor=center] (node12) at (-1.1,1) {};
+\node [anchor=center] (node12) at (11.4,1) {};
-\node [anchor=center] (node2) at (-2.6,0.5) {\small{推理：}};
+\node [anchor=center] (node2) at (9.6,0.5) {\small{推理：}};
-\node [anchor=center] (node21) at (-2.2,0.5) {};
+\node [anchor=center] (node21) at (10.2,0.5) {};
-\node [anchor=center] (node22) at (-1.1,0.5) {};
+\node [anchor=center] (node22) at (11.4,0.5) {};
 \node [anchor=west,draw=black,line width=0.6pt,minimum width=5.6em,minimum height=2.2em,fill=blue!20,rounded corners=2pt] (node1-1) at (0,0) {\footnotesize{双语数据}};
 \node [anchor=south,draw=black,line width=0.6pt,minimum width=4.5em,minimum height=2.2em,fill=blue!20,rounded corners=2pt] (node1-2) at ([yshift=-5em]node1-1.south) {\footnotesize{目标语伪数据}};
 \node [anchor=west,draw=black,line width=0.6pt,minimum width=4.5em,minimum height=2.2em,fill=red!20,rounded corners=2pt] (node2-1) at ([xshift=-7.7em,yshift=-2.5em]node1-1.west) {\footnotesize{前向NMT系统}};

--- a/Book/Chapter7/Figures/expanded-residual-network.tex
+++ b/Book/Chapter7/Figures/expanded-residual-network.tex
--- a/Book/Chapter7/Figures/learning-rate.tex
+++ b/Book/Chapter7/Figures/learning-rate.tex
--- a/Book/Chapter7/Figures/Post-Norm-vs-Pre-Norm.tex
+++ b/Book/Chapter7/Figures/Post-Norm-vs-Pre-Norm.tex
--- a/Book/Chapter7/Figures/progressive-training.tex
+++ b/Book/Chapter7/Figures/progressive-training.tex
--- a/Book/Chapter7/Figures/sparse-connections-between-different-groups.tex
+++ b/Book/Chapter7/Figures/sparse-connections-between-different-groups.tex
--- a/Book/bibliography.bib
+++ b/Book/bibliography.bib
@@ -685,6 +685,12 @@
  year ={2016},
  publisher ={清华大学出版社}
 }
+@book{李航2012统计学习方法,
+  title ={统计学习方法},
+  author ={李航},
+  year ={2012},
+  publisher ={清华大学出版社}
+}
 @book{宗成庆2013统计自然语言处理,
  title ={统计自然语言处理},
  author ={宗成庆},

--- a/Book/mt-book-xelatex.tex
+++ b/Book/mt-book-xelatex.tex
@@ -122,13 +122,13 @@
 %	CHAPTERS
 %----------------------------------------------------------------------------------------
-\include{Chapter1/chapter1}
+%\include{Chapter1/chapter1}
 %\include{Chapter2/chapter2}
 %\include{Chapter3/chapter3}
 %\include{Chapter4/chapter4}
 %\include{Chapter5/chapter5}
 %\include{Chapter6/chapter6}
-%\include{Chapter7/chapter7}
+\include{Chapter7/chapter7}
 %\include{ChapterAppend/chapterappend}