Merge branch 'caorunzhe' into 'mengxia'

Caorunzhe. See merge request !216

Commit 0415f784 authored Sep 17, 2020 by 孟霞
--- a/Chapter1/Figures/example-of-source-structure.tex
+++ b/Chapter1/Figures/example-of-source-structure.tex
@@ -2,7 +2,7 @@
 %%%  句法树(层次短语)
 \begin{tikzpicture}
 {\small
-\begin{scope}[sibling distance=15pt, level distance = 20pt]
+\begin{scope}[sibling distance=25pt, level distance = 20pt]
 {\scriptsize
 \Tree[.\node(r){IP};
        [.\node(n11){NP}; [.\node(n21){PN};  [.\node(l1){她};]]]

--- a/Chapter1/chapter1.tex
+++ b/Chapter1/chapter1.tex
@@ -504,15 +504,15 @@
 %----------------------------------------------------------------------------------------
 \subsection{经典书籍}
-\parinterval 首先，推荐一本书$Statistical\ Machine\ Translation$\upcite{koehn2009statistical}，其作者是机器翻译领域著名学者Philipp Koehn教授。该书是机器翻译领域内的经典之作，介绍了统计机器翻译技术的进展。该书从语言学和概率学两个方面介绍了统计机器翻译的构成要素，然后介绍了统计机器翻译的主要模型：基于词、基于短语和基于树的模型，以及机器翻译评价、语言建模、判别式训练等方法。此外，作者在该书的最新版本中增加了神经机器翻译的章节，方便研究人员全面了解机器翻译的最新发展趋势\upcite{DBLP:journals/corr/abs-1709-07809}。
+\parinterval 首先，推荐一本书\emph{Statistical Machine Translation}\upcite{koehn2009statistical}，其作者是机器翻译领域著名学者Philipp Koehn教授。该书是机器翻译领域内的经典之作，介绍了统计机器翻译技术的进展。该书从语言学和概率学两个方面介绍了统计机器翻译的构成要素，然后介绍了统计机器翻译的主要模型：基于词、基于短语和基于树的模型，以及机器翻译评价、语言建模、判别式训练等方法。此外，作者在该书的最新版本中增加了神经机器翻译的章节，方便研究人员全面了解机器翻译的最新发展趋势\upcite{DBLP:journals/corr/abs-1709-07809}。
-\parinterval $Foundations\ of\ Statistical\ Natural\ Language\ Processing$\upcite{manning1999foundations}中文译名《统计自然语言处理基础》，作者是自然语言处理领域的权威Chris Manning教授和Hinrich Sch$\ddot{\textrm{u}}$tze教授。该书对统计自然语言处理方法进行了全面介绍。书中讲解了统计自然语言处理所需的语言学和概率论基础知识，介绍了机器翻译评价、语言建模、判别式训练以及整合语言学信息等基础方法。其中也包含了构建自然语言处理工具所需的基本理论和算法，并且涵盖了数学和语言学基础内容以及相关的统计方法。
+\parinterval \emph{Foundations of Statistical Natural Language Processing}\upcite{manning1999foundations}中文译名《统计自然语言处理基础》，作者是自然语言处理领域的权威Chris Manning教授和Hinrich Sch$\ddot{\textrm{u}}$tze教授。该书对统计自然语言处理方法进行了全面介绍。书中讲解了统计自然语言处理所需的语言学和概率论基础知识，介绍了机器翻译评价、语言建模、判别式训练以及整合语言学信息等基础方法。其中也包含了构建自然语言处理工具所需的基本理论和算法，并且涵盖了数学和语言学基础内容以及相关的统计方法。
 \parinterval 《统计自然语言处理（第2版）》\upcite{宗成庆2013统计自然语言处理}由中国科学院自动化所宗成庆教授所著。该书中系统介绍了统计自然语言处理的基本概念、理论方法和最新研究进展，既有对基础知识和理论模型的介绍，也有对相关问题的研究背景、实现方法和技术现状的详细阐述。可供从事自然语言处理、机器翻译等研究的相关人员参考。
 \parinterval  由Ian Goodfellow、Yoshua Bengio、Aaron Courville三位机器学习领域的学者所写的\emph{Deep Learning}\upcite{Goodfellow-et-al-2016}也是值得一读的参考书。其讲解了有关深度学习常用的方法，其中很多都会在深度学习模型设计和使用中用到。同时在该书的应用一章中也简单讲解了神经机器翻译的任务定义和发展过程。
-\parinterval $Neural\ Network\ Methods\ for\ Natural\ Language\ Processing$\upcite{goldberg2017neural}是Yoav Goldberg编写的面向自然语言处理的深度学习参考书。相比\emph{Deep Learning}，该书聚焦在自然语言处理中的深度学习方法，内容更加易读，非常适合刚入门自然语言处理及深度学习应用的人员参考。
+\parinterval \emph{Neural Network Methods for Natural Language Processing}\upcite{goldberg2017neural}是Yoav Goldberg编写的面向自然语言处理的深度学习参考书。相比\emph{Deep Learning}，该书聚焦在自然语言处理中的深度学习方法，内容更加易读，非常适合刚入门自然语言处理及深度学习应用的人员参考。
 \parinterval 《机器学习》\upcite{周志华2016机器学习}由南京大学周志华教授所著，作为机器学习领域入门教材，该书尽可能地涵盖了机器学习基础知识的各个方面，试图尽可能少地使用数学知识介绍机器学习方法与思想。

--- a/Chapter10/Figures/figure-3-base-problom-of-p.tex
+++ b/Chapter10/Figures/figure-3-base-problom-of-p.tex
@@ -15,9 +15,9 @@
 					\node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=west] (eemb\x) at ([xshift=0.4\base]eemb\y.east) {\tiny{$e_x()$}};
 				\foreach \x in {1,2,...,3}
 					\node[rnnnode,fill=blue!30!white,anchor=south] (enc\x) at ([yshift=0.3\base]eemb\x.north) {};
-			        \node[] (enclabel1) at (enc1) {\tiny{$\vectorn{h}_{m-2}$}};
+			        \node[] (enclabel1) at (enc1) {\tiny{$\vectorn{\emph{h}}_{m-2}$}};
-			        \node[] (enclabel2) at (enc2) {\tiny{$\vectorn{h}_{m-1}$}};
+			        \node[] (enclabel2) at (enc2) {\tiny{$\vectorn{\emph{h}}_{m-1}$}};
-			        \node[rnnnode,fill=purple!30!white] (enclabel3) at (enc3) {\tiny{$\vectorn{h}_{m}$}};
+			        \node[rnnnode,fill=purple!30!white] (enclabel3) at (enc3) {\tiny{$\vectorn{\emph{h}}_{m}$}};
 				\node[wordnode,left=0.4\base of enc1] (init1) {$\cdots$};
 				\node[wordnode,left=0.4\base of eemb1] (init2) {$\cdots$};
@@ -29,7 +29,7 @@
 				\foreach \x in {1,2,...,3}
 					\node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=south] (demb\x) at ([yshift=\base]enc\x.north) {\tiny{$e_y()$}};
 				\foreach \x in {1,2,...,3}
-					\node[rnnnode,fill=blue!30!white,anchor=south] (dec\x) at ([yshift=0.3\base]demb\x.north) {{\tiny{$\vectorn{s}_\x$}}};
+					\node[rnnnode,fill=blue!30!white,anchor=south] (dec\x) at ([yshift=0.3\base]demb\x.north) {{\tiny{$\vectorn{\emph{s}}_\x$}}};
 				\foreach \x in {1,2,...,3}
 					\node[rnnnode,minimum height=0.5\base,fill=red!30!white,anchor=south] (softmax\x) at ([yshift=0.3\base]dec\x.north) {\tiny{Softmax}};
 				\node[wordnode,right=0.4\base of demb3] (end1) {$\cdots$};
@@ -73,10 +73,10 @@
 				\draw[-latex'] (enc3.north) .. controls +(north:0.3\base) and +(east:\base) .. (bridge) .. controls +(west:2.7\base) and +(west:0.3\base) .. (dec1.west);
 				{
-				\node [anchor=east] (line1) at ([xshift=-3em,yshift=0.5em]softmax1.west) {\scriptsize{基于RNN的隐层状态$\vectorn{s}_i$}};
+				\node [anchor=east] (line1) at ([xshift=-3em,yshift=0.5em]softmax1.west) {\scriptsize{基于RNN的隐层状态$\vectorn{\emph{s}}_i$}};
 				\node [anchor=north west] (line2) at ([yshift=0.3em]line1.south west) {\scriptsize{预测目标词的概率}};
 				\node [anchor=north west] (line3) at ([yshift=0.3em]line2.south west) {\scriptsize{通常，用Softmax函数}};
-				\node [anchor=north west] (line4) at ([yshift=0.3em]line3.south west) {\scriptsize{实现 $\textrm{P}(y_i|...)$}};
+				\node [anchor=north west] (line4) at ([yshift=0.3em]line3.south west) {\scriptsize{实现 $\funp{P}(y_i|...)$}};
 				}
 				{
@@ -90,7 +90,7 @@
 				\node [anchor=west] (line21) at ([xshift=1.3em,yshift=1.5em]enc3.east)  {\scriptsize{源语编码器最后一个}};
 				\node [anchor=north west] (line22) at ([yshift=0.3em]line21.south west) {\scriptsize{循环单元的输出被}};
 				\node [anchor=north west] (line23) at ([yshift=0.3em]line22.south west) {\scriptsize{看作是句子的表示,}};
-				\node [anchor=north west] (line24) at ([yshift=0.3em]line23.south west) {\scriptsize{记为$\vectorn{C}$}};
+				\node [anchor=north west] (line24) at ([yshift=0.3em]line23.south west) {\scriptsize{记为$\vectorn{\emph{C}}$}};
 				}
 				\begin{pgfonlayer}{background}

--- a/Chapter10/Figures/figure-beam-search-process.tex
+++ b/Chapter10/Figures/figure-beam-search-process.tex
@@ -20,13 +20,13 @@
 \node [anchor=west,inner sep=2pt] (t4) at ([xshift=0.3em]t3.east) {\scriptsize{...}};
 }
 {
-\node [rnnnode,anchor=south] (s1) at ([yshift=1em]t1.north) {\scriptsize{$\textbf{s}_1$}};
+\node [rnnnode,anchor=south] (s1) at ([yshift=1em]t1.north) {\scriptsize{$\vectorn{\emph{s}}_1$}};
 }
 {
-\node [rnnnode,anchor=south] (s2) at ([yshift=1em]t2.north) {\scriptsize{$\textbf{s}_2$ ($\times 3$)}};
+\node [rnnnode,anchor=south] (s2) at ([yshift=1em]t2.north) {\scriptsize{$\vectorn{\emph{s}}_2$ ($\times 3$)}};
 }
 {
-\node [rnnnode,anchor=south] (s3) at ([yshift=1em]t3.north) {\scriptsize{$\textbf{s}_3$ ($\times 3$)}};
+\node [rnnnode,anchor=south] (s3) at ([yshift=1em]t3.north) {\scriptsize{$\vectorn{\emph{s}}_3$ ($\times 3$)}};
 \node [anchor=west,inner sep=2pt] (s4) at ([xshift=0.3em]s3.east) {\scriptsize{...}};
 }
 {
@@ -121,17 +121,17 @@
 }
 {
-\node [circle,draw,anchor=north,inner sep=2pt,fill=orange!20,text=orange!20] (c2) at ([yshift=-2.5em]t1.south) {\scriptsize{$\textbf{C}_2$}};
+\node [circle,draw,anchor=north,inner sep=2pt,fill=orange!20,text=orange!20] (c2) at ([yshift=-2.5em]t1.south) {\scriptsize{$\vectorn{\emph{C}}_2$}};
-\node [circle,draw,inner sep=2pt,fill=orange!20] (c2copy1) at ([yshift=-0.1em,xshift=-0.1em]c2) {\scriptsize{$\textbf{C}_2$}};
+\node [circle,draw,inner sep=2pt,fill=orange!20] (c2copy1) at ([yshift=-0.1em,xshift=-0.1em]c2) {\scriptsize{$\vectorn{\emph{C}}_2$}};
-\node [circle,draw,inner sep=2pt,fill=orange!20] (c2copy2) at ([yshift=-0.2em,xshift=-0.2em]c2) {\scriptsize{$\textbf{C}_2$}};
+\node [circle,draw,inner sep=2pt,fill=orange!20] (c2copy2) at ([yshift=-0.2em,xshift=-0.2em]c2) {\scriptsize{$\vectorn{\emph{C}}_2$}};
 \draw [->] ([xshift=-0.9em]c2.west) -- ([xshift=-0.3em]c2.west);
 \draw [->] ([xshift=0.1em]c2.east) .. controls +(east:1.5) and +(west:0.8) ..([yshift=-0.3em,xshift=-0.1em]s2.west);
 }
 {
-\node [circle,draw,anchor=north,inner sep=2pt,fill=orange!20,text=orange!20] (c3) at ([yshift=-2.5em]t2.south) {\scriptsize{$\textbf{C}_3$}};
+\node [circle,draw,anchor=north,inner sep=2pt,fill=orange!20,text=orange!20] (c3) at ([yshift=-2.5em]t2.south) {\scriptsize{$\vectorn{\emph{C}}_3$}};
-\node [circle,draw,inner sep=2pt,fill=orange!20,text=orange!20] (c3copy1) at ([yshift=-0.1em,xshift=-0.1em]c3) {\scriptsize{$\textbf{C}_3$}};
+\node [circle,draw,inner sep=2pt,fill=orange!20,text=orange!20] (c3copy1) at ([yshift=-0.1em,xshift=-0.1em]c3) {\scriptsize{$\vectorn{\emph{C}}_3$}};
-\node [circle,draw,inner sep=2pt,fill=orange!20] (c3copy2) at ([yshift=-0.2em,xshift=-0.2em]c3) {\scriptsize{$\textbf{C}_3$}};
+\node [circle,draw,inner sep=2pt,fill=orange!20] (c3copy2) at ([yshift=-0.2em,xshift=-0.2em]c3) {\scriptsize{$\vectorn{\emph{C}}_3$}};
 \draw [->] ([xshift=-0.9em]c3.west) -- ([xshift=-0.3em]c3.west);
 \draw [->] ([xshift=0.1em]c3.east) .. controls +(east:1.5) and +(west:0.8) ..([yshift=-0.3em,xshift=-0.1em]s3.west);
 }

--- a/Chapter10/Figures/figure-bi-rnn.tex
+++ b/Chapter10/Figures/figure-bi-rnn.tex
@@ -132,7 +132,7 @@
            \begin{pgfonlayer}{background}
                \node[draw=red,thick,densely dashed,inner sep=5pt] [fit = (backinit) (backenc1) (backenc10)] (backrnn) {};
            \end{pgfonlayer}
-            \node[font=\scriptsize,anchor=south] (backrnnlabel) at ([xshift=-0.5\base,yshift=\base]backrnn.north east) {反向RNN};
+            \node[font=\scriptsize,anchor=south] (backrnnlabel) at ([xshift=-0.5\base,yshift=\base]backrnn.north east) {反向};
            \draw[->,dashed] (backrnnlabel.south) to ([xshift=-0.5\base]backrnn.north east);
        \end{scope}
    \end{tikzpicture}
\ No newline at end of file
--- a/Chapter10/Figures/figure-decode-the-word-probability-distribution-at-the-first-position.tex
+++ b/Chapter10/Figures/figure-decode-the-word-probability-distribution-at-the-first-position.tex
@@ -13,7 +13,7 @@
 }
 {
-\node [rnnnode,anchor=south] (s1) at ([yshift=1em]t1.north) {\scriptsize{$\textbf{s}_1$}};
+\node [rnnnode,anchor=south] (s1) at ([yshift=1em]t1.north) {\scriptsize{$\vectorn{\emph{s}}_1$}};
 }
 \node [wnode,anchor=north] (wt1) at ([yshift=-0.8em]t1.south) {\scriptsize{$\langle$sos$\rangle$}};

--- a/Chapter10/Figures/figure-decoding-process-based-on-greedy-method.tex
+++ b/Chapter10/Figures/figure-decoding-process-based-on-greedy-method.tex
@@ -2,9 +2,9 @@
 \begin{scope}
 \tikzstyle{rnnnode} = [minimum height=1.1em,minimum width=2.1em,inner sep=2pt,rounded corners=1pt,draw,fill=red!20];
-\node [rnnnode,anchor=west] (h1) at (0,0) {\tiny{$\textbf{h}_1$}};
+\node [rnnnode,anchor=west] (h1) at (0,0) {\tiny{$\vectorn{\emph{h}}_1$}};
 \node [anchor=west] (h2) at ([xshift=1em]h1.east) {\tiny{...}};
-\node [rnnnode,anchor=west] (h3) at ([xshift=1em]h2.east) {\tiny{$\textbf{h}_m$}};
+\node [rnnnode,anchor=west] (h3) at ([xshift=1em]h2.east) {\tiny{$\vectorn{\emph{h}}_m$}};
 \node [rnnnode,anchor=north,fill=green!20] (e1) at ([yshift=-1em]h1.south) {\tiny{$e_x()$}};
 \node [anchor=west] (e2) at ([xshift=1em]e1.east) {\tiny{...}};
 \node [rnnnode,anchor=west,fill=green!20] (e3) at ([xshift=1em]e2.east) {\tiny{$e_x()$}};
@@ -19,7 +19,7 @@
 \draw [->] ([xshift=0.1em]h1.east) -- ([xshift=-0.1em]h2.west);
 \draw [->] ([xshift=0.1em]h2.east) -- ([xshift=-0.1em]h3.west);
 \draw [->] ([xshift=-0.8em]h1.west) -- ([xshift=-0.1em]h1.west) node [pos=0,left,inner sep=2pt] {\tiny{0}};
-\node [anchor=south] (encoder) at ([xshift=-0.2em]h1.north west) {\scriptsize{\textbf{编码器}}};
+\node [anchor=south] (encoder) at ([xshift=-0.2em]h1.north west) {\scriptsize{\vectorn{编码器}}};
 {
 \node [rnnnode,anchor=west,fill=green!20] (t1) at ([xshift=3em]h3.east) {\tiny{$e_y()$}};
@@ -33,14 +33,14 @@
 \node [anchor=west,inner sep=2pt] (t5) at ([xshift=0.3em]t4.east) {\tiny{...}};
 }
 {
-\node [rnnnode,anchor=south] (s1) at ([yshift=1em]t1.north) {\tiny{$\textbf{s}_1$}};
+\node [rnnnode,anchor=south] (s1) at ([yshift=1em]t1.north) {\tiny{$\vectorn{\emph{s}}_1$}};
 }
 {
-\node [rnnnode,anchor=south] (s2) at ([yshift=1em]t2.north) {\tiny{$\textbf{s}_2$}};
+\node [rnnnode,anchor=south] (s2) at ([yshift=1em]t2.north) {\tiny{$\vectorn{\emph{s}}_2$}};
 }
 {
-\node [rnnnode,anchor=south] (s3) at ([yshift=1em]t3.north) {\tiny{$\textbf{s}_3$}};
+\node [rnnnode,anchor=south] (s3) at ([yshift=1em]t3.north) {\tiny{$\vectorn{\emph{s}}_3$}};
-\node [rnnnode,anchor=south] (s4) at ([yshift=1em]t4.north) {\tiny{$\textbf{s}_4$}};
+\node [rnnnode,anchor=south] (s4) at ([yshift=1em]t4.north) {\tiny{$\vectorn{\emph{s}}_4$}};
 \node [anchor=west,inner sep=2pt] (s5) at ([xshift=0.3em]s4.east) {\tiny{...}};
 }
 {
@@ -131,7 +131,7 @@
 }
 {
-\node [circle,draw,anchor=south,inner sep=3pt,fill=orange!20] (c2) at ([yshift=2em]h2.north) {\tiny{$\textbf{C}_2$}};
+\node [circle,draw,anchor=south,inner sep=3pt,fill=orange!20] (c2) at ([yshift=2em]h2.north) {\tiny{$\vectorn{\emph{C}}_2$}};
 \node [anchor=south] (c2label) at (c2.north) {\tiny{\textbf{注意力机制：上下文}}};
 \node [anchor=south] (c2more) at ([yshift=-1.5em]c2.south) {\tiny{...}};
 \draw [->] (h1.north) .. controls +(north:0.6) and +(250:0.9) .. (c2.250);
@@ -143,12 +143,12 @@
 }
 {
-\node [circle,draw,anchor=north,inner sep=3pt,fill=orange!20] (c3) at ([yshift=-2em]t2.south) {\tiny{$\textbf{C}_3$}};
+\node [circle,draw,anchor=north,inner sep=3pt,fill=orange!20] (c3) at ([yshift=-2em]t2.south) {\tiny{$\vectorn{\emph{C}}_3$}};
 \draw [->] ([xshift=-0.7em]c3.west) -- ([xshift=-0.1em]c3.west);
 \draw [->] ([xshift=0.1em]c3.east) .. controls +(east:0.6) and +(west:0.8) ..([yshift=-0.3em,xshift=-0.1em]s3.west);
 }
 {
-\node [circle,draw,anchor=north,inner sep=3pt,fill=orange!20] (c4) at ([yshift=-2em]t3.south) {\tiny{$\textbf{C}_4$}};
+\node [circle,draw,anchor=north,inner sep=3pt,fill=orange!20] (c4) at ([yshift=-2em]t3.south) {\tiny{$\vectorn{\emph{C}}_4$}};
 \draw [->] ([xshift=-0.7em]c4.west) -- ([xshift=-0.1em]c4.west);
 \draw [->] ([xshift=0.1em]c4.east) .. controls +(east:0.6) and +(west:0.8) ..([yshift=-0.3em,xshift=-0.1em]s4.west);
 }

--- a/Chapter10/Figures/figure-double-layer-rnn.tex
+++ b/Chapter10/Figures/figure-double-layer-rnn.tex
@@ -131,7 +131,7 @@
                \node[draw=red,thick,densely dashed,inner sep=5pt] [fit = (init2) (enc21) (enc210)] (enc2) {};
                \node[draw=red,thick,densely dashed,inner sep=5pt] [fit = (dec21) (dec210)] (dec2) {};
            \end{pgfonlayer}
-            \node[font=\scriptsize,anchor=west] (label) at ([xshift=0.4\base]demb10.east) {堆叠RNN};
+            \node[font=\scriptsize,anchor=west] (label) at ([xshift=0.4\base]demb10.east) {堆叠};
            \draw[->,dashed] (label.north) to (dec2.east);
            \draw[->,dashed] (label.south) to (enc2.east);

--- a/Chapter10/Figures/figure-encoder-decoder-process.tex
+++ b/Chapter10/Figures/figure-encoder-decoder-process.tex
@@ -2,9 +2,9 @@
 \begin{scope}
 \small{
-\node [anchor=south west,minimum width=15em] (source) at (0,0) {\textbf{源语}: 我\ \ \ \ 对\ \ \ \ 你\ \ \ \ 感到\ \ \ \ 满意};
+\node [anchor=south west,minimum width=15em] (source) at (0,0) {\textbf{源语言}: 我\ \ \ \ 对\ \ \ \ 你\ \ \ \ 感到\ \ \ \ 满意};
 {
-\node [anchor=south west,minimum width=15em] (target) at ([yshift=12em]source.north west) {\textbf{目标语}: I\ \ am\ \ \ satisfied\ \ \ with\ \ \ you};
+\node [anchor=south west,minimum width=15em] (target) at ([yshift=12em]source.north west) {\textbf{目标语言}: I\ \ am\ \ \ satisfied\ \ \ with\ \ \ you};
 }
 {
 \node [anchor=center,minimum width=9.6em,minimum height=1.8em,draw,rounded corners=0.3em] (hidden) at ([yshift=6em]source.north) {};
@@ -24,7 +24,7 @@
 \node [anchor=west,minimum width=1.5em,minimum size=1.5em] (cell08) at (cell06.east){\small{
 \hspace{0.6em}
 \begin{tabular}{l}
-源语句子的“表示”
+源语言句子的“表示”
 \end{tabular}
 }
 };
@@ -47,10 +47,10 @@
 }
 {
-\node [anchor=south] (enclabel) at ([yshift=2em]source.north) {\small{\textbf{Encoder}}};
+\node [anchor=south] (enclabel) at ([yshift=2em]source.north) {\small{\textbf{编码器（Encoder）}}};
-\node [anchor=north] (declabel) at ([yshift=-2em]target.south) {\small{\textbf{Decoder}}};
+\node [anchor=north] (declabel) at ([yshift=-2em]target.south) {\small{\textbf{解码器（Decoder）}}};
 }

--- a/Chapter10/Figures/figure-gru01.tex
+++ b/Chapter10/Figures/figure-gru01.tex
@@ -78,8 +78,8 @@
        \end{scope}
        \begin{scope}
-            \node[wordnode,anchor=south] () at (aux71) {$\vectorn{h}_{t-1}$};
+            \node[wordnode,anchor=south] () at (aux71) {$\vectorn{\emph{h}}_{t-1}$};
-            \node[wordnode,anchor=west] () at (aux12) {$\vectorn{x}_t$};
+            \node[wordnode,anchor=west] () at (aux12) {$\vectorn{\emph{x}}_t$};
        \end{scope}

--- a/Chapter10/Figures/figure-gru02.tex
+++ b/Chapter10/Figures/figure-gru02.tex
@@ -91,8 +91,8 @@
        \end{scope}
        \begin{scope}
-            \node[wordnode,anchor=south] () at (aux71) {$\vectorn{h}_{t-1}$};
+            \node[wordnode,anchor=south] () at (aux71) {$\vectorn{\emph{h}}_{t-1}$};
-            \node[wordnode,anchor=west] () at (aux12) {$\vectorn{x}_t$};
+            \node[wordnode,anchor=west] () at (aux12) {$\vectorn{\emph{x}}_t$};
        \end{scope}

--- a/Chapter10/Figures/figure-gru03.tex
+++ b/Chapter10/Figures/figure-gru03.tex
@@ -109,11 +109,11 @@
        \end{scope}
        \begin{scope}
-             \node[wordnode,anchor=south] () at (aux71) {$\vectorn{h}_{t-1}$};
+             \node[wordnode,anchor=south] () at (aux71) {$\vectorn{\emph{h}}_{t-1}$};
-            \node[wordnode,anchor=west] () at (aux12) {$\vectorn{x}_t$};
+            \node[wordnode,anchor=west] () at (aux12) {$\vectorn{\emph{x}}_t$};
            {
-                \node[wordnode,anchor=east] () at (aux87) {$\vectorn{h}_{t}$};
+                \node[wordnode,anchor=east] () at (aux87) {$\vectorn{\emph{h}}_{t}$};
-                \node[wordnode,anchor=south] () at (aux78) {$\vectorn{h}_{t}$};
+                \node[wordnode,anchor=south] () at (aux78) {$\vectorn{\emph{h}}_{t}$};
            }
        \end{scope}

--- a/Chapter10/Figures/figure-lstm01.tex
+++ b/Chapter10/Figures/figure-lstm01.tex
@@ -84,9 +84,9 @@
        \end{scope}
        \begin{scope}
-            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux21) {$\vectorn{h}_{t-1}$};
+            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux21) {$\vectorn{\emph{h}}_{t-1}$};
-            \node[wordnode,anchor=west] () at (aux12) {$\vectorn{x}_t$};
+            \node[wordnode,anchor=west] () at (aux12) {$\vectorn{\emph{x}}_t$};
-            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux51) {$\vectorn{c}_{t-1}$};
+            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux51) {$\vectorn{\emph{c}}_{t-1}$};
        \end{scope}

--- a/Chapter10/Figures/figure-lstm02.tex
+++ b/Chapter10/Figures/figure-lstm02.tex
@@ -99,9 +99,9 @@
         \end{scope}
        \begin{scope}
-            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux21) {$\vectorn{h}_{t-1}$};
+            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux21) {$\vectorn{\emph{h}}_{t-1}$};
-            \node[wordnode,anchor=west] () at (aux12) {$\vectorn{x}_t$};
+            \node[wordnode,anchor=west] () at (aux12) {$\vectorn{\emph{x}}_t$};
-            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux51) {$\vectorn{c}_{t-1}$};
+            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux51) {$\vectorn{\emph{c}}_{t-1}$};
        \end{scope}

--- a/Chapter10/Figures/figure-lstm03.tex
+++ b/Chapter10/Figures/figure-lstm03.tex
@@ -113,11 +113,11 @@
        \end{scope}
        \begin{scope}
-            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux21) {$\vectorn{h}_{t-1}$};
+            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux21) {$\vectorn{\emph{h}}_{t-1}$};
-            \node[wordnode,anchor=west] () at (aux12) {$\vectorn{x}_t$};
+            \node[wordnode,anchor=west] () at (aux12) {$\vectorn{\emph{x}}_t$};
-            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux51) {$\vectorn{c}_{t-1}$};
+            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux51) {$\vectorn{\emph{c}}_{t-1}$};
            {
-                \node[wordnode,anchor=south] () at ([xshift=-0.5\base]aux59) {$\vectorn{c}_{t}$};
+                \node[wordnode,anchor=south] () at ([xshift=-0.5\base]aux59) {$\vectorn{\emph{c}}_{t}$};
            }
        \end{scope}

--- a/Chapter10/Figures/figure-lstm04.tex
+++ b/Chapter10/Figures/figure-lstm04.tex
@@ -131,15 +131,15 @@
        \end{scope}
        \begin{scope}
-            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux21) {$\vectorn{h}_{t-1}$};
+            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux21) {$\vectorn{\emph{h}}_{t-1}$};
-            \node[wordnode,anchor=west] () at (aux12) {$\vectorn{x}_t$};
+            \node[wordnode,anchor=west] () at (aux12) {$\vectorn{\emph{x}}_t$};
-            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux51) {$\vectorn{c}_{t-1}$};
+            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux51) {$\vectorn{\emph{c}}_{t-1}$};
            {
-                \node[wordnode,anchor=south] () at ([xshift=-0.5\base]aux59) {$\vectorn{c}_{t}$};
+                \node[wordnode,anchor=south] () at ([xshift=-0.5\base]aux59) {$\vectorn{\emph{c}}_{t}$};
            }
            {
-                \node[wordnode,anchor=east] () at (aux68) {$\vectorn{h}_{t}$};
+                \node[wordnode,anchor=east] () at (aux68) {$\vectorn{\emph{h}}_{t}$};
-                \node[wordnode,anchor=south] () at ([xshift=-0.5\base]aux29) {$\vectorn{h}_{t}$};
+                \node[wordnode,anchor=south] () at ([xshift=-0.5\base]aux29) {$\vectorn{\emph{h}}_{t}$};
            }
        \end{scope}

--- a/Chapter10/Figures/figure-output-layer-structur.tex
+++ b/Chapter10/Figures/figure-output-layer-structur.tex
@@ -17,9 +17,9 @@
                    \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=west] (eemb\x) at ([xshift=0.4\base]eemb\y.east) {\tiny{$e_x()$}};
                \foreach \x in {1,2,...,3}
                    \node[rnnnode,fill=blue!30!white,anchor=south] (enc\x) at ([yshift=0.3\base]eemb\x.north) {};
-                    \node[] (enclabel1) at (enc1) {\tiny{$\textbf{h}_{m-2}$}};
+                    \node[] (enclabel1) at (enc1) {\tiny{$\vectorn{\emph{h}}_{m-2}$}};
-                    \node[] (enclabel2) at (enc2) {\tiny{$\textbf{h}_{m-1}$}};
+                    \node[] (enclabel2) at (enc2) {\tiny{$\vectorn{\emph{h}}_{m-1}$}};
-                    \node[rnnnode,fill=purple!30!white] (enclabel3) at (enc3) {\tiny{$\textbf{h}_{m}$}};
+                    \node[rnnnode,fill=purple!30!white] (enclabel3) at (enc3) {\tiny{$\vectorn{\emph{h}}_{m}$}};
                \node[wordnode,left=0.4\base of enc1] (init1) {$\cdots$};
                \node[wordnode,left=0.4\base of eemb1] (init2) {$\cdots$};
@@ -31,7 +31,7 @@
                \foreach \x in {1,2,...,3}
                    \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=south] (demb\x) at ([yshift=\base]enc\x.north) {\tiny{$e_y()$}};
                \foreach \x in {1,2,...,3}
-                    \node[rnnnode,fill=blue!30!white,anchor=south] (dec\x) at ([yshift=0.3\base]demb\x.north) {{\tiny{$\textbf{s}_\x$}}};
+                    \node[rnnnode,fill=blue!30!white,anchor=south] (dec\x) at ([yshift=0.3\base]demb\x.north) {{\tiny{$\vectorn{\emph{s}}_\x$}}};
                \foreach \x in {1,2,...,3}
                    \node[rnnnode,minimum height=0.5\base,fill=red!30!white,anchor=south] (softmax\x) at ([yshift=0.3\base]dec\x.north) {\tiny{Softmax}};
                \node[wordnode,right=0.4\base of demb3] (end1) {$\cdots$};
@@ -123,13 +123,13 @@
                    \draw [->,thick] ([xshift=0.2em,yshift=0.1em]hidden.north west) -- (target.south west);
                    \draw [->,thick] ([xshift=-0.2em,yshift=0.1em]hidden.north east) -- (target.south east);
-                    \node [anchor=south] () at ([yshift=0.3em]hidden.north) {\scriptsize{$\hat{\vectorn{s}}=\vectorn{Ws}$}};
+                    \node [anchor=south] () at ([yshift=0.3em]hidden.north) {\scriptsize{$\hat{\vectorn{\emph{s}}}_j=\vectorn{\emph{s}}_j \vectorn{\emph{W}}_o$}};
                }
                {
                    \node [rounded corners=0.3em] (softmax) at ([yshift=1.25em]target.north) {\scriptsize{$p(\hat{s}_i)=\frac{e^{\hat{s}_i}}{\sum_j e^{\hat{s}_j}}$}};
                    \filldraw [fill=blue!20,draw=white] ([yshift=0.1em]cell11.north west) {[rounded corners=0.3em] -- (softmax.west)} -- (label1.south west) -- (label8.south east) {[rounded corners=0.3em] -- (softmax.east)} -- ([yshift=0.1em]cell18.north east) -- ([yshift=0.1em]cell11.north west);
-                    \node [rounded corners=0.3em] (softmax) at ([yshift=1.25em]target.north) {\scriptsize{$p(\hat{s}_i)=\frac{e^{\hat{s}_i}}{\sum_j e^{\hat{s}_j}}$}};
+                    \node [rounded corners=0.3em] (softmax) at ([yshift=1.25em]target.north) {\scriptsize{$p(\hat{s}_{jk})=\frac{e^{\hat{s}_{jk}}}{\sum_n e^{\hat{s}_{jn}}}$}};
                }
                \draw [-latex'] ([yshift=-0.3cm]hidden.south) to (hidden.south);
                {

--- a/Chapter10/Figures/figure-query-model-corresponding-to-attention-mechanism.tex
+++ b/Chapter10/Figures/figure-query-model-corresponding-to-attention-mechanism.tex
 \begin{tikzpicture}
 \begin{scope}
 \tikzstyle{rnode} = [draw,minimum width=3.5em,minimum height=1.2em]
-\node [rnode,anchor=south west,fill=red!20!white] (value1) at (0,0) {\scriptsize{$\vectorn{h}(\textrm{“你”})$}};
+\node [rnode,anchor=south west,fill=red!20!white] (value1) at (0,0) {\scriptsize{$\vectorn{h}(\textrm{你})$}};
-\node [rnode,anchor=south west,fill=red!20!white] (value2) at ([xshift=1em]value1.south east) {\scriptsize{$\vectorn{h}(\textrm{“什么”})$}};
+\node [rnode,anchor=south west,fill=red!20!white] (value2) at ([xshift=1em]value1.south east) {\scriptsize{$\vectorn{h}(\textrm{什么})$}};
-\node [rnode,anchor=south west,fill=red!20!white] (value3) at ([xshift=1em]value2.south east) {\scriptsize{$\vectorn{h}(\textrm{“也”})$}};
+\node [rnode,anchor=south west,fill=red!20!white] (value3) at ([xshift=1em]value2.south east) {\scriptsize{$\vectorn{h}(\textrm{也})$}};
-\node [rnode,anchor=south west,fill=red!20!white] (value4) at ([xshift=1em]value3.south east) {\scriptsize{$\vectorn{h}(\textrm{“没”})$}};
+\node [rnode,anchor=south west,fill=red!20!white] (value4) at ([xshift=1em]value3.south east) {\scriptsize{$\vectorn{h}(\textrm{没})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key1) at ([yshift=0.2em]value1.north west) {\scriptsize{$\vectorn{h}(\textrm{“你”})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key1) at ([yshift=0.2em]value1.north west) {\scriptsize{$\vectorn{h}(\textrm{你})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key2) at ([yshift=0.2em]value2.north west) {\scriptsize{$\vectorn{h}(\textrm{“什么”})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key2) at ([yshift=0.2em]value2.north west) {\scriptsize{$\vectorn{h}(\textrm{什么})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key3) at ([yshift=0.2em]value3.north west) {\scriptsize{$\vectorn{h}(\textrm{“也”})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key3) at ([yshift=0.2em]value3.north west) {\scriptsize{$\vectorn{h}(\textrm{也})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key4) at ([yshift=0.2em]value4.north west) {\scriptsize{$\vectorn{h}(\textrm{“没”})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key4) at ([yshift=0.2em]value4.north west) {\scriptsize{$\vectorn{h}(\textrm{没})$}};
-\node [rnode,anchor=east] (query) at ([xshift=-2em]key1.west) {\scriptsize{$\vectorn{s}(\textrm{“you”})$}};
+\node [rnode,anchor=east] (query) at ([xshift=-2em]key1.west) {\scriptsize{$\vectorn{s}(\textrm{you})$}};
 \node [anchor=east] (querylabel) at ([xshift=-0.2em]query.west) {\scriptsize{query}};
 \draw [->] ([yshift=1pt,xshift=6pt]query.north) .. controls +(90:1em) and +(90:1em) .. ([yshift=1pt]key1.north);

--- a/Chapter10/Figures/figure-score-of-mter.tex
+++ b/Chapter10/Figures/figure-score-of-mter.tex
@@ -7,7 +7,7 @@ symbolic x coords={1-15,16-25,26-35,>35},
 xtick=data,
 ytick={6,12,...,28},
 xlabel={句子长度（范围）},
-ylabel={$\%$\footnotesize{mTER}},
+ylabel={\footnotesize{mTER}[\%]},
 xlabel style={align=center},
 ylabel style={},
 y tick style={opacity=0},

--- a/Chapter10/Figures/figure-the-whole-of-lstm.tex
+++ b/Chapter10/Figures/figure-the-whole-of-lstm.tex
@@ -141,15 +141,15 @@
 \end{scope}
 \begin{scope}
-\node[wordnode,anchor=south] () at ([xshift=0.5\base]aux21) {$\vectorn{h}_{t-1}$};
+\node[wordnode,anchor=south] () at ([xshift=0.5\base]aux21) {$\vectorn{\emph{h}}_{t-1}$};
-\node[wordnode,anchor=west] () at (aux12) {$\vectorn{x}_t$};
+\node[wordnode,anchor=west] () at (aux12) {$\vectorn{\emph{x}}_t$};
-\node[wordnode,anchor=south] () at ([xshift=0.5\base]aux51) {$\vectorn{c}_{t-1}$};
+\node[wordnode,anchor=south] () at ([xshift=0.5\base]aux51) {$\vectorn{\emph{c}}_{t-1}$};
 {
-\node[wordnode,anchor=south] () at ([xshift=-0.5\base]aux59) {$\vectorn{c}_{t}$};
+\node[wordnode,anchor=south] () at ([xshift=-0.5\base]aux59) {$\vectorn{\emph{c}}_{t}$};
 }
 {
-\node[wordnode,anchor=east] () at (aux68) {$\vectorn{h}_{t}$};
+\node[wordnode,anchor=east] () at (aux68) {$\vectorn{\emph{h}}_{t}$};
-\node[wordnode,anchor=south] () at ([xshift=-0.5\base]aux29) {$\vectorn{h}_{t}$};
+\node[wordnode,anchor=south] () at ([xshift=-0.5\base]aux29) {$\vectorn{\emph{h}}_{t}$};
 }
 \end{scope}
@@ -170,19 +170,19 @@
 \begin{scope}
 {
 % forget gate formula
-\node[formulanode,anchor=south east,text width=10em] () at ([shift={(4\base,1.5\base)}]aux51) {遗忘门\\$\vectorn{f}_t=\sigma(\vectorn{W}_f[\vectorn{h}_{t-1},\vectorn{x}_t]+\vectorn{b}_f)$};
+\node[formulanode,anchor=south east,text width=10em] () at ([shift={(4\base,1.5\base)}]aux51) {遗忘门\\$\vectorn{\emph{f}}_t=\sigma(\vectorn{\emph{W}}_f[\vectorn{\emph{h}}_{t-1},\vectorn{\emph{x}}_t]+\vectorn{\emph{b}}_f)$};
 }
 {
 % input gate formula
-\node[formulanode,anchor=north east,text width=10em] () at ([shift={(4\base,-1.5\base)}]aux21) {输入门\\$\vectorn{i}_t=\sigma(\vectorn{W}_i[\vectorn{h}_{t-1},\vectorn{x}_t]+\vectorn{b}_i)$\\$\hat{\vectorn{c}}_t=\mathrm{tanh}(\vectorn{W}_c[\vectorn{h}_{t-1},\vectorn{x}_t]+\vectorn{b}_c)$};
+\node[formulanode,anchor=north east,text width=10em] () at ([shift={(4\base,-1.5\base)}]aux21) {输入门\\$\vectorn{\emph{i}}_t=\sigma(\vectorn{\emph{W}}_i[\vectorn{\emph{h}}_{t-1},\vectorn{\emph{x}}_t]+\vectorn{\emph{b}}_i)$\\$\hat{\vectorn{\emph{c}}}_t=\mathrm{tanh}(\vectorn{\emph{W}}_c[\vectorn{\emph{h}}_{t-1},\vectorn{\emph{x}}_t]+\vectorn{\emph{b}}_c)$};
 }
 {
 % cell update formula
-\node[formulanode,anchor=south west,text width=10em] () at ([shift={(-4\base,1.5\base)}]aux59) {记忆更新\\$\vectorn{c}_{t}=\vectorn{f}_t\cdot \vectorn{c}_{t-1}+\vectorn{i}_t\cdot \hat{\vectorn{c}}_t$};
+\node[formulanode,anchor=south west,text width=10em] () at ([shift={(-4\base,1.5\base)}]aux59) {记忆更新\\$\vectorn{\emph{c}}_{t}=\vectorn{\emph{f}}_t\cdot \vectorn{\emph{c}}_{t-1}+\vectorn{\emph{i}}_t\cdot \hat{\vectorn{\emph{c}}}_t$};
 }
 {
 % output gate formula
-\node[formulanode,anchor=north west,text width=10em] () at ([shift={(-4\base,-1.5\base)}]aux29) {输出门\\$\vectorn{o}_t=\sigma(\vectorn{W}_o[\vectorn{h}_{t-1},\vectorn{x}_t]+\vectorn{b}_o)$\\$\vectorn{h}_{t}=\vectorn{o}_t\cdot \mathrm{tanh}(\vectorn{c}_{t})$};
+\node[formulanode,anchor=north west,text width=10em] () at ([shift={(-4\base,-1.5\base)}]aux29) {输出门\\$\vectorn{\emph{o}}_t=\sigma(\vectorn{\emph{W}}_o[\vectorn{\emph{h}}_{t-1},\vectorn{\emph{x}}_t]+\vectorn{\emph{b}}_o)$\\$\vectorn{\emph{h}}_{t}=\vectorn{\emph{o}}_t\cdot \mathrm{tanh}(\vectorn{\emph{c}}_{t})$};
 }
 \end{scope}
 \end{tikzpicture}
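As a reading aid for the four formula nodes above, the following is a minimal sketch of one LSTM step that follows the forget-gate, input-gate, memory-update, and output-gate equations term by term. The helper names (`affine`, `lstm_step`), the `params` dictionary layout, and the use of plain Python lists are assumptions made for illustration; they are not part of the figure or of any particular library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    # W is a list of rows; returns W v + b as a list.
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; `params` (hypothetical layout) holds W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o."""
    hx = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(params["W_f"], hx, params["b_f"])]        # forget gate
    i = [sigmoid(z) for z in affine(params["W_i"], hx, params["b_i"])]        # input gate
    c_hat = [math.tanh(z) for z in affine(params["W_c"], hx, params["b_c"])]  # candidate memory
    o = [sigmoid(z) for z in affine(params["W_o"], hx, params["b_o"])]        # output gate
    # Memory update: c_t = f_t * c_{t-1} + i_t * c_hat_t (element-wise)
    c = [ft * cp + it * ch for ft, cp, it, ch in zip(f, c_prev, i, c_hat)]
    # Hidden state: h_t = o_t * tanh(c_t)
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate memory to 0, so the step simply halves the previous cell state; this is a convenient sanity check for the wiring.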

--- a/Chapter10/Figures/figure-word-embedding-structure.tex
+++ b/Chapter10/Figures/figure-word-embedding-structure.tex
@@ -14,9 +14,9 @@
                    \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=west] (eemb\x) at ([xshift=0.4\base]eemb\y.east) {\tiny{$e_x()$}};
                \foreach \x in {1,2,...,3}
                    \node[rnnnode,fill=blue!30!white,anchor=south] (enc\x) at ([yshift=0.3\base]eemb\x.north) {};
-                    \node[] (enclabel1) at (enc1) {\tiny{$\vectorn{h}_{m-2}$}};
+                    \node[] (enclabel1) at (enc1) {\tiny{$\vectorn{\emph{h}}_{m-2}$}};
-                    \node[] (enclabel2) at (enc2) {\tiny{$\vectorn{h}_{m-1}$}};
+                    \node[] (enclabel2) at (enc2) {\tiny{$\vectorn{\emph{h}}_{m-1}$}};
-                    \node[rnnnode,fill=purple!30!white] (enclabel3) at (enc3) {\tiny{$\vectorn{h}_{m}$}};
+                    \node[rnnnode,fill=purple!30!white] (enclabel3) at (enc3) {\tiny{$\vectorn{\emph{h}}_{m}$}};
                \node[wordnode,left=0.4\base of enc1] (init1) {$\cdots$};
                \node[wordnode,left=0.4\base of eemb1] (init2) {$\cdots$};
@@ -28,7 +28,7 @@
                \foreach \x in {1,2,...,3}
                    \node[rnnnode,minimum height=0.5\base,fill=green!30!white,anchor=south] (demb\x) at ([yshift=\base]enc\x.north) {\tiny{$e_y()$}};
                \foreach \x in {1,2,...,3}
-                    \node[rnnnode,fill=blue!30!white,anchor=south] (dec\x) at ([yshift=0.3\base]demb\x.north) {{\tiny{$\vectorn{s}_\x$}}};
+                    \node[rnnnode,fill=blue!30!white,anchor=south] (dec\x) at ([yshift=0.3\base]demb\x.north) {{\tiny{$\vectorn{\emph{s}}_\x$}}};
                \foreach \x in {1,2,...,3}
                    \node[rnnnode,minimum height=0.5\base,fill=red!30!white,anchor=south] (softmax\x) at ([yshift=0.3\base]dec\x.north) {\tiny{Softmax}};
                \node[wordnode,right=0.4\base of demb3] (end1) {$\cdots$};

--- a/Chapter10/chapter10.tex
+++ b/Chapter10/chapter10.tex
--- a/Chapter12/Figures/dog-hat.jpg
+++ b/Chapter12/Figures/dog-hat.jpg
--- a/Chapter12/Figures/figure-a-combination-of-position-encoding-and-word-encoding.tex
+++ b/Chapter12/Figures/figure-a-combination-of-position-encoding-and-word-encoding.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{rnode} = [draw,minimum width=3.5em,minimum height=1.2em]
+\node [rnode,anchor=south west,fill=red!20!white] (e1) at (0,0) {\scriptsize{$e(\textrm{沈阳})$}};
+\node [rnode,anchor=south west,fill=red!20!white] (e2) at ([xshift=1em]e1.south east) {\scriptsize{$e(\textrm{到})$}};
+\node [rnode,anchor=south west,fill=red!20!white] (e3) at ([xshift=1em]e2.south east) {\scriptsize{$e(\textrm{广州})$}};
+\node [rnode,anchor=south west,fill=red!20!white] (e4) at ([xshift=1em]e3.south east) {\scriptsize{$e(\textrm{的})$}};
+\node [rnode,anchor=south west,fill=red!20!white] (e5) at ([xshift=1em]e4.south east) {\scriptsize{$e(\textrm{机票})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (h1) at ([yshift=1.5em]e1.north west) {\scriptsize{$h(\textrm{沈阳})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (h2) at ([yshift=1.5em]e2.north west) {\scriptsize{$h(\textrm{到})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (h3) at ([yshift=1.5em]e3.north west) {\scriptsize{$h(\textrm{广州})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (h4) at ([yshift=1.5em]e4.north west) {\scriptsize{$h(\textrm{的})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (h5) at ([yshift=1.5em]e5.north west) {\scriptsize{$h(\textrm{机票})$}};
+\foreach \x in {1,2,3,4,5}{
+	\node [anchor=north] (plus\x) at ([yshift=-0em]e\x.south) {\scriptsize{$\mathbf{\oplus}$}};
+}
+\node [rnode,anchor=north,fill=yellow!20!white] (pos1) at ([yshift=-1.1em]e1.south) {\scriptsize{$\textrm{PE}(1)$}};
+\node [rnode,anchor=north,fill=yellow!20!white] (pos2) at ([yshift=-1.1em]e2.south) {\scriptsize{$\textrm{PE}(2)$}};
+\node [rnode,anchor=north,fill=yellow!20!white] (pos3) at ([yshift=-1.1em]e3.south) {\scriptsize{$\textrm{PE}(3)$}};
+\node [rnode,anchor=north,fill=yellow!20!white] (pos4) at ([yshift=-1.1em]e4.south) {\scriptsize{$\textrm{PE}(4)$}};
+\node [rnode,anchor=north,fill=yellow!20!white] (pos5) at ([yshift=-1.1em]e5.south) {\scriptsize{$\textrm{PE}(5)$}};
+\foreach \x in {1,2,3,4,5}{
+	\node [rectangle,inner sep=0.1em,rounded corners=1pt,very thick,dotted,draw=red!60] [fit = (e\x) (pos\x)] (box\x) {};
+}
+\node [anchor=north] (inputs1) at ([yshift=-1em]pos1.south) {\scriptsize{沈阳}};
+\node [anchor=north] (inputs2) at ([yshift=-1em]pos2.south) {\scriptsize{到}};
+\node [anchor=north] (inputs3) at ([yshift=-1em]pos3.south) {\scriptsize{广州}};
+\node [anchor=north] (inputs4) at ([yshift=-1em]pos4.south) {\scriptsize{的}};
+\node [anchor=north] (inputs5) at ([yshift=-1em]pos5.south) {\scriptsize{机票}};
+\draw [->] ([yshift=0.1em]e1.north) .. controls +(north:0.5) and +(south:0.5) .. ([xshift=-1em,yshift=-0.1em]h3.south);
+\draw [->] ([yshift=0.1em]e2.north) .. controls +(north:0.3) and +(south:0.6) .. ([xshift=-0.5em,yshift=-0.1em]h3.south);
+\draw [->] ([yshift=0.1em]e3.north) -- ([yshift=-0.1em]h3.south);
+\draw [->] ([yshift=0.1em]e4.north) .. controls +(north:0.3) and +(south:0.6) .. ([xshift=0.5em,yshift=-0.1em]h3.south);
+\draw [->] ([yshift=0.1em]e5.north) .. controls +(north:0.5) and +(south:0.5) .. ([xshift=1em,yshift=-0.1em]h3.south);
+\draw [->] ([yshift=0.1em]e1.north) -- ([yshift=-0.1em]h1.south);
+\draw [->] ([yshift=0.1em]e2.north) -- ([yshift=-0.1em]h2.south);
+\draw [->] ([yshift=0.1em]e4.north) -- ([yshift=-0.1em]h4.south);
+\draw [->] ([yshift=0.1em]e5.north) -- ([yshift=-0.1em]h5.south);
+\foreach \x in {1,2,3,4,5}{
+	\draw [->] ([yshift=-0.1em]inputs\x.north) -- ([yshift=-0.2em]pos\x.south);
+}
+\node [anchor=north] (dot1) at ([xshift=0.4em,yshift=-0.2em]h1.south) {\tiny{...}};
+\node [anchor=north] (dot2) at ([xshift=0.4em,yshift=-0.2em]h2.south) {\tiny{...}};
+\node [anchor=north] (dot4) at ([xshift=-0.4em,yshift=-0.2em]h4.south) {\tiny{...}};
+\node [anchor=north] (dot5) at ([xshift=-0.4em,yshift=-0.2em]h5.south) {\tiny{...}};
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
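The figure above combines each word embedding $e(\cdot)$ with a position encoding $\textrm{PE}(\cdot)$ by element-wise addition. The figure itself does not say how $\textrm{PE}$ is computed; the sketch below assumes the sinusoidal encoding commonly used with the Transformer, and the function names `positional_encoding` and `add_position` are hypothetical.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding (assumed): sin on even dimensions, cos on odd ones."""
    pe = []
    for k in range(d_model):
        # Dimensions 2i and 2i+1 share the same frequency 1 / 10000^(2i / d_model).
        angle = pos / (10000 ** ((k - k % 2) / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

def add_position(embedding, pos):
    """Element-wise sum e(word) + PE(pos), the combination drawn in the figure."""
    pe = positional_encoding(pos, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]
```

Because the two vectors are added rather than concatenated, the model dimension is unchanged; identical words at different positions receive different inputs.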
--- a/Chapter12/Figures/figure-attention-of-source-and-target-words.tex
+++ b/Chapter12/Figures/figure-attention-of-source-and-target-words.tex
+%
+%---------------------------------------
+\begin{tikzpicture}
+\begin{scope}
+%\newlength{\mystep}
+%\setlength{\mystep}{1.6em}
+\foreach \x in {1,2,...,6}
+    \node[] (s\x) at (\x * 1.6em,0) {};
+\node [] (ws1) at (s1) {\scriptsize{这}};
+\node [] (ws2) at (s2) {\scriptsize{是}};
+\node [] (ws3) at (s3) {\scriptsize{个}};
+\node [] (ws4) at (s4) {\scriptsize{很长}};
+\node [] (ws5) at (s5) {\scriptsize{的}};
+\node [] (ws6) at (s6) {\scriptsize{句子}};
+\foreach \x in {1,2,...,6}
+    \node[] (t\x) at (\x * 1.6em + 2.4in,0) {};
+\node [] (wt1) at (t1) {\scriptsize{This}};
+\node [] (wt2) at (t2) {\scriptsize{is}};
+\node [] (wt3) at ([yshift=-1pt]t3) {\scriptsize{a}};
+\node [] (wt4) at ([yshift=-0.1em]t4) {\scriptsize{very}};
+\node [] (wt5) at (t5) {\scriptsize{long}};
+\node [] (wt6) at ([xshift=1em]t6) {\scriptsize{sentence}};
+\node [anchor=south west,fill=red!30,minimum width=1.6in,minimum height=1.5em] (encoder) at ([yshift=1.0em]ws1.north west) {\footnotesize{Encoder}};
+\node [anchor=west,fill=blue!30,minimum width=1.9in,minimum height=1.5em] (decoder) at ([xshift=4.5em]encoder.east) {\footnotesize{Decoder}};
+\node [anchor=west,fill=green!30,minimum height=1.5em] (representation) at ([xshift=1em]encoder.east) {\footnotesize{表示}};
+\draw [->,thick] ([xshift=1pt]encoder.east)--([xshift=-1pt]representation.west);
+\draw [->,thick] ([xshift=1pt]representation.east)--([xshift=-1pt]decoder.west);
+\foreach \x in {1,2,...,6}
+    \draw[->] ([yshift=0.1em]s\x.north) -- ([yshift=1.2em]s\x.north);
+\foreach \x in {1,2,...,5}
+    \draw[<-] ([yshift=0.1em]t\x.north) -- ([yshift=1.2em]t\x.north);
+\draw[<-] ([yshift=0.1em,xshift=1em]t6.north) -- ([yshift=1.2em,xshift=1em]t6.north);
+{
+\draw [<->,ublue,thick] ([xshift=0.3em]ws4.south) .. controls +(-60:1) and +(south:1) .. (wt4.south);
+\draw [<->,ublue,thick] (ws4.south) .. controls +(south:1.0) and +(south:1.5) .. (wt5.south);
+}
+{
+\node [anchor=north,fill=green!30] (attentionlabel) at ([yshift=-3.4em]representation.south) {\footnotesize{词语的关注度}};
+\draw [->,dotted,very thick,ublue] ([yshift=0.1em]attentionlabel.north)--([yshift=-0.1em]representation.south);
+}
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-calculation-of-context-vector-c.tex
+++ b/Chapter12/Figures/figure-calculation-of-context-vector-c.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{rnode} = [draw,minimum width=3.5em,minimum height=1.2em]
+\node [rnode,anchor=south west,fill=green!20!white] (key1) at (0,0) {\scriptsize{$h(\textrm{沈阳})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key2) at ([xshift=1em]key1.south east) {\scriptsize{$h(\textrm{到})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key3) at ([xshift=1em]key2.south east) {\scriptsize{$h(\textrm{广州})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key4) at ([xshift=2em]key3.south east) {\scriptsize{$h(\textrm{机票})$}};
+\node [rnode,anchor=south west] (key5) at ([xshift=1em]key4.south east) {\scriptsize{$h(\textrm{机票})$}};
+\node [anchor=west] (sep1) at ([xshift=0.3em]key3.east) {\scriptsize{$\textbf{...}$}};
+\draw [->] ([yshift=1pt,xshift=-3pt]key5.north) .. controls +(90:1em) and +(90:0.7em) .. ([yshift=1pt]key4.north);
+\draw [->] ([yshift=1pt,xshift=0pt]key5.north) .. controls +(90:1.4em) and +(90:1.4em) .. ([yshift=1pt]key3.north);
+\draw [->] ([yshift=1pt,xshift=3pt]key5.north) .. controls +(90:1.8em) and +(90:1.8em) .. ([yshift=1pt]key2.north);
+\draw [->] ([yshift=1pt,xshift=6pt]key5.north) .. controls +(90:2.2em) and +(90:2.2em) .. ([yshift=1pt]key1.north);
+\node [anchor=south west] (alpha1) at ([xshift=-1em]key1.north west) {\scriptsize{$\alpha_1=.2$}};
+\node [anchor=south west] (alpha2) at ([xshift=-1em]key2.north west) {\scriptsize{$\alpha_2=.3$}};
+\node [anchor=south west] (alpha3) at ([xshift=-1em]key3.north west) {\scriptsize{$\alpha_3=.1$}};
+\node [anchor=south west] (alpha4) at ([xshift=-1em]key4.north west) {\scriptsize{$\alpha_4=.3$}};
+\vspace{0.5em}
+\node [rnode,anchor=south west,fill=green!20!white] (key6) at ([yshift=2em]key1.north west) {\scriptsize{$h(\textrm{广州})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key7) at ([yshift=2em]key2.north west) {\scriptsize{$h(\textrm{到})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key8) at ([yshift=2em]key3.north west) {\scriptsize{$h(\textrm{沈阳})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key9) at ([yshift=2em]key4.north west) {\scriptsize{$h(\textrm{机票})$}};
+\node [rnode,anchor=south west] (key10) at ([yshift=2em]key5.north west) {\scriptsize{$h(\textrm{``机票''})$}};
+\node [anchor=west] (sep2) at ([xshift=0.3em]key8.east) {\scriptsize{$\textbf{...}$}};
+\draw [->] ([yshift=1pt,xshift=-3pt]key10.north) .. controls +(90:1em) and +(90:0.7em) .. ([yshift=1pt]key9.north);
+\draw [->] ([yshift=1pt,xshift=0pt]key10.north) .. controls +(90:1.4em) and +(90:1.4em) .. ([yshift=1pt]key8.north);
+\draw [->] ([yshift=1pt,xshift=3pt]key10.north) .. controls +(90:1.8em) and +(90:1.8em) .. ([yshift=1pt]key7.north);
+\draw [->] ([yshift=1pt,xshift=6pt]key10.north) .. controls +(90:2.2em) and +(90:2.2em) .. ([yshift=1pt]key6.north);
+\node [anchor=south west] (alpha5) at ([xshift=-1em]key6.north west) {\scriptsize{$\alpha_1=.1$}};
+\node [anchor=south west] (alpha6) at ([xshift=-1em]key7.north west) {\scriptsize{$\alpha_2=.3$}};
+\node [anchor=south west] (alpha7) at ([xshift=-1em]key8.north west) {\scriptsize{$\alpha_3=.2$}};
+\node [anchor=south west] (alpha8) at ([xshift=-1em]key9.north west) {\scriptsize{$\alpha_4=.3$}};
+\end{scope}
+\end{tikzpicture}
+\vspace{-1.0em}
+\footnotesize{
+\begin{eqnarray}
+\tilde{\mathbf{h}} (\textrm{机票}) & = & 0.2 \times h(\textrm{沈阳}) + 0.3 \times h(\textrm{到}) + \nonumber \\
+             &   & 0.1 \times h(\textrm{广州}) + \cdots + 0.3 \times h(\textrm{机票}) \nonumber
+\end{eqnarray}
+}
\ No newline at end of file
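The two rows of the figure list the same words with the same weights in different orders. The sketch below illustrates why the weighted sum is unaffected by that reordering. The 2-dimensional vectors are invented purely for illustration, and, as in the figure, only the four visible positions are included (the elided positions are left out, so the weights here do not sum to 1).

```python
def weighted_sum(weights, vectors):
    """Return sum_i weights[i] * vectors[i], computed per dimension."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

# Hypothetical 2-dimensional representations; the real h(.) come from the model.
h = {"沈阳": [1.0, 0.0], "到": [0.0, 1.0], "广州": [1.0, 1.0], "机票": [2.0, 0.0]}

# First row of the figure.
order_a = ["沈阳", "到", "广州", "机票"]
alpha_a = [0.2, 0.3, 0.1, 0.3]
# Second row: same word/weight pairs, different order.
order_b = ["广州", "到", "沈阳", "机票"]
alpha_b = [0.1, 0.3, 0.2, 0.3]

out_a = weighted_sum(alpha_a, [h[w] for w in order_a])
out_b = weighted_sum(alpha_b, [h[w] for w in order_b])
```

Both results agree up to floating-point rounding, which is the reason self-attention needs an explicit position signal to distinguish word order.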
--- a/Chapter12/Figures/figure-calculation-process-of-context-vector-c.tex
+++ b/Chapter12/Figures/figure-calculation-process-of-context-vector-c.tex
+\begin{tikzpicture}
+\begin{scope}
+\node [anchor=west,draw,fill=red!20!white,inner sep=3pt,minimum width=2em,minimum height=1.2em] (h1) at (0,0) {\scriptsize{$\vectorn{\emph{h}}_1$}};
+\node [anchor=west,draw,fill=red!20!white,inner sep=3pt,minimum width=2em,minimum height=1.2em] (h2) at ([xshift=1em]h1.east) {\scriptsize{$\vectorn{\emph{h}}_2$}};
+\node [anchor=west,inner sep=0pt,minimum width=3em] (h3) at ([xshift=0.5em]h2.east) {\scriptsize{...}};
+\node [anchor=west,draw,fill=red!20!white,inner sep=3pt,minimum width=2em,minimum height=1.2em] (h4) at ([xshift=0.5em]h3.east) {\scriptsize{$\vectorn{\emph{h}}_m$}};
+\node [anchor=south,circle,minimum size=1.0em,draw,ublue,thick] (sum) at ([yshift=2em]h2.north east) {};
+\draw [thick,-,ublue] (sum.north) -- (sum.south);
+\draw [thick,-,ublue] (sum.west) -- (sum.east);
+\node [anchor=south,draw,fill=green!20!white,inner sep=3pt,minimum width=2em,minimum height=1.2em] (th1) at ([yshift=2em,xshift=-1em]sum.north west) {\scriptsize{$\vectorn{\emph{s}}_{j-1}$}};
+\node [anchor=west,draw,fill=green!20!white,inner sep=3pt,minimum width=2em,minimum height=1.2em] (th2) at ([xshift=2em]th1.east) {\scriptsize{$\vectorn{\emph{s}}_{j}$}};
+\draw [->] (h1.north) .. controls +(north:0.8) and +(west:1) ..  (sum.190) node [pos=0.2,left] {\scriptsize{$\alpha_{1,j}$}};
+\draw [->] (h2.north) .. controls +(north:0.6) and +(220:0.2) ..  (sum.220) node [pos=0.2,right] {\scriptsize{$\alpha_{2,j}$}};
+\draw [->] (h4.north) .. controls +(north:0.8) and +(east:1) ..  (sum.-10) node [pos=0.1,left] (alphan) {\scriptsize{$\alpha_{m,j}$}};
+\draw [->] ([xshift=-1.5em]th1.west) -- ([xshift=-0.1em]th1.west);
+\draw [->] ([xshift=0.1em]th1.east) -- ([xshift=-0.1em]th2.west);
+\draw [->] ([xshift=0.1em]th2.east) -- ([xshift=1.5em]th2.east);
+\draw [->] (sum.north) .. controls +(north:0.8) and +(west:0.2) ..  ([yshift=-0.4em,xshift=-0.1em]th2.west) node [pos=0.2,right] (ci) {\scriptsize{$\vectorn{\emph{C}}_{j}$}};
+\node [anchor=south,inner sep=1pt] (output) at ([yshift=0.8em]th2.north) {\scriptsize{输出层}};
+\draw [->] ([yshift=0.1em]th2.north) -- ([yshift=-0.1em]output.south);
+\node [anchor=north] (enc1) at (h1.south west) {\scriptsize{编码器输出}};
+\node [anchor=north] (enc12) at ([yshift=0.5em]enc1.south) {\scriptsize{(位置$1$)}};
+\node [anchor=north] (enc2) at (h2.south) {\scriptsize{编码器输出}};
+\node [anchor=north] (enc22) at ([yshift=0.5em]enc2.south) {\scriptsize{(位置$2$)}};
+\node [anchor=north] (enc4) at (h4.south) {\scriptsize{编码器输出}};
+\node [anchor=north] (enc42) at ([yshift=0.5em]enc4.south) {\scriptsize{(位置$m$)}};
+{
+\node [anchor=west] (math1) at ([xshift=5em,yshift=1em]th2.east) {$\vectorn{\emph{C}}_j = \sum_{i} \alpha_{i,j} \vectorn{\emph{h}}_i \ \ $};
+}
+{
+\node [anchor=north west] (math2) at ([yshift=-2em]math1.south west) {$\alpha_{i,j} = \frac{\exp(\beta_{i,j})}{\sum_{i'} \exp(\beta_{i',j})}$};
+\node [anchor=north west] (math3) at ([yshift=-0em]math2.south west) {$\beta_{i,j} = a(\vectorn{\emph{s}}_{j-1}, \vectorn{\emph{h}}_i)$};
+}
+\begin{pgfonlayer}{background}
+{
+\node [rectangle,inner sep=0.4em,rounded corners=1pt,fill=blue!10,drop shadow] [fit = (math1)] (box1) {};
+}
+{
+\node [rectangle,inner sep=0.4em,rounded corners=1pt,fill=orange!10,drop shadow] [fit = (math2) (math3)] (box2) {};
+}
+\end{pgfonlayer}
+{
+\draw [->,dotted,thick,blue] (box1.west) .. controls +(west:1.2) and +(east:2.0) .. ([xshift=-0.3em]ci.east);
+}
+{
+\draw [->,dotted,thick,orange] ([yshift=1em]box2.west) .. controls +(west:1.2) and +(east:1.0) .. ([xshift=-0.35em]alphan.east);
+}
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
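The three formulas in the figure above ($\beta_{i,j}$, $\alpha_{i,j}$, and the context vector $C_j$) can be sketched as follows. The scoring function $a(\cdot,\cdot)$ is not specified in the figure; a plain dot product is used here as an assumption, and any other scorer can be passed through the hypothetical `score` parameter.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def context_vector(s_prev, encoder_states, score=dot):
    """Compute C_j = sum_i alpha_{i,j} h_i with alpha = softmax(beta), beta_{i,j} = a(s_{j-1}, h_i)."""
    betas = [score(s_prev, h) for h in encoder_states]
    alphas = softmax(betas)
    dim = len(encoder_states[0])
    c = [sum(a * h[d] for a, h in zip(alphas, encoder_states)) for d in range(dim)]
    return c, alphas
```

When all scores are equal, the weights are uniform and the context vector reduces to the mean of the encoder states.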
--- a/Chapter12/Figures/figure-comparison-of-the-number-of-padding-in-batch.tex
+++ b/Chapter12/Figures/figure-comparison-of-the-number-of-padding-in-batch.tex
+\begin{tikzpicture}
+\begin{scope}[scale=1.5]
+{\Large
+\tikzstyle{snode} = [draw,inner sep=1pt,minimum width=3em,minimum height=0.5em,rounded corners=1pt,fill=green!30!white]
+\tikzstyle{pnode} = [draw,inner sep=1pt,minimum width=1em,minimum height=0.5em,rounded corners=1pt]
+\node [anchor=west,snode] (s1) at (0,0) {};
+\node [anchor=north west,snode,minimum width=6.5em] (s2) at ([yshift=-0.3em]s1.south west) {};
+\node [anchor=north west,snode,minimum width=2em] (s3) at ([yshift=-0.3em]s2.south west) {};
+\node [anchor=east] (label1) at ([xshift=-0.8em,yshift=0.6em]s1.west) {\scriptsize{Shuffled:}};
+\node [anchor=west,pnode,minimum width=3em] (p1) at ([xshift=0.3em]s1.east) {};
+\node [anchor=west,pnode,minimum width=4em] (p3) at ([xshift=0.3em]s3.east) {};
+\node [anchor=west,snode,minimum width=5em] (s4) at ([xshift=4em]p1.east) {};
+\node [anchor=north west,snode,minimum width=5em] (s5) at ([yshift=-0.3em]s4.south west) {};
+\node [anchor=north west,snode,minimum width=6.5em] (s6) at ([yshift=-0.3em]s5.south west) {};
+\node [anchor=east] (label2) at ([xshift=-0.8em,yshift=0.6em]s4.west) {\scriptsize{Sorted:}};
+\node [anchor=west,pnode,minimum width=1em] (p4) at ([xshift=0.3em]s4.east) {};
+\node [anchor=west,pnode,minimum width=1em] (p5) at ([xshift=0.3em]s5.east) {};
+\node [rectangle,inner sep=0.5em,rounded corners=2pt,very thick,dotted,draw=ugreen!80] [fit = (s1) (s3) (p1) (p3)] (box0) {};
+\node [rectangle,inner sep=0.5em,rounded corners=2pt,very thick,dotted,draw=ugreen!80] [fit = (s4) (s6) (p4) (p5)] (box1) {};
+}
+\end{scope}
+\end{tikzpicture}
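The figure above contrasts the amount of padding in a shuffled batch with that in a length-sorted batch. A rough way to quantify the difference is to count padded positions per batch, as in the sketch below. The function name `padding_count` and the sentence lengths are invented for illustration and are not taken from the figure.

```python
def padding_count(lengths, batch_size):
    """Total padded positions when consecutive sentences are grouped into batches."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        # Every sentence is padded up to the longest one in its batch.
        total += sum(longest - n for n in batch)
    return total

# Hypothetical sentence lengths, in tokens.
lengths = [3, 7, 2, 5, 5, 6]
shuffled_padding = padding_count(lengths, batch_size=3)
sorted_padding = padding_count(sorted(lengths), batch_size=3)
```

For these lengths, sorting reduces the padding from 11 positions to 8. The saving generally grows with the variance of sentence lengths.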
--- a/Chapter12/Figures/figure-decode-of-transformer.tex
+++ b/Chapter12/Figures/figure-decode-of-transformer.tex
+   \begin{tikzpicture}
+    \begin{scope}
+    \tikzstyle{rnnnode} = [minimum height=1.1em,minimum width=2.1em,inner sep=2pt,rounded corners=1pt,draw,fill=red!20];
+    \node [rnnnode,anchor=west] (h1) at (0,0) {\tiny{$\vectorn{\emph{h}}_1$}};
+    \node [rnnnode,anchor=west] (h2) at ([xshift=1em]h1.east) {\tiny{$\vectorn{\emph{h}}_2$}};
+    \node [rnnnode,anchor=west] (h3) at ([xshift=1em]h2.east) {\tiny{$\vectorn{\emph{h}}_3$}};
+    \node [rnnnode,anchor=north,fill=green!20] (e1) at ([yshift=-1em]h1.south) {\tiny{$e_x()$}};
+    \node [rnnnode,anchor=west,fill=green!20] (e2) at ([xshift=1em]e1.east) {\tiny{$e_x()$}};
+    \node [rnnnode,anchor=west,fill=green!20] (e3) at ([xshift=1em]e2.east) {\tiny{$e_x()$}};
+    \node [anchor=north,inner sep=2pt] (w1) at ([yshift=-0.6em]e1.south) {\tiny{你}};
+    \node [anchor=north,inner sep=2pt] (w2) at ([yshift=-0.6em]e2.south) {\tiny{好}};
+    \node [anchor=north,inner sep=2pt] (w3) at ([yshift=-0.6em]e3.south) {\tiny{$\langle$eos$\rangle$}};
+    %\node [anchor=south] (dot1) at ([xshift=0.4em,yshift=-0.7em]h1.south) {\tiny{...}};
+    %\node [anchor=south] (dot2) at ([xshift=-0.4em,yshift=-0.7em]h3.south) {\tiny{...}};
+    \draw [->] (w1.north) -- ([yshift=-0.1em]e1.south);
+    \draw [->] (w2.north) -- ([yshift=-0.1em]e2.south);
+    \draw [->] (w3.north) -- ([yshift=-0.1em]e3.south);
+    \draw [->] ([yshift=0.1em]e1.north) -- ([yshift=-0.1em]h1.south);
+    \draw [->] ([yshift=0.1em]e2.north) -- ([yshift=-0.1em]h2.south);
+    \draw [->] ([yshift=0.1em]e3.north) -- ([yshift=-0.1em]h3.south);
+    \draw [->] ([xshift=0.2em,yshift=0.1em]e1.north) .. controls +(north:0.3) and +(south:0.4) .. ([xshift=-0.3em,yshift=-0.1em]h2.south);
+    \draw [->] ([xshift=-0.2em,yshift=0.1em]e3.north) .. controls +(north:0.3) and +(south:0.4) .. ([xshift=0.3em,yshift=-0.1em]h2.south);
+    \draw [->] ([xshift=0.4em,yshift=-0.4em]h1.south) -- ([xshift=0.3em,yshift=-0.1em]h1.south);
+    \draw [->] ([xshift=0.8em,yshift=-0.4em]h1.south) -- ([xshift=0.6em,yshift=-0.1em]h1.south);
+    \draw [->] ([xshift=-0.4em,yshift=-0.4em]h3.south) -- ([xshift=-0.3em,yshift=-0.1em]h3.south);
+    \draw [->] ([xshift=-0.8em,yshift=-0.4em]h3.south) -- ([xshift=-0.6em,yshift=-0.1em]h3.south);
+    \node [anchor=south] (encoder) at ([xshift=-0.2em]h1.north west) {\scriptsize{\textbf{编码器}}};
+{
+    \node [rnnnode,anchor=west,fill=green!20] (t1) at ([xshift=3em]e3.east) {\tiny{$e_y()$}};
+    }
+{
+    \node [rnnnode,anchor=west,fill=green!20] (t2) at ([xshift=1.5em]t1.east) {\tiny{$e_y()$}};
+    }
+{
+    \node [rnnnode,anchor=west,fill=green!20] (t3) at ([xshift=1.5em]t2.east) {\tiny{$e_y()$}};
+    \node [rnnnode,anchor=west,fill=green!20] (t4) at ([xshift=1.5em]t3.east) {\tiny{$e_y()$}};
+    %\node [anchor=west,inner sep=2pt] (t5) at ([xshift=0.3em]t4.east) {\tiny{...}};
+    }
+{
+    \node [rnnnode,anchor=south] (s1) at ([yshift=1em]t1.north) {\tiny{$\vectorn{\emph{s}}_1$}};
+    \node [rnnnode,anchor=south] (f1) at ([yshift=1em]s1.north) {\tiny{$\vectorn{\emph{f}}_1$}};
+    }
+{
+    \node [rnnnode,anchor=south] (s2) at ([yshift=1em]t2.north) {\tiny{$\vectorn{\emph{s}}_2$}};
+    \node [rnnnode,anchor=south] (f2) at ([yshift=1em]s2.north) {\tiny{$\vectorn{\emph{f}}_2$}};
+    }
+{
+    \node [rnnnode,anchor=south] (s3) at ([yshift=1em]t3.north) {\tiny{$\vectorn{\emph{s}}_3$}};
+    \node [rnnnode,anchor=south] (f3) at ([yshift=1em]s3.north) {\tiny{$\vectorn{\emph{f}}_3$}};
+    \node [rnnnode,anchor=south] (s4) at ([yshift=1em]t4.north) {\tiny{$\vectorn{\emph{s}}_4$}};
+    \node [rnnnode,anchor=south] (f4) at ([yshift=1em]s4.north) {\tiny{$\vectorn{\emph{f}}_4$}};
+    %\node [anchor=west,inner sep=2pt] (s5) at ([xshift=0.3em]s4.east) {\tiny{...}};
+    %\node [anchor=south] (dot3) at ([xshift=-0.4em,yshift=-0.7em]s3.south) {\tiny{...}};
+    \node [anchor=south] (dot4) at ([xshift=-0.4em,yshift=-0.7em]s4.south) {\tiny{...}};
+    }
+{
+    \node [rnnnode,anchor=south,fill=blue!20] (o1) at ([yshift=1em]f1.north) {\tiny{softmax}};
+    \node [anchor=east] (decoder) at ([xshift=-0.3em,yshift=0.5em]o1.north west) {\scriptsize{\textbf{解码器}}};
+    }
+{
+    \node [rnnnode,anchor=south,fill=blue!20] (o2) at ([yshift=1em]f2.north) {\tiny{softmax}};
+    }
+{
+    \node [rnnnode,anchor=south,fill=blue!20] (o3) at ([yshift=1em]f3.north) {\tiny{softmax}};
+    \node [rnnnode,anchor=south,fill=blue!20] (o4) at ([yshift=1em]f4.north) {\tiny{softmax}};
+    %\node [anchor=west,inner sep=2pt] (o5) at ([xshift=0.3em]o4.east) {\tiny{...}};
+    }
+{
+    \node [anchor=north,inner sep=2pt] (wt1) at ([yshift=-0.6em]t1.south) {\tiny{$\langle$eos$\rangle$}};
+    }
+{
+    \node [anchor=north,inner sep=2pt] (wt2) at ([yshift=-0.6em]t2.south) {\tiny{How}};
+    }
+{
+    \node [anchor=north,inner sep=2pt] (wt3) at ([yshift=-0.8em]t3.south) {\tiny{are}};
+    \node [anchor=north,inner sep=2pt] (wt4) at ([yshift=-0.8em]t4.south) {\tiny{you}};
+    }
+{
+    \node [anchor=center,inner sep=2pt] (wo1) at ([yshift=1.2em]o1.north) {\tiny{How}};
+    }
+{
+    \node [anchor=south,inner sep=2pt] (wos1) at (wo1.north) {\tiny{\textbf{[step 1]}}};
+    }
+{
+    \node [anchor=center,inner sep=2pt] (wo2) at ([yshift=1.2em]o2.north) {\tiny{are}};
+    }
+{
+    \node [anchor=south,inner sep=2pt] (wos2) at (wo2.north) {\tiny{\textbf{[step 2]}}};
+    }
+{
+    \node [anchor=center,inner sep=2pt] (wo3) at ([yshift=1.2em]o3.north) {\tiny{you}};
+    \node [anchor=south,inner sep=2pt] (wos3) at (wo3.north) {\tiny{\textbf{[step 3]}}};
+    \node [anchor=center,inner sep=2pt] (wo4) at ([yshift=1.2em]o4.north) {\tiny{$\langle$eos$\rangle$}};
+    \node [anchor=south,inner sep=2pt] (wos4) at (wo4.north) {\tiny{\textbf{[step 4]}}};
+    }
+{
+    \foreach \x in {1}{
+        \draw [->] ([yshift=-0.7em]t\x.south) -- ([yshift=-0.1em]t\x.south);
+        \draw [->] ([yshift=0.1em]t\x.north) -- ([yshift=-0.1em]s\x.south);
+        \draw [->] ([yshift=0.1em]s\x.north) -- ([yshift=-0.1em]f\x.south);
+        \draw [->] ([yshift=0.1em]f\x.north) -- ([yshift=-0.1em]o\x.south);
+        \draw [->] ([yshift=0.1em]o\x.north) -- ([yshift=0.8em]o\x.north) node [pos=0.5,right] {\tiny{top1}};
+    }
+    }
+{
+    \foreach \x in {2}{
+        \draw [->] ([yshift=-0.7em]t\x.south) -- ([yshift=-0.1em]t\x.south);
+        \draw [->] ([yshift=0.1em]t\x.north) -- ([yshift=-0.1em]s\x.south);
+        \draw [->] ([yshift=0.1em]s\x.north) -- ([yshift=-0.1em]f\x.south);
+        \draw [->] ([yshift=0.1em]f\x.north) -- ([yshift=-0.1em]o\x.south);
+        \draw [->] ([yshift=0.1em]o\x.north) -- ([yshift=0.8em]o\x.north) node [pos=0.5,right] {\tiny{top1}};
+    \draw [->] ([xshift=0.2em,yshift=0.1em]t1.north) .. controls +(north:0.3) and +(south:0.3) .. ([xshift=-0.3em,yshift=-0.1em]s2.south);
+    }
+    }
+{
+    \foreach \x in {3,4}{
+        \draw [->] ([yshift=-0.7em]t\x.south) -- ([yshift=-0.1em]t\x.south);
+        \draw [->] ([yshift=0.1em]t\x.north) -- ([yshift=-0.1em]s\x.south);
+        \draw [->] ([yshift=0.1em]s\x.north) -- ([yshift=-0.1em]f\x.south);
+        \draw [->] ([yshift=0.1em]f\x.north) -- ([yshift=-0.1em]o\x.south);
+        \draw [->] ([yshift=0.1em]o\x.north) -- ([yshift=0.8em]o\x.north) node [pos=0.5,right] {\tiny{top1}};
+    %\draw [->] ([xshift=0.4em,yshift=0.1em]t1.north) .. controls +(north:0.25) and +(south:0.3) .. ([xshift=-0.6em,yshift=-0.1em]s3.south);
+    %\draw [->] ([xshift=0.2em,yshift=0.1em]t2.north) .. controls +(north:0.2) and +(south:0.4) .. ([xshift=-0.3em,yshift=-0.1em]s3.south);
+    \draw [->] ([xshift=-0.6em,yshift=-0.5em]s3.south) .. controls +(north:0) and +(south:0.2) .. ([xshift=-0.3em,yshift=-0.1em]s3.south);
+    \draw [->] ([xshift=-1.5em,yshift=-0.5em]s3.south) .. controls +(north:0) and +(south:0.15) .. ([xshift=-0.6em,yshift=-0.1em]s3.south);
+    }
+    }
+{
+    \draw [->,thick,dotted] (wo1.east) .. controls +(east:1.0) and +(west:1.0) ..(wt2.west);
+    }
+{
+    \draw [->,thick,dotted] (wo2.east) .. controls +(east:1.3) and +(west:1.1) ..(wt3.west);
+    \draw [->,thick,dotted] (wo3.east) .. controls +(east:1.1) and +(west:0.9) ..(wt4.west);
+    }
+{
+    \node [circle,draw,anchor=south,inner sep=3pt,fill=orange!20] (c1) at ([yshift=2em]h2.north) {\tiny{$\vectorn{\emph{C}}_1$}};
+    \node [anchor=south] (c1label) at (c1.north) {\tiny{\textbf{编码-解码注意力机制：上下文}}};
+    \draw [->] (h1.north) .. controls +(north:0.6) and +(250:0.9) .. (c1.250);
+    \draw [->] (h2.north) .. controls +(north:0.6) and +(270:0.9) .. (c1.270);
+    \draw [->] (h3.north) .. controls +(north:0.6) and +(290:0.9) .. (c1.290);
+    \draw [->] ([yshift=0.3em]s1.west) .. controls +(west:1) and +(east:1) .. (c1.-30);
+    \draw [->] (c1.0) .. controls +(east:1) and +(west:1) .. ([yshift=0em]f1.west);
+    }
+{
+    \node [circle,draw,anchor=north,inner sep=3pt,fill=orange!20] (c2) at ([yshift=-2em]t1.south) {\tiny{$\vectorn{\emph{C}}_2$}};
+    \draw [->] ([xshift=-0.7em]c2.west) -- ([xshift=-0.1em]c2.west);
+    \draw [->] ([xshift=0.1em]c2.east) .. controls +(east:0.6) and +(west:0.8) ..([yshift=-0.3em,xshift=-0.1em]f2.west);
+    }
+{
+    \node [circle,draw,anchor=north,inner sep=3pt,fill=orange!20] (c3) at ([yshift=-2em]t2.south) {\tiny{$\vectorn{\emph{C}}_3$}};
+    \draw [->] ([xshift=-0.7em]c3.west) -- ([xshift=-0.1em]c3.west);
+    \draw [->] ([xshift=0.1em]c3.east) .. controls +(east:0.6) and +(west:0.8) ..([yshift=-0.3em,xshift=-0.1em]f3.west);
+    }
+{
+    \node [circle,draw,anchor=north,inner sep=3pt,fill=orange!20] (c4) at ([yshift=-2em]t3.south) {\tiny{$\vectorn{\emph{C}}_4$}};
+    \draw [->] ([xshift=-0.7em]c4.west) -- ([xshift=-0.1em]c4.west);
+    \draw [->] ([xshift=0.1em]c4.east) .. controls +(east:0.6) and +(west:0.8) ..([yshift=-0.3em,xshift=-0.1em]f4.west);
+    }
+    \end{scope}
+    \end{tikzpicture}
\ No newline at end of file
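The decoding procedure drawn above (take the top-1 word at each step and feed it back as the next input until the end symbol is produced) can be sketched as a greedy loop. The callable `next_token_probs` stands in for the whole decoder stack and is hypothetical, as are `toy_model` and the `max_len` cut-off; the start symbol follows the figure, which feeds $\langle$eos$\rangle$ as the first decoder input.

```python
def greedy_decode(next_token_probs, start="<eos>", end="<eos>", max_len=50):
    """Greedy (top-1) decoding: pick the most probable word at every step."""
    prefix = [start]   # decoder inputs so far
    output = []        # generated words, without the end symbol
    for _ in range(max_len):
        probs = next_token_probs(prefix)   # dict: word -> probability
        token = max(probs, key=probs.get)  # top-1 choice
        if token == end:
            break
        output.append(token)
        prefix.append(token)               # feed the prediction back in
    return output

def toy_model(prefix):
    # Hypothetical distributions indexed by the current prefix length.
    table = {
        1: {"How": 0.9, "are": 0.1},
        2: {"are": 0.8, "you": 0.2},
        3: {"you": 0.7, "<eos>": 0.3},
        4: {"<eos>": 0.6, "you": 0.4},
    }
    return table[len(prefix)]

result = greedy_decode(toy_model)
```

With the toy distributions, the loop reproduces the four steps of the figure: "How", "are", "you", then the end symbol, which stops decoding.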
--- a/Chapter12/Figures/figure-dependencies-between-words-in-a-recurrent-neural-network.tex
+++ b/Chapter12/Figures/figure-dependencies-between-words-in-a-recurrent-neural-network.tex
+\begin{tikzpicture}
+\begin{scope}
+\node [anchor=west] (w0) at (0,0) {$w_1$};
+\node [anchor=west] (w1) at ([xshift=0.5em]w0.east) {$w_2$};
+\node [anchor=west] (w2) at ([xshift=0.5em]w1.east) {$w_3$};
+\node [anchor=west] (w3) at ([xshift=0.5em]w2.east) {$...$};
+\node [anchor=west] (w4) at ([xshift=0.5em]w3.east) {$w_{m-1}$};
+\node [anchor=west,fill=green!20!white] (w5) at ([xshift=0.5em]w4.east) {$w_{m}$};
+\draw [->,thick,red] (w1.north).. controls +(130:0.5) and +(50:0.5) .. (w0.north);
+\draw [->,thick,red] (w2.north).. controls +(130:0.5) and +(50:0.5) .. (w1.north);
+\draw [->,thick,red] ([yshift=0.2em]w3.north).. controls +(130:0.5) and +(50:0.5) .. (w2.north);
+\draw [->,thick,red] (w4.north).. controls +(130:0.5) and +(50:0.5) .. ([yshift=0.2em]w3.north);
+\draw [->,thick,red] (w5.north).. controls +(130:0.5) and +(50:0.5) .. (w4.north);
+\draw [->,very thick,red] ([xshift=-5em]w0.west) -- ([xshift=-6.5em]w0.west) node [pos=0,right] {\scriptsize{信息传递}};
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-dependencies-between-words-of-attention.tex
+++ b/Chapter12/Figures/figure-dependencies-between-words-of-attention.tex
+\begin{tikzpicture}
+\begin{scope}
+\node [anchor=west] (w0) at (0,-2) {$w_1$};
+\node [anchor=west] (w1) at ([xshift=0.5em]w0.east) {$w_2$};
+\node [anchor=west] (w2) at ([xshift=0.5em]w1.east) {$w_3$};
+\node [anchor=west] (w3) at ([xshift=0.5em]w2.east) {$...$};
+\node [anchor=west] (w4) at ([xshift=0.5em]w3.east) {$w_{m-1}$};
+\node [anchor=west,fill=green!20!white] (w5) at ([xshift=0.5em]w4.east) {$w_{m}$};
+\draw [->,thick,red] (w5.north).. controls +(100:0.85) and +(50:0.85) .. (w0.north);
+\draw [->,thick,red] (w5.north).. controls +(110:0.75) and +(50:0.75) .. (w1.north);
+\draw [->,thick,red] (w5.north).. controls +(120:0.6) and +(50:0.6) .. ([yshift=0.2em]w3.north);
+\draw [->,thick,red] (w5.north).. controls +(130:0.5) and +(50:0.5) .. (w4.north);
+\draw [->,very thick,red] ([xshift=-5em]w0.west) -- ([xshift=-6.5em]w0.west) node [pos=0,right] {\scriptsize{信息传递}};
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-different-regularization-methods.tex
+++ b/Chapter12/Figures/figure-different-regularization-methods.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{lnode} = [minimum height=1.5em,minimum width=3em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!20];
+\tikzstyle{standard} = [rounded corners=3pt]
+\node [lnode,anchor=west] (l1) at (0,0) {\scriptsize{子层$n$}};
+\node [lnode,anchor=west] (l2) at ([xshift=3em]l1.east) {\scriptsize{层正则化}};
+\node [lnode,anchor=west] (l3) at ([xshift=4em]l2.east) {\scriptsize{层正则化}};
+\node [lnode,anchor=west] (l4) at ([xshift=1.5em]l3.east) {\scriptsize{子层$n$}};
+\node [anchor=west] (plus1) at ([xshift=0.9em]l1.east) {\scriptsize{$\mathbf{\oplus}$}};
+\node [anchor=west] (plus2) at ([xshift=0.9em]l4.east) {\scriptsize{$\mathbf{\oplus}$}};
+\node [anchor=north] (label1) at ([xshift=3em,yshift=-0.5em]l1.south) {\scriptsize{(a)后正则化}};
+\node [anchor=north] (label2) at ([xshift=3em,yshift=-0.5em]l3.south) {\scriptsize{(b)前正则化}};
+\draw [->,thick] ([xshift=-1.5em]l1.west) -- ([xshift=-0.1em]l1.west);
+\draw [->,thick] ([xshift=0.1em]l1.east) -- ([xshift=0.2em]plus1.west);
+\draw [->,thick] ([xshift=-0.2em]plus1.east) -- ([xshift=-0.1em]l2.west);
+\draw [->,thick] ([xshift=0.1em]l2.east) -- ([xshift=1em]l2.east);
+\draw [->,thick] ([xshift=-1.5em]l3.west) -- ([xshift=-0.1em]l3.west);
+\draw [->,thick] ([xshift=0.1em]l3.east) -- ([xshift=-0.1em]l4.west);
+\draw [->,thick] ([xshift=0.1em]l4.east) -- ([xshift=0.2em]plus2.west);
+\draw [->,thick] ([xshift=-0.2em]plus2.east) -- ([xshift=1em]plus2.east);
+\draw[->,standard,thick] ([xshift=-0.8em]l1.west) -- ([xshift=-0.8em,yshift=2em]l1.west) -- ([yshift=2em]plus1.center) -- ([yshift=-0.2em]plus1.north);
+\draw[->,standard,thick] ([xshift=-0.8em]l3.west) -- ([xshift=-0.8em,yshift=2em]l3.west) -- ([yshift=2em]plus2.center) -- ([yshift=-0.2em]plus2.north);
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
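The two arrangements in the figure above differ only in where layer normalization sits relative to the residual connection. The sketch below spells out both orderings. It omits the learned gain and bias of layer normalization for brevity, and the names `post_norm`, `pre_norm`, and `sublayer` are placeholders for illustration.

```python
import math

def layer_norm(x, eps=1e-6):
    # Normalize to zero mean and unit variance (learned gain/bias omitted).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm(x, sublayer):
    """(a) Post-norm: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    """(b) Pre-norm: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]
```

In the pre-norm variant the input `x` reaches the output through the residual path without passing through normalization, which is often cited as making deep stacks easier to train.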
--- a/Chapter12/Figures/figure-encoder-decoder-with-attention.tex
+++ b/Chapter12/Figures/figure-encoder-decoder-with-attention.tex
+%---------------------------------------------------------
+\begin{tikzpicture}
+%\setlength{\mystep}{1.6em}
+%%% a simple encoder-decoder model
+\begin{scope}
+\foreach \x in {1,2,...,6}
+    \node[] (s\x) at (\x * 1.6em,0) {};
+\node [] (ws1) at (s1) {\scriptsize{这}};
+\node [] (ws2) at (s2) {\scriptsize{是}};
+\node [] (ws3) at (s3) {\scriptsize{个}};
+\node [] (ws4) at (s4) {\scriptsize{很长}};
+\node [] (ws5) at (s5) {\scriptsize{的}};
+\node [] (ws6) at (s6) {\scriptsize{句子}};
+\foreach \x in {1,2,...,6}
+    \node[] (t\x) at (\x * 1.6em + 2.4in,0) {};
+\node [] (wt1) at (t1) {\scriptsize{This}};
+\node [] (wt2) at (t2) {\scriptsize{is}};
+\node [] (wt3) at ([yshift=-1pt]t3) {\scriptsize{a}};
+\node [] (wt4) at ([yshift=-0.1em]t4) {\scriptsize{very}};
+\node [] (wt5) at (t5) {\scriptsize{long}};
+\node [] (wt6) at ([xshift=1em]t6) {\scriptsize{sentence}};
+\node [anchor=south west,fill=red!30,minimum width=1.6in,minimum height=1.5em] (encoder) at ([yshift=1.0em]ws1.north west) {\footnotesize{Encoder}};
+\node [anchor=west,fill=blue!30,minimum width=1.9in,minimum height=1.5em] (decoder) at ([xshift=4.5em]encoder.east) {\footnotesize{Decoder}};
+\node [anchor=west,fill=green!30,minimum height=1.5em] (representation) at ([xshift=1em]encoder.east) {\footnotesize{Repr.}};
+\draw [->,thick] ([xshift=1pt]encoder.east)--([xshift=-1pt]representation.west);
+\draw [->,thick] ([xshift=1pt]representation.east)--([xshift=-1pt]decoder.west);
+\foreach \x in {1,2,...,6}
+    \draw[->] ([yshift=0.1em]s\x.north) -- ([yshift=1.2em]s\x.north);
+\foreach \x in {1,2,...,5}
+    \draw[<-] ([yshift=0.1em]t\x.north) -- ([yshift=1.2em]t\x.north);
+\draw[<-] ([yshift=0.1em,xshift=1em]t6.north) -- ([yshift=1.2em,xshift=1em]t6.north);
+\node [anchor=north] (cap) at ([xshift=2em,yshift=-2.5em]encoder.south east) {\small{(a) A simple encoder-decoder framework}};
+\end{scope}
+%%% an encoder-decoder model with attention
+\begin{scope}[yshift=-1.7in]
+\foreach \x in {1,2,...,6}
+    \node[] (s\x) at (\x * 1.6em,0) {};
+\node [] (ws1) at (s1) {\scriptsize{这}};
+\node [] (ws2) at (s2) {\scriptsize{是}};
+\node [] (ws3) at (s3) {\scriptsize{个}};
+\node [] (ws4) at (s4) {\scriptsize{很长}};
+\node [] (ws5) at (s5) {\scriptsize{的}};
+\node [] (ws6) at (s6) {\scriptsize{句子}};
+\foreach \x in {1,2,...,6}
+    \node[] (t\x) at (\x * 1.6em + 2.4in,0) {};
+\node [] (wt1) at (t1) {\scriptsize{This}};
+\node [] (wt2) at (t2) {\scriptsize{is}};
+\node [] (wt3) at ([yshift=-1pt]t3) {\scriptsize{a}};
+\node [] (wt4) at ([yshift=-0.1em]t4) {\scriptsize{very}};
+\node [] (wt5) at (t5) {\scriptsize{long}};
+\node [] (wt6) at ([xshift=1em]t6) {\scriptsize{sentence}};
+\node [anchor=south west,fill=red!30,minimum width=1.6in,minimum height=1.5em] (encoder) at ([yshift=1.0em]ws1.north west) {\footnotesize{Encoder}};
+\node [anchor=west,fill=blue!30,minimum width=1.9in,minimum height=1.5em] (decoder) at ([xshift=4.5em]encoder.east) {\footnotesize{Decoder}};
+\foreach \x in {1,2,...,6}
+    \draw[->] ([yshift=0.1em]s\x.north) -- ([yshift=1.2em]s\x.north);
+\foreach \x in {1,2,...,5}
+    \draw[<-] ([yshift=0.1em]t\x.north) -- ([yshift=1.2em]t\x.north);
+\draw[<-] ([yshift=0.1em,xshift=1em]t6.north) -- ([yshift=1.2em,xshift=1em]t6.north);
+\draw [->] ([yshift=3em]s6.north) -- ([yshift=4em]s6.north) -- ([yshift=4em]t1.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c1) {\scriptsize{Representation $\vectorn{\emph{C}}_1$}} -- ([yshift=3em]t1.north) ;
+\draw [->] ([yshift=3em]s5.north) -- ([yshift=5.3em]s5.north) -- ([yshift=5.3em]t2.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c2) {\scriptsize{Representation $\vectorn{\emph{C}}_2$}} -- ([yshift=3em]t2.north) ;
+\draw [->] ([yshift=3.5em]s3.north) -- ([yshift=6.6em]s3.north) -- ([yshift=6.6em]t4.north) node [pos=0.5,fill=green!30,inner sep=2pt] (c3) {\scriptsize{Representation $\vectorn{\emph{C}}_i$}} -- ([yshift=3.5em]t4.north) ;
+\node [anchor=north] (smore) at ([yshift=3.5em]s3.north) {...};
+\node [anchor=north] (tmore) at ([yshift=3.5em]t4.north) {...};
+\node [anchor=north] (cap) at ([xshift=2em,yshift=-2.5em]encoder.south east) {\small{(b) An encoder-decoder framework with attention}};
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-example-of-context-vector-calculation-process.tex
+++ b/Chapter12/Figures/figure-example-of-context-vector-calculation-process.tex
+%-------------------------------------------
+\begin{tikzpicture}
+%\newlength{\mystep}
+%\newlength{\wseg}
+%\newlength{\hseg}
+%\newlength{\wnode}
+%\newlength{\hnode}
+\setlength{\wseg}{1.5cm}
+\setlength{\hseg}{1.0cm}
+\setlength{\wnode}{3.75cm}
+\setlength{\hnode}{1.0cm}
+\tikzstyle{elementnode} = [rectangle,text=white,anchor=center]
+\tikzstyle{srcnode} = [rotate=45,font=\small,anchor=south west]
+\tikzstyle{tgtnode} = [left,font=\small,anchor=north east]
+\tikzstyle{alignmentnode} = [rectangle,draw,minimum height=3.6\hnode,minimum width=0.36\hnode]
+\tikzstyle{probnode} = [fill=blue!30,minimum width=0.4\hnode]
+\tikzstyle{labelnode} = [above]
+% alignment matrix
+\begin{scope}[scale=0.9,yshift=0.12in]
+\foreach \i / \j / \c in
+    {0/7/0.2, 1/7/0.45, 2/7/0.15, 3/7/0.15, 4/7/0.15, 5/7/0.15,
+    0/6/0.35, 1/6/0.45, 2/6/0.15, 3/6/0.15, 4/6/0.15, 5/6/0.15,
+    0/5/0.25, 1/5/0.15, 2/5/0.15, 3/5/0.35, 4/5/0.15, 5/5/0.15,
+    0/4/0.15, 1/4/0.25, 2/4/0.2, 3/4/0.30, 4/4/0.15, 5/4/0.15,
+    0/3/0.15, 1/3/0.15, 2/3/0.8, 3/3/0.25, 4/3/0.15, 5/3/0.25,
+    0/2/0.15, 1/2/0.15, 2/2/0.15, 3/2/0.15, 4/2/0.25, 5/2/0.3,
+    0/1/0.15, 1/1/0.15, 2/1/0.15, 3/1/0.15, 4/1/0.8, 5/1/0.15,
+    0/0/0.15, 1/0/0.15, 2/0/0.15, 3/0/0.15, 4/0/0.25, 5/0/0.60}
+    \node[elementnode,minimum size=0.6*\hnode*\c,inner sep=0.1pt,fill=blue] (a\i\j) at (0.5*\hnode*\i-5.4*0.5*\hnode,0.5*\hnode*\j-1.05*\hnode) {};
+%attention score labels
+\node[align=center] (l17) at (a17) {\scriptsize{{\color{white} .4}}};
+\node[align=center] (l06) at (a06) {\scriptsize{{\color{white} .3}}};
+\node[align=center] (l16) at (a16) {\scriptsize{{\color{white} .4}}};
+\node[align=center] (l35) at (a35) {\scriptsize{{\color{white} .3}}};
+\node[align=center] (l34) at (a34) {\tiny{{\color{white} .3}}};
+\node[align=center] (l23) at (a23) {\small{{\color{white} .8}}};
+\node[align=center] (l41) at (a41) {\small{{\color{white} .8}}};
+\node[align=center] (l50) at (a50) {\small{{\color{white} .7}}};
+% source
+\node[srcnode] (src1) at (-5.4*0.5*\hnode,-1.05*\hnode+7.5*0.5*\hnode) {\scriptsize{Have}};
+\node[srcnode] (src2) at ([xshift=0.5\hnode]src1.south west) {\scriptsize{you}};
+\node[srcnode] (src3) at ([xshift=0.5\hnode]src2.south west) {\scriptsize{learned}};
+\node[srcnode] (src4) at ([xshift=0.5\hnode]src3.south west) {\scriptsize{nothing}};
+\node[srcnode] (src5) at ([xshift=0.5\hnode]src4.south west) {\scriptsize{?}};
+\node[srcnode] (src6) at ([xshift=0.5\hnode]src5.south west) {\scriptsize{$\langle$eos$\rangle$}};
+% target
+\node[tgtnode] (tgt1) at (-6.0*0.5*\hnode,-1.05*\hnode+7.5*0.5*\hnode) {\scriptsize{你}};
+\node[tgtnode] (tgt2) at ([yshift=-0.5\hnode]tgt1.north east) {\scriptsize{什么}};
+\node[tgtnode] (tgt3) at ([yshift=-0.5\hnode]tgt2.north east) {\scriptsize{都}};
+\node[tgtnode] (tgt4) at ([yshift=-0.5\hnode]tgt3.north east) {\scriptsize{没}};
+\node[tgtnode] (tgt5) at ([yshift=-0.5\hnode]tgt4.north east) {\scriptsize{学}};
+\node[tgtnode] (tgt6) at ([yshift=-0.5\hnode]tgt5.north east) {\scriptsize{到}};
+\node[tgtnode] (tgt7) at ([yshift=-0.5\hnode]tgt6.north east) {\scriptsize{?}};
+\node[tgtnode] (tgt8) at ([yshift=-0.5\hnode]tgt7.north east) {\scriptsize{$\langle$eos$\rangle$}};
+\end{scope}
+%\visible<2->
+{
+% alignment rectangle 1
+\node[alignmentnode, ugreen, anchor=north west] (alignment1) at ([xshift=-0.3em,yshift=0.4em]a07.north west) {};
+}
+%\visible<3->
+{
+% alignment rectangle 2
+\node[alignmentnode, red, anchor=north west] (alignment2) at ([xshift=-0.1em,yshift=0.2em]a17.north west) {};
+}
+%\visible<3->
+{
+% alignment bars 2
+\node[probnode,anchor=south west,minimum height=0.4\hnode,inner sep=0.1pt,fill=red!40,label=below:\scriptsize{$0.4$}] (attn21) at ([xshift=2.3\hnode,yshift=0.5\hnode]alignment2.east) {};
+\node[probnode,anchor=south west,minimum height=0.4\hnode,inner sep=0.1pt,fill=red!40,label=below:\scriptsize{$0.4$}] (attn22) at ([xshift=1pt]attn21.south east) {};
+\node[probnode,anchor=south west,minimum height=0.05\hnode,inner sep=0.1pt,fill=red!40,label=below:\scriptsize{$0$}] (attn23) at ([xshift=1pt]attn22.south east) {};
+\node[probnode,anchor=south west,minimum height=0.1\hnode,inner sep=0.1pt,fill=red!40,label=below:\scriptsize{$0.1$}] (attn24) at ([xshift=1pt]attn23.south east) {};
+\node[probnode,anchor=south west,minimum height=0.05\hnode,inner sep=0.1pt,fill=red!40,label=below:\scriptsize{$0$}] (attn25) at ([xshift=1pt]attn24.south east) {};
+\node[probnode,anchor=south west,minimum height=0.05\hnode,inner sep=0.1pt,fill=red!40,label=below:\scriptsize{$...$}] (attn26) at ([xshift=1pt]attn25.south east) {};
+}
+%\visible<2->
+{
+% alignment bars 1
+\node[probnode,anchor=south,minimum height=0.2\hnode,inner sep=0.1pt,fill=ugreen!40,label=below:\scriptsize{$0.2$}] (attn11) at ([xshift=2.5\hnode,yshift=-1em]alignment2.north east) {};
+\node[probnode,anchor=south west,minimum height=0.3\hnode,inner sep=0.1pt,fill=ugreen!40,label=below:\scriptsize{$0.3$}] (attn12) at ([xshift=1pt]attn11.south east) {};
+\node[probnode,anchor=south west,minimum height=0.2\hnode,inner sep=0.1pt,fill=ugreen!40,label=below:\scriptsize{$0.2$}] (attn13) at ([xshift=1pt]attn12.south east) {};
+\node[probnode,anchor=south west,minimum height=0.05\hnode,inner sep=0.1pt,fill=ugreen!40,label=below:\scriptsize{$0$}] (attn14) at ([xshift=1pt]attn13.south east) {};
+\node[probnode,anchor=south west,minimum height=0.05\hnode,inner sep=0.1pt,fill=ugreen!40,label=below:\scriptsize{$0$}] (attn15) at ([xshift=1pt]attn14.south east) {};
+\node[probnode,anchor=south west,minimum height=0.05\hnode,inner sep=0.1pt,fill=ugreen!40,label=below:\scriptsize{$...$}] (attn16) at ([xshift=1pt]attn15.south east) {};
+}
+%\visible<3->
+{
+% coverage score formula node
+\node [anchor=north west] (formula) at ([xshift=-0.3\hnode,yshift=-1.5\hnode]attn11.south) {\small{Different $\vectorn{\emph{C}}_j$ put different weights on the source-language words}};
+\node [anchor=north west] (example) at (formula.south west) {\footnotesize{$\vectorn{\emph{C}}_2=0.4 \times \vectorn{\emph{h}}(\textrm{“你”}) + 0.4 \times \vectorn{\emph{h}}(\textrm{“什么”}) +$}};
+\node [anchor=north west] (example2) at ([yshift=0.4em]example.south west) {\footnotesize{$\ \ \ \ \ \ \ \ 0 \times \vectorn{\emph{h}}(\textrm{“都”}) + 0.1 \times \vectorn{\emph{h}}(\textrm{“没”}) + \cdots$}};
+}
+%\visible<3->
+{
+% matrix -> attn2
+\draw[->,red] ([xshift=0.1em,yshift=2.3em]alignment2.east).. controls +(east:1.9cm) and +(west:1.0cm) ..([xshift=-0.15\hnode,yshift=-1em]attn21.north west);
+}
+%\visible<2->
+{
+\draw[->,ugreen] ([xshift=0.1em,yshift=-1.2em]alignment1.north east)--([xshift=2.2\hnode,yshift=-1.2em]alignment2.north east);
+}
+%\visible<3->
+{
+% attn2 -> cov2
+\draw[->] ([xshift=0.2\hnode,yshift=0.0\hnode]attn26.east)--([xshift=0.7\hnode,yshift=0]attn26.east) node[pos=0.5,above] (sum2) {\small{$\sum$}}; % 0.3 - 0.5 height of the
+}
+%\visible<2->
+{
+% attn1 -> cov1
+\draw[->] ([xshift=0.2\hnode]attn16.east)--([xshift=0.7\hnode]attn16.east) node[pos=0.5,above] (sum1) {\small{$\sum$}};
+}
+% coverage score for each source word
+%\visible<2->
+{
+\node[anchor=west] (sc1) at ([xshift=0.9\hnode]attn16.east) {$\vectorn{\emph{C}}_1 = \sum_{i=1}^{8} \alpha_{i1} \vectorn{\emph{h}}_{i}$};
+}
+%\visible<3->
+{
+\node[anchor=west] (sc2) at ([xshift=0.9\hnode,yshift=0.0\hnode]attn26.east) {$\vectorn{\emph{C}}_2 = \sum_{i=1}^{8} \alpha_{i2} \vectorn{\emph{h}}_{i}$};
+}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-example-of-self-attention-mechanism-calculation.tex
+++ b/Chapter12/Figures/figure-example-of-self-attention-mechanism-calculation.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{rnode} = [draw,minimum width=2.8em,minimum height=1.2em]
+\node [rnode,anchor=south west,fill=green!20!white] (key11) at (0,0) {\scriptsize{$h(\textrm{你})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key12) at ([xshift=0.8em]key11.south east) {\scriptsize{$h(\textrm{什么})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key13) at ([xshift=0.8em]key12.south east) {\scriptsize{$h(\textrm{也})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key14) at ([xshift=0.8em]key13.south east) {\scriptsize{$h(\textrm{没})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key15) at ([xshift=0.8em]key14.south east) {\scriptsize{$h(\textrm{学})$}};
+\node [rnode,anchor=east] (query1) at ([xshift=-1em]key11.west) {\scriptsize{$h(\textrm{你})$}};
+\draw [->] ([yshift=1pt,xshift=4pt]query1.north) .. controls +(90:0.6em) and +(90:0.6em) .. ([yshift=1pt]key11.north);
+\draw [->] ([yshift=1pt,xshift=0pt]query1.north) .. controls +(90:1.0em) and +(90:1.0em) .. ([yshift=1pt]key12.north);
+\draw [->] ([yshift=1pt,xshift=-4pt]query1.north) .. controls +(90:1.4em) and +(90:1.4em) .. ([yshift=1pt]key13.north);
+\draw [->] ([yshift=1pt,xshift=-8pt]query1.north) .. controls +(90:1.8em) and +(90:1.8em) .. ([yshift=1pt]key14.north);
+\draw [->] ([yshift=1pt,xshift=-12pt]query1.north) .. controls +(90:2.2em) and +(90:2.2em) .. ([yshift=1pt]key15.north);
+\node [anchor=south west] (alpha11) at ([xshift=0.3em]key11.north) {\scriptsize{$\alpha_1$}};
+\node [anchor=south west] (alpha12) at ([xshift=0.3em]key12.north) {\scriptsize{$\alpha_2$}};
+\node [anchor=south west] (alpha13) at ([xshift=0.3em]key13.north) {\scriptsize{$\alpha_3$}};
+\node [anchor=south west] (alpha14) at ([xshift=0.3em]key14.north) {\scriptsize{$\alpha_4$}};
+\node [anchor=south west] (alpha15) at ([xshift=0.3em]key15.north) {\scriptsize{$\alpha_5$}};
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-lrate-of-transformer.tex
+++ b/Chapter12/Figures/figure-lrate-of-transformer.tex
+  \begin{tikzpicture}
+    \footnotesize{
+      \begin{axis}[
+      width=.60\textwidth,
+      height=.40\textwidth,
+      legend style={at={(0.60,0.08)}, anchor=south west},
+      xlabel={\footnotesize{Update steps (k)}},
+      ylabel={\footnotesize{Learning rate (\scriptsize{$10^{-3}$})}},
+      ylabel style={yshift=-1em},xlabel style={yshift=0.0em},
+      yticklabel style={/pgf/number format/precision=2,/pgf/number format/fixed zerofill},
+      ymin=0,ymax=0.9, ytick={0.2, 0.4, 0.6, 0.8},
+      xmin=0,xmax=12,xtick={2,4,6,8,10},
+      legend style={yshift=-6pt, legend plot pos=right,font=\scriptsize,cells={anchor=west}}
+      ]
+      \addplot[orange,line width=1.25pt] coordinates {(0,0) (4,0.7) (5,0.63) (6,0.57) (7,0.525) (8,0.49) (9,0.465) (10,0.44) (11,0.42) (12,0.4)};
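+      % A sketch of the schedule this curve appears to follow, assuming the
+      % usual Transformer warm-up rule; d_model = 512 and warmup = 4000 are
+      % assumed values that are not stated in this file:
+      %   lrate = d_model^{-0.5} * min(step^{-0.5}, step * warmup^{-1.5})
+      % Under those assumptions the rule gives roughly 0.70e-3 at step 4000,
+      % 0.49e-3 at step 8000 and 0.40e-3 at step 12000, in line with the
+      % plotted points at x = 4, 8 and 12.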
+      \end{axis}
+     }
+  \end{tikzpicture}
--- a/Chapter12/Figures/figure-mask-instance-for-future-positions-in-transformer.tex
+++ b/Chapter12/Figures/figure-mask-instance-for-future-positions-in-transformer.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{attnode} = [minimum size=1.5em,inner sep=0pt,rounded corners=1pt,draw]
+\tikzstyle{srcnode} = [rotate=45,font=\small,anchor=south west]
+\tikzstyle{tgtnode} = [left,font=\small,anchor=north east]
+\tikzstyle{masknode} = [minimum size=5.8em,inner sep=0pt,rounded corners=1pt,draw]
+\tikzstyle{elementnode} = [rectangle,text=white,anchor=center]
+%\setlength{\hnode}{1.0cm}
+%\node [anchor=west,attnode] (node1) at (0,0) {\tiny{}};
+%\node [anchor=west,attnode] (node2) at ([xshift=1em]node1.east) {\tiny{}};
+{
+\foreach \i / \j / \c in
+    {0/5/0.25, 1/5/0.15, 2/5/0.15, 3/5/0.35, 4/5/0.25, 5/5/0.15,
+    0/4/0.15, 1/4/0.25, 2/4/0.2, 3/4/0.30, 4/4/0.15, 5/4/0.15,
+    0/3/0.15, 1/3/0.15, 2/3/0.5, 3/3/0.25, 4/3/0.15, 5/3/0.25,
+    0/2/0.15, 1/2/0.15, 2/2/0.15, 3/2/0.15, 4/2/0.25, 5/2/0.3,
+    0/1/0.25, 1/1/0.15, 2/1/0.15, 3/1/0.15, 4/1/0.5, 5/1/0.15,
+    0/0/0.15, 1/0/0.15, 2/0/0.15, 3/0/0.15, 4/0/0.25, 5/0/0.40}
+    \node[elementnode,minimum size=0.6*1.0cm*\c,inner sep=0.1pt,fill=blue] (a\i\j) at (0.5*1.0cm*\i-5.4*0.5*1.0cm,0.5*1.0cm*\j-1.05*1.0cm) {};
+% source
+\node[srcnode] (src1) at (-5.4*0.5*1.0cm,-1.05*1.0cm+5.5*0.5*1.0cm) {\scriptsize{Have}};
+\node[srcnode] (src2) at ([xshift=0.5cm]src1.south west) {\scriptsize{you}};
+\node[srcnode] (src3) at ([xshift=0.5cm]src2.south west) {\scriptsize{learned}};
+\node[srcnode] (src4) at ([xshift=0.5cm]src3.south west) {\scriptsize{nothing}};
+\node[srcnode] (src5) at ([xshift=0.5cm]src4.south west) {\scriptsize{?}};
+\node[srcnode] (src6) at ([xshift=0.5cm]src5.south west) {\scriptsize{$\langle$eos$\rangle$}};
+% target
+\node[tgtnode] (tgt1) at (-6.0*0.5*1.0cm,-1.05*1.0cm+5.5*0.5*1.0cm) {\scriptsize{Have}};
+\node[tgtnode] (tgt2) at ([yshift=-0.5cm]tgt1.north east) {\scriptsize{you}};
+\node[tgtnode] (tgt3) at ([yshift=-0.5cm]tgt2.north east) {\scriptsize{learned}};
+\node[tgtnode] (tgt4) at ([yshift=-0.5cm]tgt3.north east) {\scriptsize{nothing}};
+\node[tgtnode] (tgt5) at ([yshift=-0.5cm]tgt4.north east) {\scriptsize{?}};
+\node[tgtnode] (tgt6) at ([yshift=-0.5cm]tgt5.north east) {\scriptsize{$\langle$eos$\rangle$}};
+{
+\filldraw [fill=blue!20,draw,thick,fill opacity=0.85] ([xshift=-0.9em,yshift=0.5em]a15.north west) -- ([xshift=0.5em,yshift=-0.9em]a51.south east) --  ([xshift=0.5em,yshift=0.5em]a55.north east) -- ([xshift=-0.9em,yshift=0.5em]a15.north west);
+\node[anchor=west] (labelmask) at ([xshift=0.3em,yshift=0.5em]a23.north east) {Masked};
+}
+{
+\foreach \i / \j / \c in
+    {0/5/0.25,
+    0/4/0.15, 1/4/0.25,
+    0/3/0.15, 1/3/0.15, 2/3/0.5,
+    0/2/0.15, 1/2/0.15, 2/2/0.15, 3/2/0.15,
+    0/1/0.25, 1/1/0.15, 2/1/0.15, 3/1/0.15, 4/1/0.5,
+    0/0/0.15, 1/0/0.15, 2/0/0.15, 3/0/0.15, 4/0/0.25, 5/0/0.40}
+    \node[elementnode,minimum size=0.6*1.0cm*\c,inner sep=0.1pt,fill=blue] (a\i\j) at (0.5*1.0cm*\i+6*0.5*1.0cm,0.5*1.0cm*\j-1.05*1.0cm) {};
+% source
+\node[srcnode] (src1) at (6*0.5*1.0cm,-1.05*1.0cm+5.5*0.5*1.0cm) {\scriptsize{Have}};
+\node[srcnode] (src2) at ([xshift=0.5cm]src1.south west) {\scriptsize{you}};
+\node[srcnode] (src3) at ([xshift=0.5cm]src2.south west) {\scriptsize{learned}};
+\node[srcnode] (src4) at ([xshift=0.5cm]src3.south west) {\scriptsize{nothing}};
+\node[srcnode] (src5) at ([xshift=0.5cm]src4.south west) {\scriptsize{?}};
+\node[srcnode] (src6) at ([xshift=0.5cm]src5.south west) {\scriptsize{$\langle$eos$\rangle$}};
+% target
+\node[tgtnode] (tgt1) at (5.4*0.5*1.0cm,-1.05*1.0cm+5.5*0.5*1.0cm) {\scriptsize{Have}};
+\node[tgtnode] (tgt2) at ([yshift=-0.5cm]tgt1.north east) {\scriptsize{you}};
+\node[tgtnode] (tgt3) at ([yshift=-0.5cm]tgt2.north east) {\scriptsize{learned}};
+\node[tgtnode] (tgt4) at ([yshift=-0.5cm]tgt3.north east) {\scriptsize{nothing}};
+\node[tgtnode] (tgt5) at ([yshift=-0.5cm]tgt4.north east) {\scriptsize{?}};
+\node[tgtnode] (tgt6) at ([yshift=-0.5cm]tgt5.north east) {\scriptsize{$\langle$eos$\rangle$}};
+}
+}
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-matrix-representation-of-attention-weights-between-chinese-english-sentence-pairs.tex
+++ b/Chapter12/Figures/figure-matrix-representation-of-attention-weights-between-chinese-english-sentence-pairs.tex
+%-------------------------------------------
+\begin{tikzpicture}
+%\setlength{\hnode}{1.2cm}
+\tikzstyle{elementnode} = [rectangle,text=white,anchor=center]
+\tikzstyle{srcnode} = [rotate=45,font=\small,anchor=south west]
+\tikzstyle{tgtnode} = [left,font=\small,anchor=north east]
+\tikzstyle{alignmentnode} = [rectangle,draw,minimum height=3.6cm,minimum width=0.36cm]
+\tikzstyle{probnode} = [fill=blue!30,minimum width=0.4cm]
+\tikzstyle{labelnode} = [above]
+% alignment matrix
+\begin{scope}[scale=0.9,yshift=0.12in]
+\foreach \i / \j / \c in
+    {0/7/0.2, 1/7/0.45, 2/7/0.15, 3/7/0.15, 4/7/0.15, 5/7/0.15,
+    0/6/0.35, 1/6/0.45, 2/6/0.15, 3/6/0.15, 4/6/0.15, 5/6/0.15,
+    0/5/0.25, 1/5/0.15, 2/5/0.15, 3/5/0.35, 4/5/0.15, 5/5/0.15,
+    0/4/0.15, 1/4/0.25, 2/4/0.2, 3/4/0.30, 4/4/0.15, 5/4/0.15,
+    0/3/0.15, 1/3/0.15, 2/3/0.8, 3/3/0.25, 4/3/0.15, 5/3/0.25,
+    0/2/0.15, 1/2/0.15, 2/2/0.15, 3/2/0.15, 4/2/0.25, 5/2/0.3,
+    0/1/0.15, 1/1/0.15, 2/1/0.15, 3/1/0.15, 4/1/0.8, 5/1/0.15,
+    0/0/0.15, 1/0/0.15, 2/0/0.15, 3/0/0.15, 4/0/0.25, 5/0/0.60}
+    \node[elementnode,minimum size=0.6*1.2cm*\c,inner sep=0.1pt,fill=blue] (a\i\j) at (0.5*1.2cm*\i-5.4*0.5*1.2cm,0.5*1.2cm*\j-1.05*1.2cm) {};
+%attention score labels
+\node[align=center] (l17) at (a17) {\scriptsize{{\color{white} .4}}};
+\node[align=center] (l06) at (a06) {\scriptsize{{\color{white} .3}}};
+\node[align=center] (l16) at (a16) {\scriptsize{{\color{white} .4}}};
+\node[align=center] (l35) at (a35) {\scriptsize{{\color{white} .3}}};
+\node[align=center] (l34) at (a34) {\tiny{{\color{white} .3}}};
+\node[align=center] (l23) at (a23) {\small{{\color{white} .8}}};
+\node[align=center] (l41) at (a41) {\small{{\color{white} .8}}};
+\node[align=center] (l50) at (a50) {\small{{\color{white} .7}}};
+% source
+\node[srcnode] (src1) at (-5.4*0.5*1.2cm,-1.05*1.2cm+7.5*0.5*1.2cm) {\scriptsize{Have}};
+\node[srcnode] (src2) at ([xshift=0.6cm]src1.south west) {\scriptsize{you}};
+\node[srcnode] (src3) at ([xshift=0.6cm]src2.south west) {\scriptsize{learned}};
+\node[srcnode] (src4) at ([xshift=0.6cm]src3.south west) {\scriptsize{nothing}};
+\node[srcnode] (src5) at ([xshift=0.6cm]src4.south west) {\scriptsize{?}};
+\node[srcnode] (src6) at ([xshift=0.6cm]src5.south west) {\scriptsize{$\langle$eos$\rangle$}};
+% target
+\node[tgtnode] (tgt1) at (-6.0*0.5*1.2cm,-1.05*1.2cm+7.5*0.5*1.2cm) {\scriptsize{你}};
+\node[tgtnode] (tgt2) at ([yshift=-0.6cm]tgt1.north east) {\scriptsize{什么}};
+\node[tgtnode] (tgt3) at ([yshift=-0.6cm]tgt2.north east) {\scriptsize{都}};
+\node[tgtnode] (tgt4) at ([yshift=-0.6cm]tgt3.north east) {\scriptsize{没}};
+\node[tgtnode] (tgt5) at ([yshift=-0.6cm]tgt4.north east) {\scriptsize{学}};
+\node[tgtnode] (tgt6) at ([yshift=-0.6cm]tgt5.north east) {\scriptsize{到}};
+\node[tgtnode] (tgt7) at ([yshift=-0.6cm]tgt6.north east) {\scriptsize{?}};
+\node[tgtnode] (tgt8) at ([yshift=-0.6cm]tgt7.north east) {\scriptsize{$\langle$eos$\rangle$}};
+\end{scope}
+\end{tikzpicture}
+%-------------------------------------------
\ No newline at end of file
--- a/Chapter12/Figures/figure-multi-head-attention-model.tex
+++ b/Chapter12/Figures/figure-multi-head-attention-model.tex
+\begin{tikzpicture}
+\begin{scope}
+\node [anchor=west,draw=black!30,inner sep=4pt,fill=ugreen!20!white,text=ugreen!20!white] (Linear0) at (0,0) {\footnotesize{Linear}};
+\node [anchor=south west,draw=black!50,fill=ugreen!20!white,draw,inner sep=4pt,text=ugreen!20!white] (Linear01) at ([shift={(-0.2em,-0.2em)}]Linear0.south west) {\footnotesize{Linear}};
+\node [anchor=south west,fill=ugreen!20!white,draw,inner sep=4pt] (Linear02) at ([shift={(-0.2em,-0.2em)}]Linear01.south west) {\footnotesize{Linear}};
+\node [anchor=north] (Q) at ([xshift=0em,yshift=-1em]Linear02.south) {\footnotesize{$\vectorn{\emph{Q}}$}};
+\node [anchor=west,draw=black!30,inner sep=4pt,fill=ugreen!20!white,text=ugreen!20!white] (Linear1) at ([xshift=1.5em]Linear0.east) {\footnotesize{Linear}};
+\node [anchor=south west,draw=black!50,fill=ugreen!20!white,draw,inner sep=4pt,text=ugreen!20!white] (Linear11) at ([shift={(-0.2em,-0.2em)}]Linear1.south west) {\footnotesize{Linear}};
+\node [anchor=south west,fill=ugreen!20!white,draw,inner sep=4pt] (Linear12) at ([shift={(-0.2em,-0.2em)}]Linear11.south west) {\footnotesize{Linear}};
+\node [anchor=north] (K) at ([xshift=0em,yshift=-1em]Linear12.south) {\footnotesize{$\vectorn{\emph{K}}$}};
+\node [anchor=west,draw=black!30,inner sep=4pt,fill=ugreen!20!white,text=ugreen!20!white] (Linear2) at ([xshift=1.5em]Linear1.east) {\footnotesize{Linear}};
+\node [anchor=south west,draw=black!50,fill=ugreen!20!white,draw,inner sep=4pt,text=ugreen!20!white] (Linear21) at ([shift={(-0.2em,-0.2em)}]Linear2.south west) {\footnotesize{Linear}};
+\node [anchor=south west,fill=ugreen!20!white,draw,inner sep=4pt] (Linear22) at ([shift={(-0.2em,-0.2em)}]Linear21.south west) {\footnotesize{Linear}};
+\node [anchor=north] (V) at ([xshift=0em,yshift=-1em]Linear22.south) {\footnotesize{$\vectorn{\emph{V}}$}};
+\node [anchor=south,draw=black!30,minimum width=12em,minimum height=2em,inner sep=4pt,fill=blue!20!white] (Scale) at ([yshift=1em]Linear1.north) {\footnotesize{}};
+\node [anchor=south west,draw=black!50,minimum width=12em,minimum height=2em,fill=blue!20!white,draw,inner sep=4pt] (Scale1) at ([shift={(-0.2em,-0.2em)}]Scale.south west) {\footnotesize{}};
+\node [anchor=south west,fill=blue!20!white,draw,minimum width=12em,minimum height=2em,inner sep=4pt] (Scale2) at ([shift={(-0.2em,-0.2em)}]Scale1.south west) {\footnotesize{Scaled Dot-Product Attention}};
+\node [anchor=south,draw,minimum width=4em,inner sep=4pt,fill=yellow!30] (Concat) at ([yshift=1em]Scale2.north) {\footnotesize{Concat}};
+\node [anchor=south,draw,minimum width=4em,inner sep=4pt,fill=ugreen!20!white] (Linear) at ([yshift=1em]Concat.north) {\footnotesize{Linear}};
+\draw [->] ([yshift=0.1em]Q.north) -- ([yshift=-0.1em]Linear02.south);
+\draw [-,draw=black!50] ([yshift=0.1em]Q.north) -- ([xshift=0.2em,yshift=-0.1em]Linear02.south);
+\draw [-,draw=black!30] ([yshift=0.1em]Q.north) -- ([xshift=0.4em,yshift=-0.1em]Linear02.south);
+\draw [->] ([yshift=0.1em]K.north) -- ([yshift=-0.1em]Linear12.south);
+\draw [-,draw=black!50] ([yshift=0.1em]K.north) -- ([xshift=0.2em,yshift=-0.1em]Linear12.south);
+\draw [-,draw=black!30] ([yshift=0.1em]K.north) -- ([xshift=0.4em,yshift=-0.1em]Linear12.south);
+\draw [->] ([yshift=0.1em]V.north) -- ([yshift=-0.1em]Linear22.south);
+\draw [-,draw=black!50] ([yshift=0.1em]V.north) -- ([xshift=0.2em,yshift=-0.1em]Linear22.south);
+\draw [-,draw=black!30] ([yshift=0.1em]V.north) -- ([xshift=0.4em,yshift=-0.1em]Linear22.south);
+\draw [->] ([yshift=0em]Linear02.north) -- ([yshift=1em]Linear02.north);
+\draw [-,draw=black!50] ([yshift=0em]Linear01.north) -- ([yshift=0.8em]Linear01.north);
+\draw [-,draw=black!30] ([yshift=0em]Linear0.north) -- ([yshift=0.6em]Linear0.north);
+\draw [->] ([yshift=0em]Linear12.north) -- ([yshift=1em]Linear12.north);
+\draw [-,draw=black!50] ([yshift=0em]Linear11.north) -- ([yshift=0.8em]Linear11.north);
+\draw [-,draw=black!30] ([yshift=0em]Linear1.north) -- ([yshift=0.6em]Linear1.north);
+\draw [->] ([yshift=0em]Linear22.north) -- ([yshift=1em]Linear22.north);
+\draw [-,draw=black!50] ([yshift=0em]Linear21.north) -- ([yshift=0.8em]Linear21.north);
+\draw [-,draw=black!30] ([yshift=0em]Linear2.north) -- ([yshift=0.6em]Linear2.north);
+\draw [->] ([yshift=0em]Scale2.north) -- ([yshift=0em]Concat.south);
+\draw [-,draw=black!50] ([yshift=0em]Scale1.north) -- ([yshift=0.8em]Scale1.north);
+\draw [-,draw=black!30] ([yshift=0em]Scale.north) -- ([yshift=0.6em]Scale.north);
+\draw [->] ([yshift=0em]Concat.north) -- ([yshift=0em]Linear.south);
+\draw [->] ([yshift=0em]Linear.north) -- ([yshift=1em]Linear.north);
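+% A sketch of what the diagram computes, in the notation of the standard
+% multi-head attention formulation. The head count h and the projection
+% matrices W_i^Q, W_i^K, W_i^V, W^O are assumed names, not taken from this file.
+%   head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),   i = 1, ..., h
+%   MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O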
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-point-product-attention-model.tex
+++ b/Chapter12/Figures/figure-point-product-attention-model.tex
+\begin{tikzpicture}
+\begin{scope}
+\node [anchor=south west,fill=white,draw,inner sep=4pt,minimum width=4em,fill=blue!20!white] (MatMul) at (0,0) {\tiny{MatMul}};
+\node [anchor=north] (Q1) at ([xshift=-1.4em,yshift=-1em]MatMul.south) {\footnotesize{$\vectorn{\emph{Q}}$}};
+\node [anchor=north] (K1) at ([xshift=1.4em,yshift=-1em]MatMul.south) {\footnotesize{$\vectorn{\emph{K}}$}};
+\node [anchor=south,draw,inner sep=4pt,fill=yellow!30,minimum width=2.5em] (Scale3) at ([yshift=1em]MatMul.north) {\tiny{Scale}};
+\node [anchor=south,draw,inner sep=4pt,fill=purple!20,minimum width=3.5em] (Mask) at ([yshift=0.8em]Scale3.north) {\tiny{Mask(opt.)}};
+\node [anchor=south,draw,inner sep=4pt,fill=ugreen!20!white] (SoftMax) at ([yshift=1em]Mask.north) {\tiny{SoftMax}};
+\node [anchor=south,draw,minimum width=4em,inner sep=4pt,fill=blue!20!white] (MatMul1) at ([xshift=1.7em,yshift=1em]SoftMax.north) {\tiny{MatMul}};
+\node [anchor=north] (V1) at ([xshift=2em]K1.north) {\footnotesize{$\vectorn{\emph{V}}$}};
+\node [anchor=north] (null) at ([yshift=0.8em]MatMul1.north) {};
+\draw [->] ([yshift=0.1em]Q1.north) -- ([xshift=-1.4em,yshift=-0.1em]MatMul.south);
+\draw [->] ([yshift=0.1em]K1.north) -- ([xshift=1.4em,yshift=-0.1em]MatMul.south);
+\draw [->] ([yshift=0.1em]MatMul.north) -- ([yshift=-0.1em]Scale3.south);
+\draw [->] ([yshift=0.1em]Scale3.north) -- ([yshift=-0.1em]Mask.south);
+\draw [->] ([yshift=0.1em]Mask.north) -- ([yshift=-0.1em]SoftMax.south);
+\draw [->] ([yshift=0.1em]SoftMax.north) -- ([yshift=0.9em]SoftMax.north);
+\draw [->] ([yshift=0.1em]V1.north) -- ([yshift=9.3em]V1.north);
+\draw [->] ([yshift=0.1em]MatMul1.north) -- ([yshift=0.8em]MatMul1.north);
+{
+\node [anchor=east] (line1) at ([xshift=-4em,yshift=1em]MatMul.west) {\scriptsize{In self-attention, the Query, Key,}};
+\node [anchor=north west] (line2) at ([yshift=0.3em]line1.south west) {\scriptsize{and Value share one sentence;}};
+\node [anchor=north west] (line3) at ([yshift=0.3em]line2.south west) {\scriptsize{encoder-decoder attention is}};
+\node [anchor=north west] (line4) at ([yshift=0.3em]line3.south west) {\scriptsize{the same as described earlier.}};
+}
+{
+\node [anchor=west] (line11) at ([xshift=3em,yshift=0em]MatMul.east) {\scriptsize{The dot product of Query and the}};
+\node [anchor=north west] (line12) at ([yshift=0.3em]line11.south west) {\scriptsize{transpose of Key gives the relevance}};
+\node [anchor=north west] (line13) at ([yshift=0.3em]line12.south west) {\scriptsize{between positions in the sentence.}};
+}
+{
+\node [anchor=west] (line21) at ([yshift=5em]line11.west) {\scriptsize{The variance of the relevance matrix}};
+\node [anchor=north west] (line22) at ([yshift=0.3em]line21.south west) {\scriptsize{grows during training and hinders it,}};
+\node [anchor=north west] (line23) at ([yshift=0.3em]line22.south west) {\scriptsize{so the matrix is scaled.}};
+}
+{
+\node [anchor=west] (line31) at ([yshift=6em]line1.west) {\scriptsize{On the encoder side, padded}};
+\node [anchor=north west] (line32) at ([yshift=0.3em]line31.south west) {\scriptsize{positions are masked; the decoder}};
+\node [anchor=north west] (line33) at ([yshift=0.3em]line32.south west) {\scriptsize{cannot see future information,}};
+\node [anchor=north west] (line34) at ([yshift=0.3em]line33.south west) {\scriptsize{so future positions are masked.}};
+}
+{
+\node [anchor=west] (line41) at ([yshift=4em]line21.west) {\scriptsize{Use the normalized relevance scores}};
+\node [anchor=north west] (line42) at ([yshift=0.3em]line41.south west) {\scriptsize{to take a weighted sum of Value.}};
+}
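+% The annotated steps can be summarized in one formula (a sketch; d_k, the
+% Key dimension, and the additive mask M are assumed names that do not
+% appear in this file):
+%   Attention(Q, K, V) = Softmax(Q K^T / \sqrt{d_k} + M) V
+% M is 0 at visible positions and -\infty at padded or future positions,
+% so those positions receive zero weight after the Softmax.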
+\begin{pgfonlayer}{background}
+{
+\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=green!10,drop shadow,draw=ugreen] [fit = (line1) (line2) (line3) (line4)] (box1) {};
+\node [rectangle,inner sep=0.1em,rounded corners=1pt,very thick,dotted,draw=ugreen] [fit = (Q1) (K1) (V1)] (box0) {};
+\draw [->,dotted,very thick,ugreen] ([yshift=-1.5em,xshift=1.2em]box1.east) -- ([yshift=-1.5em,xshift=0.1em]box1.east);
+}
+{
+\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=blue!20!white,drop shadow,draw=blue] [fit = (line11) (line12) (line13)] (box2) {};
+\draw [->,dotted,very thick,blue] ([yshift=1em,xshift=-2.8em]box2.west) -- ([yshift=1em,xshift=-0.1em]box2.west);
+}
+{
+\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=yellow!20,drop shadow,draw=black] [fit = (line21) (line22) (line23)] (box3) {};
+\draw [->,dotted,very thick,black] ([xshift=0.1em]Scale3.east) .. controls +(east:1) and +(west:1) .. ([yshift=1.0em]box3.west) ;
+}
+{
+\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=red!10,drop shadow,draw=red] [fit = (line31) (line32) (line33) (line34)] (box4) {};
+\draw [->,dotted,very thick,red] ([yshift=-1.2em,xshift=2.2em]box4.east) -- ([yshift=-1.2em,xshift=0.1em]box4.east);
+}
+{
+\node [rectangle,inner sep=0.2em,rounded corners=1pt,fill=blue!20!white,drop shadow,draw=blue] [fit = (line41) (line42)] (box5) {};
+\draw [->,dotted,very thick,blue] ([yshift=-0.3em,xshift=-1em]box5.west) -- ([yshift=-0.3em,xshift=-0.1em]box5.west);
+}
+\end{pgfonlayer}
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-position-of-difference-and-layer-regularization-in-the-model.tex
+++ b/Chapter12/Figures/figure-position-of-difference-and-layer-regularization-in-the-model.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{Sanode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!20];
+\tikzstyle{Resnode} = [minimum height=1.1em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=yellow!20];
+\tikzstyle{ffnnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
+\tikzstyle{outputnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
+\tikzstyle{inputnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=red!10];
+\tikzstyle{posnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=black!5!white];
+\tikzstyle{standard} = [rounded corners=3pt]
+\node [Sanode,anchor=west] (sa1) at (0,0) {\tiny{$\textbf{Self-Attention}$}};
+\node [Resnode,anchor=south] (res1) at ([yshift=0.3em]sa1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [ffnnode,anchor=south] (ffn1) at ([yshift=1em]res1.north) {\tiny{$\textbf{Feed Forward Network}$}};
+\node [Resnode,anchor=south] (res2) at ([yshift=0.3em]ffn1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [inputnode,anchor=north west] (input1) at ([yshift=-1em]sa1.south west) {\tiny{$\textbf{Embedding}$}};
+\node [posnode,anchor=north east] (pos1) at ([yshift=-1em]sa1.south east) {\tiny{$\textbf{Position}$}};
+\node [anchor=north] (inputs) at ([yshift=-3em]sa1.south) {\scriptsize{$\textbf{Encoder input: 我\ \ 很\ \ 好}$}};
+\node [anchor=south] (encoder) at ([xshift=0.2em,yshift=0.6em]res2.north west) {\scriptsize{\textbf{Encoder}}};
+\draw [->] (sa1.north) -- (res1.south);
+\draw [->] (res1.north) -- (ffn1.south);
+\draw [->] (ffn1.north) -- (res2.south);
+\draw [->] ([yshift=-1em]sa1.south) -- (sa1.south);
+\draw [->] ([yshift=-0.3em]inputs.north) -- ([yshift=0.6em]inputs.north);
+\node [Sanode,anchor=west] (sa2) at ([xshift=3em]sa1.east) {\tiny{$\textbf{Self-Attention}$}};
+\node [Resnode,anchor=south] (res3) at ([yshift=0.3em]sa2.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [Sanode,anchor=south] (ed1) at ([yshift=1em]res3.north) {\tiny{$\textbf{Encoder-Decoder Attention}$}};
+\node [Resnode,anchor=south] (res4) at ([yshift=0.3em]ed1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [ffnnode,anchor=south] (ffn2) at ([yshift=1em]res4.north) {\tiny{$\textbf{Feed Forward Network}$}};
+\node [Resnode,anchor=south] (res5) at ([yshift=0.3em]ffn2.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [outputnode,anchor=south] (o1) at ([yshift=1em]res5.north) {\tiny{$\textbf{Output layer}$}};
+\node [inputnode,anchor=north west] (input2) at ([yshift=-1em]sa2.south west) {\tiny{$\textbf{Embedding}$}};
+\node [posnode,anchor=north east] (pos2) at ([yshift=-1em]sa2.south east) {\tiny{$\textbf{Position}$}};
+\node [anchor=north] (outputs) at ([yshift=-3em]sa2.south) {\scriptsize{$\textbf{Decoder input: $<$sos$>$ I  am  fine}$}};
+\node [anchor=east] (decoder) at ([xshift=-1em,yshift=-1.5em]o1.west) {\scriptsize{\textbf{Decoder}}};
+\node [anchor=north] (decoutputs) at ([yshift=1.5em]o1.north) {\scriptsize{$\textbf{Decoder output: I  am  fine $<$eos$>$ }$}};
+\draw [->] (sa2.north) -- (res3.south);
+\draw [->] (res3.north) -- (ed1.south);
+\draw [->] (ed1.north) -- (res4.south);
+\draw [->] (res4.north) -- (ffn2.south);
+\draw [->] (ffn2.north) -- (res5.south);
+\draw [->] (res5.north) -- (o1.south);
+\draw [->] (o1.north) -- ([yshift=0.5em]o1.north);
+\draw [->] ([yshift=-1em]sa2.south) -- (sa2.south);
+\draw [->] ([yshift=-0.3em]outputs.north) -- ([yshift=0.6em]outputs.north);
+\draw[->,standard] ([yshift=-0.5em]sa1.south) -- ([xshift=-4em,yshift=-0.5em]sa1.south) -- ([xshift=-4em,yshift=2.3em]sa1.south) -- ([xshift=-3.5em,yshift=2.3em]sa1.south);
+\draw[->,standard] ([yshift=0.5em]res1.north) -- ([xshift=-4em,yshift=0.5em]res1.north) -- ([xshift=-4em,yshift=3.3em]res1.north) -- ([xshift=-3.5em,yshift=3.3em]res1.north);
+\draw[->,standard] ([yshift=-0.5em]sa2.south) -- ([xshift=4em,yshift=-0.5em]sa2.south) -- ([xshift=4em,yshift=2.3em]sa2.south) -- ([xshift=3.5em,yshift=2.3em]sa2.south);
+\draw[->,standard] ([yshift=0.5em]res3.north) -- ([xshift=4em,yshift=0.5em]res3.north) -- ([xshift=4em,yshift=3.3em]res3.north) -- ([xshift=3.5em,yshift=3.3em]res3.north);
+\draw[->,standard] ([yshift=0.5em]res4.north) -- ([xshift=4em,yshift=0.5em]res4.north) -- ([xshift=4em,yshift=3.3em]res4.north) -- ([xshift=3.5em,yshift=3.3em]res4.north);
+\draw[->,standard] (res2.north) -- ([yshift=0.5em]res2.north) -- ([xshift=5em,yshift=0.5em]res2.north) -- ([xshift=5em,yshift=-2.2em]res2.north) -- ([xshift=6.5em,yshift=-2.2em]res2.north);
+%\node [rectangle,inner sep=0.7em,rounded corners=1pt,very thick,dotted,draw=ugreen!70] [fit = (sa1) (res1) (ffn1) (res2)] (box0) {};
+%\node [rectangle,inner sep=0.7em,rounded corners=1pt,very thick,dotted,draw=red!60] [fit = (sa2) (res3) (res5)] (box1) {};
+\begin{pgfonlayer}{background}
+	\node [rectangle,inner sep=0.2em,rounded corners=1pt,very thick,dotted,fill=red!40] [fit = (res1)] (box1) {};		
+	\node [rectangle,inner sep=0.2em,rounded corners=1pt,very thick,dotted,fill=red!40] [fit = (res2)] (box2) {};	
+	\node [rectangle,inner sep=0.2em,rounded corners=1pt,very thick,dotted,fill=red!40] [fit = (res3)] (box3) {};	
+	\node [rectangle,inner sep=0.2em,rounded corners=1pt,very thick,dotted,fill=red!40] [fit = (res4)] (box4) {};	
+\end{pgfonlayer}
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-position-of-feedforward-neural-network-in-the-model.tex
+++ b/Chapter12/Figures/figure-position-of-feedforward-neural-network-in-the-model.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{Sanode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!20];
+\tikzstyle{Resnode} = [minimum height=1.1em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=yellow!20];
+\tikzstyle{ffnnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=blue!20];
+\tikzstyle{outputnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
+\tikzstyle{inputnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=red!10];
+\tikzstyle{posnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=black!5!white];
+\tikzstyle{standard} = [rounded corners=3pt]
+\node [Sanode,anchor=west] (sa1) at (0,0) {\tiny{$\textbf{Self-Attention}$}};
+\node [Resnode,anchor=south] (res1) at ([yshift=0.3em]sa1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [ffnnode,anchor=south] (ffn1) at ([yshift=1em]res1.north) {\tiny{$\textbf{Feed Forward Network}$}};
+\node [Resnode,anchor=south] (res2) at ([yshift=0.3em]ffn1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [inputnode,anchor=north west] (input1) at ([yshift=-1em]sa1.south west) {\tiny{$\textbf{Embedding}$}};
+\node [posnode,anchor=north east] (pos1) at ([yshift=-1em]sa1.south east) {\tiny{$\textbf{Position}$}};
+\node [anchor=north] (inputs) at ([yshift=-3em]sa1.south) {\scriptsize{$\textbf{Encoder input: 我\ \ 很\ \ 好}$}};
+\node [anchor=south] (encoder) at ([xshift=0.2em,yshift=0.6em]res2.north west) {\scriptsize{\textbf{Encoder}}};
+\draw [->] (sa1.north) -- (res1.south);
+\draw [->] (res1.north) -- (ffn1.south);
+\draw [->] (ffn1.north) -- (res2.south);
+\draw [->] ([yshift=-1em]sa1.south) -- (sa1.south);
+\draw [->] ([yshift=-0.3em]inputs.north) -- ([yshift=0.6em]inputs.north);
+\node [Sanode,anchor=west] (sa2) at ([xshift=3em]sa1.east) {\tiny{$\textbf{Self-Attention}$}};
+\node [Resnode,anchor=south] (res3) at ([yshift=0.3em]sa2.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [Sanode,anchor=south] (ed1) at ([yshift=1em]res3.north) {\tiny{$\textbf{Encoder-Decoder Attention}$}};
+\node [Resnode,anchor=south] (res4) at ([yshift=0.3em]ed1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [ffnnode,anchor=south] (ffn2) at ([yshift=1em]res4.north) {\tiny{$\textbf{Feed Forward Network}$}};
+\node [Resnode,anchor=south] (res5) at ([yshift=0.3em]ffn2.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [outputnode,anchor=south] (o1) at ([yshift=1em]res5.north) {\tiny{$\textbf{Output layer}$}};
+\node [inputnode,anchor=north west] (input2) at ([yshift=-1em]sa2.south west) {\tiny{$\textbf{Embedding}$}};
+\node [posnode,anchor=north east] (pos2) at ([yshift=-1em]sa2.south east) {\tiny{$\textbf{Position}$}};
+\node [anchor=north] (outputs) at ([yshift=-3em]sa2.south) {\scriptsize{$\textbf{Decoder input: $<$sos$>$ I  am  fine}$}};
+\node [anchor=east] (decoder) at ([xshift=-1em,yshift=-1.5em]o1.west) {\scriptsize{\textbf{Decoder}}};
+\node [anchor=north] (decoutputs) at ([yshift=1.5em]o1.north) {\scriptsize{$\textbf{Decoder output: I  am  fine $<$eos$>$ }$}};
+\draw [->] (sa2.north) -- (res3.south);
+\draw [->] (res3.north) -- (ed1.south);
+\draw [->] (ed1.north) -- (res4.south);
+\draw [->] (res4.north) -- (ffn2.south);
+\draw [->] (ffn2.north) -- (res5.south);
+\draw [->] (res5.north) -- (o1.south);
+\draw [->] (o1.north) -- ([yshift=0.5em]o1.north);
+\draw [->] ([yshift=-1em]sa2.south) -- (sa2.south);
+\draw [->] ([yshift=-0.3em]outputs.north) -- ([yshift=0.6em]outputs.north);
+\draw[->,standard] ([yshift=-0.5em]sa1.south) -- ([xshift=-4em,yshift=-0.5em]sa1.south) -- ([xshift=-4em,yshift=2.3em]sa1.south) -- ([xshift=-3.5em,yshift=2.3em]sa1.south);
+\draw[->,standard] ([yshift=0.5em]res1.north) -- ([xshift=-4em,yshift=0.5em]res1.north) -- ([xshift=-4em,yshift=3.3em]res1.north) -- ([xshift=-3.5em,yshift=3.3em]res1.north);
+\draw[->,standard] ([yshift=-0.5em]sa2.south) -- ([xshift=4em,yshift=-0.5em]sa2.south) -- ([xshift=4em,yshift=2.3em]sa2.south) -- ([xshift=3.5em,yshift=2.3em]sa2.south);
+\draw[->,standard] ([yshift=0.5em]res3.north) -- ([xshift=4em,yshift=0.5em]res3.north) -- ([xshift=4em,yshift=3.3em]res3.north) -- ([xshift=3.5em,yshift=3.3em]res3.north);
+\draw[->,standard] ([yshift=0.5em]res4.north) -- ([xshift=4em,yshift=0.5em]res4.north) -- ([xshift=4em,yshift=3.3em]res4.north) -- ([xshift=3.5em,yshift=3.3em]res4.north);
+\draw[->,standard] (res2.north) -- ([yshift=0.5em]res2.north) -- ([xshift=5em,yshift=0.5em]res2.north) -- ([xshift=5em,yshift=-2.2em]res2.north) -- ([xshift=6.5em,yshift=-2.2em]res2.north);
+%\node [rectangle,inner sep=0.7em,rounded corners=1pt,very thick,dotted,draw=ugreen!70] [fit = (sa1) (res1) (ffn1) (res2)] (box0) {};
+%\node [rectangle,inner sep=0.7em,rounded corners=1pt,very thick,dotted,draw=red!60] [fit = (sa2) (res3) (res5)] (box1) {};
+\begin{pgfonlayer}{background}
+	\node [rectangle,inner sep=0.2em,rounded corners=1pt,very thick,dotted,fill=red!40] [fit = (ffn1)] (box1) {};		
+	\node [rectangle,inner sep=0.2em,rounded corners=1pt,very thick,dotted,fill=red!40] [fit = (ffn2)] (box2) {};	
+\end{pgfonlayer}
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-position-of-self-attention-mechanism-in-the-model.tex
+++ b/Chapter12/Figures/figure-position-of-self-attention-mechanism-in-the-model.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{Sanode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!20];
+\tikzstyle{Resnode} = [minimum height=1.1em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
+\tikzstyle{ffnnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
+\tikzstyle{outputnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
+\tikzstyle{inputnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=red!10];
+\tikzstyle{posnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=black!5!white];
+\tikzstyle{standard} = [rounded corners=3pt]
+\node [Sanode,anchor=west] (sa1) at (0,0) {\tiny{$\textbf{Self-Attention}$}};
+\node [Resnode,anchor=south] (res1) at ([yshift=0.3em]sa1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [ffnnode,anchor=south] (ffn1) at ([yshift=1em]res1.north) {\tiny{$\textbf{Feed Forward Network}$}};
+\node [Resnode,anchor=south] (res2) at ([yshift=0.3em]ffn1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [inputnode,anchor=north west] (input1) at ([yshift=-1em]sa1.south west) {\tiny{$\textbf{Embedding}$}};
+\node [posnode,anchor=north east] (pos1) at ([yshift=-1em]sa1.south east) {\tiny{$\textbf{Position}$}};
+\node [anchor=north] (inputs) at ([yshift=-3em]sa1.south) {\scriptsize{$\textbf{Encoder input: 我\ \ 很\ \ 好}$}};
+\node [anchor=south] (encoder) at ([xshift=0.2em,yshift=0.6em]res2.north west) {\scriptsize{\textbf{Encoder}}};
+\draw [->] (sa1.north) -- (res1.south);
+\draw [->] (res1.north) -- (ffn1.south);
+\draw [->] (ffn1.north) -- (res2.south);
+\draw [->] ([yshift=-1em]sa1.south) -- (sa1.south);
+\draw [->] ([yshift=-0.3em]inputs.north) -- ([yshift=0.6em]inputs.north);
+\node [Sanode,anchor=west] (sa2) at ([xshift=3em]sa1.east) {\tiny{$\textbf{Self-Attention}$}};
+\node [Resnode,anchor=south] (res3) at ([yshift=0.3em]sa2.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [Sanode,anchor=south] (ed1) at ([yshift=1em]res3.north) {\tiny{$\textbf{Encoder-Decoder Attention}$}};
+\node [Resnode,anchor=south] (res4) at ([yshift=0.3em]ed1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [ffnnode,anchor=south] (ffn2) at ([yshift=1em]res4.north) {\tiny{$\textbf{Feed Forward Network}$}};
+\node [Resnode,anchor=south] (res5) at ([yshift=0.3em]ffn2.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [outputnode,anchor=south] (o1) at ([yshift=1em]res5.north) {\tiny{$\textbf{Output layer}$}};
+\node [inputnode,anchor=north west] (input2) at ([yshift=-1em]sa2.south west) {\tiny{$\textbf{Embedding}$}};
+\node [posnode,anchor=north east] (pos2) at ([yshift=-1em]sa2.south east) {\tiny{$\textbf{Position}$}};
+\node [anchor=north] (outputs) at ([yshift=-3em]sa2.south) {\scriptsize{$\textbf{Decoder input: $<$sos$>$ I  am  fine}$}};
+\node [anchor=east] (decoder) at ([xshift=-1em,yshift=-1.5em]o1.west) {\scriptsize{\textbf{Decoder}}};
+\node [anchor=north] (decoutputs) at ([yshift=1.5em]o1.north) {\scriptsize{$\textbf{Decoder output: I  am  fine $<$eos$>$ }$}};
+\draw [->] (sa2.north) -- (res3.south);
+\draw [->] (res3.north) -- (ed1.south);
+\draw [->] (ed1.north) -- (res4.south);
+\draw [->] (res4.north) -- (ffn2.south);
+\draw [->] (ffn2.north) -- (res5.south);
+\draw [->] (res5.north) -- (o1.south);
+\draw [->] (o1.north) -- ([yshift=0.5em]o1.north);
+\draw [->] ([yshift=-1em]sa2.south) -- (sa2.south);
+\draw [->] ([yshift=-0.3em]outputs.north) -- ([yshift=0.6em]outputs.north);
+\draw[->,standard] ([yshift=-0.5em]sa1.south) -- ([xshift=-4em,yshift=-0.5em]sa1.south) -- ([xshift=-4em,yshift=2.3em]sa1.south) -- ([xshift=-3.5em,yshift=2.3em]sa1.south);
+\draw[->,standard] ([yshift=0.5em]res1.north) -- ([xshift=-4em,yshift=0.5em]res1.north) -- ([xshift=-4em,yshift=3.3em]res1.north) -- ([xshift=-3.5em,yshift=3.3em]res1.north);
+\draw[->,standard] ([yshift=-0.5em]sa2.south) -- ([xshift=4em,yshift=-0.5em]sa2.south) -- ([xshift=4em,yshift=2.3em]sa2.south) -- ([xshift=3.5em,yshift=2.3em]sa2.south);
+\draw[->,standard] ([yshift=0.5em]res3.north) -- ([xshift=4em,yshift=0.5em]res3.north) -- ([xshift=4em,yshift=3.3em]res3.north) -- ([xshift=3.5em,yshift=3.3em]res3.north);
+\draw[->,standard] ([yshift=0.5em]res4.north) -- ([xshift=4em,yshift=0.5em]res4.north) -- ([xshift=4em,yshift=3.3em]res4.north) -- ([xshift=3.5em,yshift=3.3em]res4.north);
+\draw[->,standard] (res2.north) -- ([yshift=0.5em]res2.north) -- ([xshift=5em,yshift=0.5em]res2.north) -- ([xshift=5em,yshift=-2.2em]res2.north) -- ([xshift=6.5em,yshift=-2.2em]res2.north);
+%\node [rectangle,inner sep=0.7em,rounded corners=1pt,very thick,dotted,draw=ugreen!70] [fit = (sa1) (res1) (ffn1) (res2)] (box0) {};
+%\node [rectangle,inner sep=0.7em,rounded corners=1pt,very thick,dotted,draw=red!60] [fit = (sa2) (res3) (res5)] (box1) {};
+\begin{pgfonlayer}{background}
+	\node [rectangle,inner sep=0.2em,rounded corners=1pt,very thick,dotted,fill=red!40] [fit = (sa1)] (box1) {};		
+	\node [rectangle,inner sep=0.2em,rounded corners=1pt,very thick,dotted,fill=red!40] [fit = (sa2)] (box2) {};	
+	\node [rectangle,inner sep=0.2em,rounded corners=1pt,very thick,dotted,fill=red!40] [fit = (ed1)] (box3) {};	
+\end{pgfonlayer}
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-process-of-5.tex
+++ b/Chapter12/Figures/figure-process-of-5.tex
+\begin{tikzpicture}
+\node(atten) at (0,0){Attention(};
+%%%% Q
+\node(tbq) at ([xshift=0.5em,yshift=0]atten.east){
+\begin{tabular}{|c|}
+\hline
+\rowcolor{yellow!20}  \\ \hline 
+\rowcolor{yellow!20}  \\ \hline
+\rowcolor{yellow!20}  \\ \hline 
+\end{tabular}
+};
+\node at  ([xshift=0em,yshift=0.5em]tbq.north){$\vectorn{\emph{Q}}$};
+\node(comma1) at ([xshift=0.15em,yshift=-2em]tbq.east){,};
+%%%% k
+\node(tbk) at ([xshift=1em,yshift=0]tbq.east){
+\begin{tabular}{|c|}
+\hline
+\rowcolor{blue!20}  \\ \hline 
+\rowcolor{blue!20}  \\ \hline
+\rowcolor{blue!20}  \\ \hline 
+\end{tabular}
+};
+\node at  ([xshift=0em,yshift=0.5em]tbk.north){$\vectorn{\emph{K}}$};
+\node(comma2) at ([xshift=0.15em,yshift=-2em]tbk.east){,};
+%%%% v
+\node(tbv) at ([xshift=1em,yshift=0]tbk.east){
+\begin{tabular}{|c|}
+\hline
+\rowcolor{orange!20}  \\ \hline 
+\rowcolor{orange!20}  \\ \hline
+\rowcolor{orange!20}  \\ \hline 
+\end{tabular}
+};
+\node at  ([xshift=0em,yshift=0.5em]tbv.north){$\vectorn{\emph{V}}$};
+\node(bra) at ([xshift=0.3em,yshift=0]tbv.east){)};
+\node(eq1) at ([xshift=0.5em,yshift=0]bra.east){=};
+\node(sof1) at ([xshift=2em,yshift=0]eq1.east){Softmax(};
+%-----------------------------------------------------------
+%QK+MASK
+\node(tbq2) at ([xshift=0.5em,yshift=2em]sof1.east){
+\begin{tabular}{|c|}
+\hline
+\rowcolor{yellow!20}  \\ \hline 
+\rowcolor{yellow!20}  \\ \hline
+\rowcolor{yellow!20}  \\ \hline 
+\end{tabular}
+};
+\node at  ([xshift=0em,yshift=0.5em]tbq2.north){$\vectorn{\emph{Q}}$};
+% x
+\node (times) at  ([xshift=1em,yshift=0em]tbq2.east){$\times$};
+%k
+\node(tbk2) at ([xshift=2em,yshift=0em]times.east){
+\begin{tabular}{|l|l|l|}
+\hline
+\cellcolor{blue!20} & \cellcolor{blue!20} &\cellcolor{blue!20}  \\ \hline
+\end{tabular}
+};
+\node at  ([xshift=0em,yshift=0.5em]tbk2.north){$\vectorn{\emph{K}}^{\mathrm{T}}$};
+\draw [-] (5.6,-0.2) -- (8,-0.2);
+\node at  ([xshift=0em,yshift=-3em]times.south){$\sqrt{d_k}$};
+% MASK
+\node(mask) at  ([xshift=3em,yshift=-2em]tbk2.east){
+\begin{tabular}{|l|l|l|}
+\hline
+\cellcolor{green!20} &\cellcolor{green!20}   &\cellcolor{green!20}   \\ \hline
+ \cellcolor{green!20} &\cellcolor{green!20}   &\cellcolor{green!20}   \\ \hline
+ \cellcolor{green!20} &\cellcolor{green!20}   &\cellcolor{green!20}   \\ \hline
+\end{tabular}
+};
+\node at  ([xshift=0em,yshift=0.5em]mask.north){$\vectorn{\emph{Mask}}$};
+%+
+\node at  ([xshift=-0.6em,yshift=0em]mask.west){$+$};
+% )
+\node at  ([xshift=0.2em,yshift=0em]mask.east){)};
+%%%% v
+\node(tbv2) at ([xshift=1.2em,yshift=0]mask.east){
+\begin{tabular}{|c|}
+\hline
+\rowcolor{orange!20}  \\ \hline 
+\rowcolor{orange!20}  \\ \hline
+\rowcolor{orange!20}  \\ \hline 
+\end{tabular}
+};
+\node at  ([xshift=0em,yshift=0.5em]tbv2.north){$\vectorn{\emph{V}}$};
+%------------------------------
+% Second row
+\node(eq2) at  ([xshift=0em,yshift=-6em]eq1.south){=};
+\node(sof2) at ([xshift=2em,yshift=0]eq2.east){Softmax(};
+% Middle pink matrix
+\node(mid) at  ([xshift=1.5em,yshift=0em]sof2.east){
+\begin{tabular}{|l|l|l|}
+\hline
+\cellcolor{pink!30} &\cellcolor{pink!30}   &\cellcolor{pink!30}   \\ \hline
+ \cellcolor{pink!30} &\cellcolor{pink!30}   &\cellcolor{pink!30}   \\ \hline
+ \cellcolor{pink!30} &\cellcolor{pink!30}   &\cellcolor{pink!30}   \\ \hline
+\end{tabular}
+};
+% )
+\node(bra2) at ([xshift=0.2em,yshift=0]mid.east){)};
+% Red box
+\node[rectangle,minimum width=4.0em,minimum height=1.5em,draw=red](p222) at([xshift=0em,yshift=-1.0em]mid.north) {};
+%%%% v
+\node(tbv3) at ([xshift=0.5em,yshift=0]bra2.east){
+\begin{tabular}{|c|}
+\hline
+\rowcolor{orange!20}  \\ \hline 
+\rowcolor{orange!20}  \\ \hline
+\rowcolor{orange!20}  \\ \hline 
+\end{tabular}
+};
+\node at  ([xshift=0em,yshift=0.5em]tbv3.north){$\vectorn{\emph{V}}$};
+%------------------------------------
+% Third row
+\node(eq3) at  ([xshift=0em,yshift=-6em]eq2.south){=};
+%%%% Softmax result: red matrix
+\node(result) at ([xshift=2em,yshift=0]eq3.east){
+\begin{tabular}{|l|l|l|}
+\hline
+\cellcolor{red!20} &\cellcolor{red!20}   &\cellcolor{red!20}   \\ \hline
+\cellcolor{red!20}&\cellcolor{red!20}   &\cellcolor{red!20}   \\ \hline
+\cellcolor{red!20} &\cellcolor{red!20}  &\cellcolor{red!20}   \\ \hline
+\end{tabular}
+};
+% x
+\node (times) at  ([xshift=0.5em,yshift=0em]result.east){$\times$};
+%%%% v
+\node(tbv4) at ([xshift=0.5em,yshift=0]times.east){
+\begin{tabular}{|c|}
+\hline
+\rowcolor{orange!20}  \\ \hline 
+\rowcolor{orange!20}  \\ \hline
+\rowcolor{orange!20}  \\ \hline 
+\end{tabular}
+};
+\node at  ([xshift=0em,yshift=0.5em]tbv4.north){$\vectorn{\emph{V}}$};
+%=
+\node(eq4) at  ([xshift=0.5em,yshift=0em]tbv4.east){=};
+%%%% Gray matrix
+\node(gre) at ([xshift=0.5em,yshift=0]eq4.east){
+\begin{tabular}{|c|}
+\hline
+\rowcolor{black!15}  \\ \hline 
+\rowcolor{black!15}  \\ \hline
+\rowcolor{black!15}  \\ \hline 
+\end{tabular}
+};
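+% Note: a sketch of the computation drawn above, written as a single equation.
+% It only restates what the figure shows; the symbols reuse the labels in this file.
+% The figure uses a 3-position example, so QK^T and Mask are both 3x3 and Softmax is
+% taken over each row (the red box marks one such row). Kept as a comment so that the
+% compiled figure is unchanged.
+% \begin{equation}
+% \textrm{Attention}(\vectorn{\emph{Q}},\vectorn{\emph{K}},\vectorn{\emph{V}}) = \textrm{Softmax}\Big(\frac{\vectorn{\emph{Q}}\vectorn{\emph{K}}^{\mathrm{T}}}{\sqrt{d_k}} + \vectorn{\emph{Mask}}\Big)\vectorn{\emph{V}}
+% \end{equation}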
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-query-model-corresponding-to-attention-mechanism.tex
+++ b/Chapter12/Figures/figure-query-model-corresponding-to-attention-mechanism.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{rnode} = [draw,minimum width=3.5em,minimum height=1.2em]
+\node [rnode,anchor=south west,fill=red!20!white] (value1) at (0,0) {\scriptsize{$\vectorn{\emph{h}}(\textrm{“你”})$}};
+\node [rnode,anchor=south west,fill=red!20!white] (value2) at ([xshift=1em]value1.south east) {\scriptsize{$\vectorn{\emph{h}}(\textrm{“什么”})$}};
+\node [rnode,anchor=south west,fill=red!20!white] (value3) at ([xshift=1em]value2.south east) {\scriptsize{$\vectorn{\emph{h}}(\textrm{“也”})$}};
+\node [rnode,anchor=south west,fill=red!20!white] (value4) at ([xshift=1em]value3.south east) {\scriptsize{$\vectorn{\emph{h}}(\textrm{“没”})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key1) at ([yshift=0.2em]value1.north west) {\scriptsize{$\vectorn{\emph{h}}(\textrm{“你”})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key2) at ([yshift=0.2em]value2.north west) {\scriptsize{$\vectorn{\emph{h}}(\textrm{“什么”})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key3) at ([yshift=0.2em]value3.north west) {\scriptsize{$\vectorn{\emph{h}}(\textrm{“也”})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key4) at ([yshift=0.2em]value4.north west) {\scriptsize{$\vectorn{\emph{h}}(\textrm{“没”})$}};
+\node [rnode,anchor=east] (query) at ([xshift=-2em]key1.west) {\scriptsize{$\vectorn{\emph{s}}(\textrm{“you”})$}};
+\node [anchor=east] (querylabel) at ([xshift=-0.2em]query.west) {\scriptsize{query}};
+\draw [->] ([yshift=1pt,xshift=6pt]query.north) .. controls +(90:1em) and +(90:1em) .. ([yshift=1pt]key1.north);
+\draw [->] ([yshift=1pt,xshift=3pt]query.north) .. controls +(90:1.5em) and +(90:1.5em) .. ([yshift=1pt]key2.north);
+\draw [->] ([yshift=1pt]query.north) .. controls +(90:2em) and +(90:2em) .. ([yshift=1pt]key3.north);
+\draw [->] ([yshift=1pt,xshift=-3pt]query.north) .. controls +(90:2.5em) and +(90:2.5em) .. ([yshift=1pt]key4.north);
+\node [anchor=south east] (alpha1) at ([xshift=1em]key1.north east) {\scriptsize{$\alpha_1=.4$}};
+\node [anchor=south east] (alpha2) at ([xshift=1em]key2.north east) {\scriptsize{$\alpha_2=.4$}};
+\node [anchor=south east] (alpha3) at ([xshift=1em]key3.north east) {\scriptsize{$\alpha_3=0$}};
+\node [anchor=south east] (alpha4) at ([xshift=1em]key4.north east) {\scriptsize{$\alpha_4=.1$}};
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-query-model-corresponding-to-traditional-query-model-vs-attention-mechanism.tex
+++ b/Chapter12/Figures/figure-query-model-corresponding-to-traditional-query-model-vs-attention-mechanism.tex
+%-----------------------------------------------------
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{rnode} = [draw,minimum width=3em,minimum height=1.2em]
+\node [rnode,anchor=south west,fill=blue!20!white] (value1) at (0,0) {\scriptsize{value$_1$}};
+\node [rnode,anchor=south west,fill=blue!20!white] (value2) at ([xshift=1em]value1.south east) {\scriptsize{value$_2$}};
+\node [rnode,anchor=south west,fill=red!20!white] (value3) at ([xshift=1em]value2.south east) {\scriptsize{value$_3$}};
+\node [rnode,anchor=south west,fill=blue!20!white] (value4) at ([xshift=1em]value3.south east) {\scriptsize{value$_4$}};
+\node [rnode,anchor=south west,pattern=north east lines] (key1) at ([yshift=0.2em]value1.north west) {};
+\node [rnode,anchor=south west,pattern=dots] (key2) at ([yshift=0.2em]value2.north west) {};
+\node [rnode,anchor=south west,pattern=horizontal lines] (key3) at ([yshift=0.2em]value3.north west) {};
+\node [rnode,anchor=south west,pattern=crosshatch dots] (key4) at ([yshift=0.2em]value4.north west) {};
+\node [fill=white,inner sep=1pt] (key1label) at (key1) {\scriptsize{key$_1$}};
+\node [fill=white,inner sep=1pt] (key2label) at (key2) {\scriptsize{key$_2$}};
+\node [fill=white,inner sep=1pt] (key3label) at (key3) {\scriptsize{key$_3$}};
+\node [fill=white,inner sep=1pt] (key4label) at (key4) {\scriptsize{key$_4$}};
+\node [rnode,anchor=east,pattern=horizontal lines] (query) at ([xshift=-3em]key1.west) {};
+\node [anchor=east] (querylabel) at ([xshift=-0.2em]query.west) {\scriptsize{query}};
+\draw [->] ([yshift=1pt]query.north) .. controls +(90:2em) and +(90:2em) .. ([yshift=1pt]key3.north) node [pos=0.5,below,yshift=0.2em] {\scriptsize{match}};
+\node [anchor=north] (result) at (value3.south) {\scriptsize{ {\red returned result} }};
+\end{scope}
+\end{tikzpicture}
--- a/Chapter12/Figures/figure-query-model-corresponding-to-traditional-query-model-vs-attention-mechanism02.tex
+++ b/Chapter12/Figures/figure-query-model-corresponding-to-traditional-query-model-vs-attention-mechanism02.tex
+%-----------------------------------------------------
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{rnode} = [draw,minimum width=3em,minimum height=1.2em]
+\node [rnode,anchor=south west,fill=red!20!white] (value1) at (0,0) {\scriptsize{value$_1$}};
+\node [rnode,anchor=south west,fill=red!20!white] (value2) at ([xshift=1em]value1.south east) {\scriptsize{value$_2$}};
+\node [rnode,anchor=south west,fill=red!20!white] (value3) at ([xshift=1em]value2.south east) {\scriptsize{value$_3$}};
+\node [rnode,anchor=south west,fill=red!20!white] (value4) at ([xshift=1em]value3.south east) {\scriptsize{value$_4$}};
+\node [rnode,anchor=south west,pattern=north east lines] (key1) at ([yshift=0.2em]value1.north west) {};
+\node [rnode,anchor=south west,pattern=dots] (key2) at ([yshift=0.2em]value2.north west) {};
+\node [rnode,anchor=south west,pattern=horizontal lines] (key3) at ([yshift=0.2em]value3.north west) {};
+\node [rnode,anchor=south west,pattern=crosshatch dots] (key4) at ([yshift=0.2em]value4.north west) {};
+\node [fill=white,inner sep=1pt] (key1label) at (key1) {\scriptsize{key$_1$}};
+\node [fill=white,inner sep=1pt] (key2label) at (key2) {\scriptsize{key$_2$}};
+\node [fill=white,inner sep=1pt] (key3label) at (key3) {\scriptsize{key$_3$}};
+\node [fill=white,inner sep=1pt] (key4label) at (key4) {\scriptsize{key$_4$}};
+\node [rnode,anchor=east,pattern=vertical lines] (query) at ([xshift=-3em]key1.west) {};
+\node [anchor=east] (querylabel) at ([xshift=-0.2em]query.west) {\scriptsize{query}};
+\draw [->] ([yshift=1pt,xshift=6pt]query.north) .. controls +(90:1em) and +(90:1em) .. ([yshift=1pt]key1.north);
+\draw [->] ([yshift=1pt,xshift=3pt]query.north) .. controls +(90:1.5em) and +(90:1.5em) .. ([yshift=1pt]key2.north);
+\draw [->] ([yshift=1pt]query.north) .. controls +(90:2em) and +(90:2em) .. ([yshift=1pt]key3.north);
+\draw [->] ([yshift=1pt,xshift=-3pt]query.north) .. controls +(90:2.5em) and +(90:2.5em) .. ([yshift=1pt]key4.north);
+\node [anchor=south east] (alpha1) at (key1.north east) {\scriptsize{$\alpha_1$}};
+\node [anchor=south east] (alpha2) at (key2.north east) {\scriptsize{$\alpha_2$}};
+\node [anchor=south east] (alpha3) at (key3.north east) {\scriptsize{$\alpha_3$}};
+\node [anchor=south east] (alpha4) at (key4.north east) {\scriptsize{$\alpha_4$}};
+\node [anchor=north] (result) at ([xshift=-1.5em]value2.south east) {\scriptsize{{\red returned result}=$\alpha_1 \cdot \textrm{value}_1 + \alpha_2 \cdot \textrm{value}_2 + \alpha_3 \cdot \textrm{value}_3 + \alpha_4 \cdot \textrm{value}_4$}};
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-residual-network-structure.tex
+++ b/Chapter12/Figures/figure-residual-network-structure.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{lnode} = [minimum height=1.5em,minimum width=3em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!20];
+\tikzstyle{standard} = [rounded corners=3pt]
+\node [lnode,anchor=west] (l1) at (0,0) {\scriptsize{Sublayer 1}};
+\node [lnode,anchor=west] (l2) at ([xshift=3em]l1.east) {\scriptsize{Sublayer 2}};
+\node [lnode,anchor=west] (l3) at ([xshift=3em]l2.east) {\scriptsize{Sublayer 3}};
+\node [anchor=west,inner sep=2pt] (dot1) at ([xshift=1em]l3.east) {\scriptsize{$\textbf{...}$}};
+\node [lnode,anchor=west] (l4) at ([xshift=1em]dot1.east) {\scriptsize{Sublayer $n$}};
+\node [anchor=west] (plus1) at ([xshift=0.9em]l1.east) {\scriptsize{$\mathbf{\oplus}$}};
+\node [anchor=west] (plus2) at ([xshift=0.9em]l2.east) {\scriptsize{$\mathbf{\oplus}$}};
+\draw [->,thick] ([xshift=-1.5em]l1.west) -- ([xshift=-0.1em]l1.west);
+\draw [->,thick] ([xshift=0.1em]l1.east) -- ([xshift=0.2em]plus1.west);
+\draw [->,thick] ([xshift=-0.2em]plus1.east) -- ([xshift=-0.1em]l2.west);
+\draw [->,thick] ([xshift=0.1em]l2.east) -- ([xshift=0.2em]plus2.west);
+\draw [->,thick] ([xshift=-0.2em]plus2.east) -- ([xshift=-0.1em]l3.west);
+\draw [->,thick] ([xshift=0.1em]l3.east) -- ([xshift=-0.1em]dot1.west);
+\draw [->,thick] ([xshift=0.1em]dot1.east) -- ([xshift=-0.1em]l4.west);
+\draw [->,thick] ([xshift=0.1em]l4.east) -- ([xshift=1.5em]l4.east);
+\draw[->,standard,thick] ([xshift=-0.8em]l1.west) -- ([xshift=-0.8em,yshift=2em]l1.west) -- ([yshift=2em]plus1.center) -- ([yshift=-0.2em]plus1.north);
+\draw[->,standard,thick] ([xshift=-0.8em]l2.west) -- ([xshift=-0.8em,yshift=2em]l2.west) -- ([yshift=2em]plus2.center) -- ([yshift=-0.2em]plus2.north);
+\draw [->,very thick,red] ([xshift=1.5em,yshift=-0.3em]l4.east) -- ([xshift=0.1em,yshift=-0.3em]l4.east);
+\draw [->,very thick,red] ([xshift=-0.1em,yshift=-0.3em]l4.west) -- ([xshift=0.1em,yshift=-0.3em]dot1.east);
+\draw [->,very thick,red] ([xshift=-0.1em,yshift=-0.3em]dot1.west) -- ([xshift=0.1em,yshift=-0.3em]l3.east);
+\draw[->,standard,very thick,red] ([xshift=-0.3em,yshift=-0.2em]plus2.north) -- ([xshift=-0.3em,yshift=1.8em]plus2.center) -- ([xshift=-0.5em,yshift=1.8em]l2.west) -- ([xshift=-0.5em,yshift=0.2em]l2.west);
+\draw[->,standard,very thick,red] ([xshift=-0.3em,yshift=-0.2em]plus1.north) -- ([xshift=-0.3em,yshift=1.8em]plus1.center) -- ([xshift=-0.5em,yshift=1.8em]l1.west) -- ([xshift=-0.5em,yshift=0.2em]l1.west);
+\node [anchor=west] (label1) at ([xshift=1em,yshift=1.5em]l3.north) {\tiny{Forward computation}};
+\draw [->,thick] ([xshift=-1.5em]label1.west) -- ([xshift=-0.1em]label1.west);
+\node [anchor=west] (label2) at ([xshift=2.5em]label1.east) {\tiny{Backpropagation}};
+\draw [->,thick,red] ([xshift=-1.5em]label2.west) -- ([xshift=-0.1em]label2.west);
+\end{scope}
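+% Note: a sketch of what the figure depicts, under stated assumptions. The names
+% x_l (input of sublayer l), F (the function computed by that sublayer) and L (the
+% training loss) are introduced here for illustration only; they are not defined in this file.
+% Kept as a comment so that the compiled figure is unchanged.
+% \begin{eqnarray}
+% x_{l+1} & = & x_l + F(x_l) \\
+% \frac{\partial L}{\partial x_l} & = & \frac{\partial L}{\partial x_{l+1}} \Big(1 + \frac{\partial F(x_l)}{\partial x_l}\Big)
+% \end{eqnarray}
+% The first line is the black forward path through each $\oplus$ node. The constant 1 in
+% the second line corresponds to the red shortcut arrows: part of the gradient reaches
+% x_l directly, without passing through the sublayer.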
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-self-att-vs-enco-deco-att.tex
+++ b/Chapter12/Figures/figure-self-att-vs-enco-deco-att.tex
+\begin{tikzpicture}
+\node[rounded corners=1pt,minimum width=11.0em,minimum height=2.0em,fill=pink!30,draw=black](p1) at (0,0) {\small{Self-Attention}};
+\node[anchor=north](word1) at ([xshift=0.0em,yshift=-2.0em]p1.south) {\small \vectorn{\emph{K}}};
+\node[anchor=west](word2) at ([xshift=2.2em]word1.east) {\small \vectorn{\emph{V}}};
+\node[anchor=east](word3) at ([xshift=-2.2em]word1.west) {\small \vectorn{\emph{Q}}};
+\draw[->,thick](word1.north)--(p1.south);
+\draw[->,thick]([xshift=-3.6em]word1.north)--([xshift=-3.6em]p1.south);
+\draw[->,thick]([xshift=3.6em]word1.north)--([xshift=3.6em]p1.south);
+\node[anchor=north,rounded corners=1pt,minimum width=11.0em,minimum height=3.5em,draw=ugreen!70,very thick,dotted](p1-1) at ([yshift=-5.2em]p1.south) {\small{Decoder-side representations}};
+\draw [->,thick,dashed] (word3.south) .. controls +(south:1.5em) and +(north:1.5em) .. ([xshift=-0.4em]p1-1.north);
+\draw [->,thick,dashed](word1.south) --(p1-1.north);
+\draw [->,thick,dashed] (word2.south) .. controls +(south:1.0em) and +(north:1.5em) .. ([xshift=0.4em]p1-1.north);
+\node[anchor=north](caption1) at ([xshift=0.0em,yshift=-9.5em]p1.south){\small{(a) Inputs of Self-Attention}};
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\node[anchor=west,rounded corners=1pt,minimum width=14.0em,minimum height=2.0em,fill=pink!30,draw=black](p2) at ([xshift=5.0em]p1.east){\small{Encoder-Decoder Attention}};
+\node[anchor=north](word1-2) at ([xshift=0.0em,yshift=-2.0em]p2.south) {\small \vectorn{\emph{K}}};
+\node[anchor=west](word2-2) at ([xshift=2.2em]word1-2.east) {\small \vectorn{\emph{V}}};
+\node[anchor=east](word3-2) at ([xshift=-2.2em]word1-2.west) {\small \vectorn{\emph{Q}}};
+\draw[->,thick](word1-2.north)--(p2.south);
+\draw[->,thick]([xshift=-3.6em]word1-2.north)--([xshift=-3.6em]p2.south);
+\draw[->,thick]([xshift=3.6em]word1-2.north)--([xshift=3.6em]p2.south);
+\node[anchor=north,rounded corners=1pt](p2-1) at ([xshift=-3.55em,yshift=-5.5em]p2.south) {\small{Decoder-side}};
+\node[anchor=north,rounded corners=1pt](p2-2) at ([xshift=-3.55em,yshift=-6.8em]p2.south) {\small{representations}};
+\begin{pgfonlayer}{background}
+{
+\node[rounded corners=1pt,draw=ugreen!70,very thick,dotted] [fit = (p2-1) (p2-2)] (p2-12) {};
+}
+\end{pgfonlayer}
+\node[anchor=north,rounded corners=1pt](p2-3) at ([xshift=3.55em,yshift=-5.5em]p2.south) {\small{Encoder-side}};
+\node[anchor=north,rounded corners=1pt](p2-4) at ([xshift=3.55em,yshift=-6.8em]p2.south) {\small{representations}};
+\begin{pgfonlayer}{background}
+{
+\node[rounded corners=1pt,draw=ugreen!70,very thick,dotted] [fit = (p2-3) (p2-4)] (p2-34) {};
+}
+\end{pgfonlayer}
+\draw[<-,thick,dashed]([xshift=-3.6em,yshift=-3.2em]word1-2.north)--([xshift=-3.6em,yshift=-3.2em]p2.south);
+\draw[<-,thick,dashed]([xshift=3.6em,yshift=-3.2em]word1-2.north)--([xshift=3.6em,yshift=-3.2em]p2.south);
+\draw [->,thick,dashed] (word1-2.south) .. controls +(south:1em) and +(north:1.5em) .. ([yshift=0.3em,xshift=-0.4em]p2-3.north);
+\node[anchor=north](caption2) at ([xshift=0.0em,yshift=-9.5em]p2.south){\small{(b) Inputs of Encoder-Decoder Attention}};
+    \end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-structure-of-the-network-during-transformer-training.tex
+++ b/Chapter12/Figures/figure-structure-of-the-network-during-transformer-training.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{rnnnode} = [minimum height=1.1em,minimum width=2.1em,inner sep=2pt,rounded corners=1pt,draw,fill=red!20];
+\tikzstyle{lossnode} = [minimum height=1.1em,minimum width=6em,inner sep=2pt,rounded corners=1pt,draw,fill=red!20];
+\node [rnnnode,anchor=west] (h1) at (0,0) {\tiny{$\textbf{h}_1$}};
+\node [rnnnode,anchor=west] (h2) at ([xshift=1em]h1.east) {\tiny{$\textbf{h}_2$}};
+\node [rnnnode,anchor=west] (h3) at ([xshift=1em]h2.east) {\tiny{$\textbf{h}_3$}};
+\node [rnnnode,anchor=north,fill=green!20] (e1) at ([yshift=-1em]h1.south) {\tiny{$e_x()$}};
+\node [rnnnode,anchor=west,fill=green!20] (e2) at ([xshift=1em]e1.east) {\tiny{$e_x()$}};
+\node [rnnnode,anchor=west,fill=green!20] (e3) at ([xshift=1em]e2.east) {\tiny{$e_x()$}};
+\node [anchor=north,inner sep=2pt] (w1) at ([yshift=-0.6em]e1.south) {\tiny{你}};
+\node [anchor=north,inner sep=2pt] (w2) at ([yshift=-0.6em]e2.south) {\tiny{好}};
+\node [anchor=north,inner sep=2pt] (w3) at ([yshift=-0.6em]e3.south) {\tiny{$\langle$eos$\rangle$}};
+\node [anchor=south] (dot1) at ([xshift=0.4em,yshift=-0.7em]h1.south) {\tiny{...}};
+\node [anchor=south] (dot2) at ([xshift=-0.4em,yshift=-0.7em]h3.south) {\tiny{...}};
+\draw [->] (w1.north) -- ([yshift=-0.1em]e1.south);
+\draw [->] (w2.north) -- ([yshift=-0.1em]e2.south);
+\draw [->] (w3.north) -- ([yshift=-0.1em]e3.south);
+\draw [->] ([yshift=0.1em]e1.north) -- ([yshift=-0.1em]h1.south);
+\draw [->] ([yshift=0.1em]e2.north) -- ([yshift=-0.1em]h2.south);
+\draw [->] ([yshift=0.1em]e3.north) -- ([yshift=-0.1em]h3.south);
+\draw [->] ([xshift=0.2em,yshift=0.1em]e1.north) .. controls +(north:0.3) and +(south:0.4) .. ([xshift=-0.3em,yshift=-0.1em]h2.south);
+\draw [->] ([xshift=-0.2em,yshift=0.1em]e3.north) .. controls +(north:0.3) and +(south:0.4) .. ([xshift=0.3em,yshift=-0.1em]h2.south);
+\node [anchor=south] (encoder) at ([xshift=-0.2em]h1.north west) {\scriptsize{\textbf{Encoder}}};
+{
+\node [rnnnode,anchor=west,fill=green!20] (t1) at ([xshift=3em]e3.east) {\tiny{$e_y()$}};
+\node [rnnnode,anchor=west,fill=green!20] (t2) at ([xshift=1.5em]t1.east) {\tiny{$e_y()$}};
+\node [rnnnode,anchor=west,fill=green!20] (t3) at ([xshift=1.5em]t2.east) {\tiny{$e_y()$}};
+\node [rnnnode,anchor=west,fill=green!20] (t4) at ([xshift=1.5em]t3.east) {\tiny{$e_y()$}};
+}
+{
+\node [rnnnode,anchor=south] (s1) at ([yshift=1em]t1.north) {\tiny{$\textbf{s}_1$}};
+\node [rnnnode,anchor=south] (s2) at ([yshift=1em]t2.north) {\tiny{$\textbf{s}_2$}};
+\node [rnnnode,anchor=south] (s3) at ([yshift=1em]t3.north) {\tiny{$\textbf{s}_3$}};
+\node [rnnnode,anchor=south] (s4) at ([yshift=1em]t4.north) {\tiny{$\textbf{s}_4$}};
+%\node [anchor=south] (dot3) at ([xshift=-0.4em,yshift=-0.7em]s3.south) {\tiny{...}};
+\node [anchor=south] (dot4) at ([xshift=-0.4em,yshift=-0.7em]s4.south) {\tiny{...}};
+\draw [->] ([xshift=-0.6em,yshift=-0.5em]s3.south) .. controls +(north:0) and +(south:0.2) .. ([xshift=-0.3em,yshift=-0.1em]s3.south);
+    \draw [->] ([xshift=-1.5em,yshift=-0.5em]s3.south) .. controls +(north:0) and +(south:0.15) .. ([xshift=-0.6em,yshift=-0.1em]s3.south);
+}
+{
+\node [rnnnode,anchor=south] (f1) at ([yshift=1em]s1.north) {\tiny{$\textbf{f}_1$}};
+\node [rnnnode,anchor=south] (f2) at ([yshift=1em]s2.north) {\tiny{$\textbf{f}_2$}};
+\node [rnnnode,anchor=south] (f3) at ([yshift=1em]s3.north) {\tiny{$\textbf{f}_3$}};
+\node [rnnnode,anchor=south] (f4) at ([yshift=1em]s4.north) {\tiny{$\textbf{f}_4$}};
+\node [rnnnode,anchor=south,fill=blue!20] (o1) at ([yshift=1em]f1.north) {\tiny{softmax}};
+\node [rnnnode,anchor=south,fill=blue!20] (o2) at ([yshift=1em]f2.north) {\tiny{softmax}};
+\node [rnnnode,anchor=south,fill=blue!20] (o3) at ([yshift=1em]f3.north) {\tiny{softmax}};
+\node [rnnnode,anchor=south,fill=blue!20] (o4) at ([yshift=1em]f4.north) {\tiny{softmax}};
+\node [anchor=east] (decoder) at ([xshift=-0.3em,yshift=0.5em]o1.north west) {\scriptsize{\textbf{Decoder}}};
+\node [anchor=south,fill=black!5!white,minimum height=1.1em,minimum width=13em,inner sep=2pt,rounded corners=1pt,draw] (loss) at ([xshift=1.8em,yshift=1em]o2.north) {\scriptsize{\textbf{Cross Entropy Loss}}};
+}
+{
+\node [anchor=north,inner sep=2pt] (wt1) at ([yshift=-0.6em]t1.south) {\tiny{$\langle$eos$\rangle$}};
+\node [anchor=north,inner sep=2pt] (wt2) at ([yshift=-0.6em]t2.south) {\tiny{How}};
+\node [anchor=north,inner sep=2pt] (wt3) at ([yshift=-0.8em]t3.south) {\tiny{are}};
+\node [anchor=north,inner sep=2pt] (wt4) at ([yshift=-0.8em]t4.south) {\tiny{you}};
+}
+{
+\foreach \x in {1,2,3,4}{
+    \draw [->] ([yshift=-0.7em]t\x.south) -- ([yshift=-0.1em]t\x.south);
+    \draw [->] ([yshift=0.1em]t\x.north) -- ([yshift=-0.1em]s\x.south);
+\draw [->] ([xshift=0.2em,yshift=0.1em]t1.north) .. controls +(north:0.3) and +(south:0.3) .. ([xshift=-0.3em,yshift=-0.1em]s2.south);
+}
+}
+{
+\foreach \x in {1,2,3,4}{
+    \draw [->] ([yshift=0.1em]s\x.north) -- ([yshift=-0.1em]f\x.south);
+    \draw [->] ([yshift=0.1em]f\x.north) -- ([yshift=-0.1em]o\x.south);
+    \draw [->] ([yshift=0.1em]o\x.north) -- ([yshift=0.8em]o\x.north);
+}
+}
+{
+\node [circle,draw,anchor=south,inner sep=3pt,fill=orange!20] (c1) at ([yshift=2em]h2.north) {\tiny{$\textbf{C}_1$}};
+\node [anchor=south] (c1label) at (c1.north) {\tiny{\textbf{Encoder-decoder attention: context}}};
+\draw [->] (h1.north) .. controls +(north:0.6) and +(250:0.9) .. (c1.250);
+\draw [->] (h2.north) .. controls +(north:0.6) and +(270:0.9) .. (c1.270);
+\draw [->] (h3.north) .. controls +(north:0.6) and +(290:0.9) .. (c1.290);
+\draw [->] ([yshift=0.3em]s1.west) .. controls +(west:1) and +(east:1) .. (c1.-30);
+\draw [->] (c1.0) .. controls +(east:1) and +(west:1) .. ([yshift=0em]f1.west);
+}
+{
+\node [circle,draw,anchor=north,inner sep=3pt,fill=orange!20] (c2) at ([yshift=-2em]t1.south) {\tiny{$\textbf{C}_2$}};
+\draw [->] ([xshift=-0.7em]c2.west) -- ([xshift=-0.1em]c2.west);
+\draw [->] ([xshift=0.1em]c2.east) .. controls +(east:0.6) and +(west:0.8) ..([yshift=-0.3em,xshift=-0.1em]f2.west);
+\node [circle,draw,anchor=north,inner sep=3pt,fill=orange!20] (c3) at ([yshift=-2em]t2.south) {\tiny{$\textbf{C}_3$}};
+\draw [->] ([xshift=-0.7em]c3.west) -- ([xshift=-0.1em]c3.west);
+\draw [->] ([xshift=0.1em]c3.east) .. controls +(east:0.6) and +(west:0.8) ..([yshift=-0.3em,xshift=-0.1em]f3.west);
+\node [circle,draw,anchor=north,inner sep=3pt,fill=orange!20] (c4) at ([yshift=-2em]t3.south) {\tiny{$\textbf{C}_4$}};
+\draw [->] ([xshift=-0.7em]c4.west) -- ([xshift=-0.1em]c4.west);
+\draw [->] ([xshift=0.1em]c4.east) .. controls +(east:0.6) and +(west:0.8) ..([yshift=-0.3em,xshift=-0.1em]f4.west);
+}
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-transformer-input-and-position-encoding.tex
+++ b/Chapter12/Figures/figure-transformer-input-and-position-encoding.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{Sanode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
+\tikzstyle{Resnode} = [minimum height=1.1em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
+\tikzstyle{ffnnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
+\tikzstyle{outputnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw];
+\tikzstyle{inputnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=red!10];
+\tikzstyle{posnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=black!5!white];
+\tikzstyle{standard} = [rounded corners=3pt]
+\node [Sanode,anchor=west] (sa1) at (0,0) {\tiny{$\textbf{Self-Attention}$}};
+\node [Resnode,anchor=south] (res1) at ([yshift=0.3em]sa1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [ffnnode,anchor=south] (ffn1) at ([yshift=1em]res1.north) {\tiny{$\textbf{Feed Forward Network}$}};
+\node [Resnode,anchor=south] (res2) at ([yshift=0.3em]ffn1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [inputnode,anchor=north west] (input1) at ([yshift=-1em]sa1.south west) {\tiny{$\textbf{Embedding}$}};
+\node [posnode,anchor=north east] (pos1) at ([yshift=-1em]sa1.south east) {\tiny{$\textbf{Position}$}};
+\node [anchor=north] (inputs) at ([yshift=-3em]sa1.south) {\scriptsize{$\textbf{Encoder input: 我\ \ 很\ \ 好}$}};
+\node [anchor=south] (encoder) at ([xshift=0.2em,yshift=0.6em]res2.north west) {\scriptsize{\textbf{Encoder}}};
+\draw [->] (sa1.north) -- (res1.south);
+\draw [->] (res1.north) -- (ffn1.south);
+\draw [->] (ffn1.north) -- (res2.south);
+\draw [->] ([yshift=-1em]sa1.south) -- (sa1.south);
+\draw [->] ([yshift=-0.3em]inputs.north) -- ([yshift=0.6em]inputs.north);
+\node [Sanode,anchor=west] (sa2) at ([xshift=3em]sa1.east) {\tiny{$\textbf{Self-Attention}$}};
+\node [Resnode,anchor=south] (res3) at ([yshift=0.3em]sa2.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [Sanode,anchor=south] (ed1) at ([yshift=1em]res3.north) {\tiny{$\textbf{Encoder-Decoder Attention}$}};
+\node [Resnode,anchor=south] (res4) at ([yshift=0.3em]ed1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [ffnnode,anchor=south] (ffn2) at ([yshift=1em]res4.north) {\tiny{$\textbf{Feed Forward Network}$}};
+\node [Resnode,anchor=south] (res5) at ([yshift=0.3em]ffn2.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [outputnode,anchor=south] (o1) at ([yshift=1em]res5.north) {\tiny{$\textbf{Output layer}$}};
+\node [inputnode,anchor=north west] (input2) at ([yshift=-1em]sa2.south west) {\tiny{$\textbf{Embedding}$}};
+\node [posnode,anchor=north east] (pos2) at ([yshift=-1em]sa2.south east) {\tiny{$\textbf{Position}$}};
+\node [anchor=north] (outputs) at ([yshift=-3em]sa2.south) {\scriptsize{$\textbf{Decoder input: $<$sos$>$ I  am  fine}$}};
+\node [anchor=east] (decoder) at ([xshift=-1em,yshift=-1.5em]o1.west) {\scriptsize{\textbf{Decoder}}};
+\node [anchor=north] (decoutputs) at ([yshift=1.5em]o1.north) {\scriptsize{$\textbf{Decoder output: I  am  fine $<$eos$>$ }$}};
+\draw [->] (sa2.north) -- (res3.south);
+\draw [->] (res3.north) -- (ed1.south);
+\draw [->] (ed1.north) -- (res4.south);
+\draw [->] (res4.north) -- (ffn2.south);
+\draw [->] (ffn2.north) -- (res5.south);
+\draw [->] (res5.north) -- (o1.south);
+\draw [->] (o1.north) -- ([yshift=0.5em]o1.north);
+\draw [->] ([yshift=-1em]sa2.south) -- (sa2.south);
+\draw [->] ([yshift=-0.3em]outputs.north) -- ([yshift=0.6em]outputs.north);
+\draw[->,standard] ([yshift=-0.5em]sa1.south) -- ([xshift=-4em,yshift=-0.5em]sa1.south) -- ([xshift=-4em,yshift=2.3em]sa1.south) -- ([xshift=-3.5em,yshift=2.3em]sa1.south);
+\draw[->,standard] ([yshift=0.5em]res1.north) -- ([xshift=-4em,yshift=0.5em]res1.north) -- ([xshift=-4em,yshift=3.3em]res1.north) -- ([xshift=-3.5em,yshift=3.3em]res1.north);
+\draw[->,standard] ([yshift=-0.5em]sa2.south) -- ([xshift=4em,yshift=-0.5em]sa2.south) -- ([xshift=4em,yshift=2.3em]sa2.south) -- ([xshift=3.5em,yshift=2.3em]sa2.south);
+\draw[->,standard] ([yshift=0.5em]res3.north) -- ([xshift=4em,yshift=0.5em]res3.north) -- ([xshift=4em,yshift=3.3em]res3.north) -- ([xshift=3.5em,yshift=3.3em]res3.north);
+\draw[->,standard] ([yshift=0.5em]res4.north) -- ([xshift=4em,yshift=0.5em]res4.north) -- ([xshift=4em,yshift=3.3em]res4.north) -- ([xshift=3.5em,yshift=3.3em]res4.north);
+\draw[->,standard] (res2.north) -- ([yshift=0.5em]res2.north) -- ([xshift=5em,yshift=0.5em]res2.north) -- ([xshift=5em,yshift=-2.2em]res2.north) -- ([xshift=6.5em,yshift=-2.2em]res2.north);
+%\node [rectangle,inner sep=0.7em,rounded corners=1pt,very thick,dotted,draw=ugreen!70] [fit = (sa1) (res1) (ffn1) (res2)] (box0) {};
+%\node [rectangle,inner sep=0.7em,rounded corners=1pt,very thick,dotted,draw=red!60] [fit = (sa2) (res3) (res5)] (box1) {};
+\begin{pgfonlayer}{background}
+	\node [rectangle,inner sep=0.2em,rounded corners=1pt,very thick,dotted,fill=red!40] [fit = (input1) (pos1)] (box1) {};		
+	\node [rectangle,inner sep=0.2em,rounded corners=1pt,very thick,dotted,fill=red!40] [fit = (input2) (pos2)] (box2) {};	
+\end{pgfonlayer}
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/Figures/figure-transformer.tex
+++ b/Chapter12/Figures/figure-transformer.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{Sanode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!20];
+\tikzstyle{Resnode} = [minimum height=1.1em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=yellow!20];
+\tikzstyle{ffnnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=blue!10];
+\tikzstyle{outputnode} = [minimum height=1.4em,minimum width=7em,inner sep=3pt,rounded corners=1.5pt,draw,fill=blue!30];
+\tikzstyle{inputnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=red!10];
+\tikzstyle{posnode} = [minimum height=1.4em,minimum width=3.5em,inner sep=3pt,rounded corners=1.5pt,draw,fill=black!5!white];
+\tikzstyle{standard} = [rounded corners=3pt]
+\node [Sanode,anchor=west] (sa1) at (0,0) {\tiny{$\textbf{Self-Attention}$}};
+\node [Resnode,anchor=south] (res1) at ([yshift=0.3em]sa1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [ffnnode,anchor=south] (ffn1) at ([yshift=1em]res1.north) {\tiny{$\textbf{Feed Forward Network}$}};
+\node [Resnode,anchor=south] (res2) at ([yshift=0.3em]ffn1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [inputnode,anchor=north west] (input1) at ([yshift=-1em]sa1.south west) {\tiny{$\textbf{Embedding}$}};
+\node [posnode,anchor=north east] (pos1) at ([yshift=-1em]sa1.south east) {\tiny{$\textbf{Position}$}};
+\node [anchor=north] (inputs) at ([yshift=-3em]sa1.south) {\scriptsize{$\textbf{Encoder input: 我\ \ 很\ \ 好}$}};
+\node [anchor=south] (encoder) at ([xshift=0.2em,yshift=0.6em]res2.north west) {\scriptsize{\textbf{Encoder}}};
+\draw [->] (sa1.north) -- (res1.south);
+\draw [->] (res1.north) -- (ffn1.south);
+\draw [->] (ffn1.north) -- (res2.south);
+\draw [->] ([yshift=-1em]sa1.south) -- (sa1.south);
+\draw [->] ([yshift=-0.3em]inputs.north) -- ([yshift=0.6em]inputs.north);
+\node [Sanode,anchor=west] (sa2) at ([xshift=3em]sa1.east) {\tiny{$\textbf{Self-Attention}$}};
+\node [Resnode,anchor=south] (res3) at ([yshift=0.3em]sa2.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [Sanode,anchor=south] (ed1) at ([yshift=1em]res3.north) {\tiny{$\textbf{Encoder-Decoder Attention}$}};
+\node [Resnode,anchor=south] (res4) at ([yshift=0.3em]ed1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [ffnnode,anchor=south] (ffn2) at ([yshift=1em]res4.north) {\tiny{$\textbf{Feed Forward Network}$}};
+\node [Resnode,anchor=south] (res5) at ([yshift=0.3em]ffn2.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
+\node [outputnode,anchor=south] (o1) at ([yshift=1em]res5.north) {\tiny{$\textbf{Output layer}$}};
+\node [inputnode,anchor=north west] (input2) at ([yshift=-1em]sa2.south west) {\tiny{$\textbf{Embedding}$}};
+\node [posnode,anchor=north east] (pos2) at ([yshift=-1em]sa2.south east) {\tiny{$\textbf{Position}$}};
+\node [anchor=north] (outputs) at ([yshift=-3em]sa2.south) {\scriptsize{$\textbf{Decoder input: $<$sos$>$ I  am  fine}$}};
+\node [anchor=east] (decoder) at ([xshift=-1em,yshift=-1.5em]o1.west) {\scriptsize{\textbf{Decoder}}};
+\node [anchor=north] (decoutputs) at ([yshift=1.5em]o1.north) {\scriptsize{$\textbf{Decoder output: I  am  fine $<$eos$>$ }$}};
+\draw [->] (sa2.north) -- (res3.south);
+\draw [->] (res3.north) -- (ed1.south);
+\draw [->] (ed1.north) -- (res4.south);
+\draw [->] (res4.north) -- (ffn2.south);
+\draw [->] (ffn2.north) -- (res5.south);
+\draw [->] (res5.north) -- (o1.south);
+\draw [->] (o1.north) -- ([yshift=0.5em]o1.north);
+\draw [->] ([yshift=-1em]sa2.south) -- (sa2.south);
+\draw [->] ([yshift=-0.3em]outputs.north) -- ([yshift=0.6em]outputs.north);
+\draw[->,standard] ([yshift=-0.5em]sa1.south) -- ([xshift=-4em,yshift=-0.5em]sa1.south) -- ([xshift=-4em,yshift=2.3em]sa1.south) -- ([xshift=-3.5em,yshift=2.3em]sa1.south);
+\draw[->,standard] ([yshift=0.5em]res1.north) -- ([xshift=-4em,yshift=0.5em]res1.north) -- ([xshift=-4em,yshift=3.3em]res1.north) -- ([xshift=-3.5em,yshift=3.3em]res1.north);
+\draw[->,standard] ([yshift=-0.5em]sa2.south) -- ([xshift=4em,yshift=-0.5em]sa2.south) -- ([xshift=4em,yshift=2.3em]sa2.south) -- ([xshift=3.5em,yshift=2.3em]sa2.south);
+\draw[->,standard] ([yshift=0.5em]res3.north) -- ([xshift=4em,yshift=0.5em]res3.north) -- ([xshift=4em,yshift=3.3em]res3.north) -- ([xshift=3.5em,yshift=3.3em]res3.north);
+\draw[->,standard] ([yshift=0.5em]res4.north) -- ([xshift=4em,yshift=0.5em]res4.north) -- ([xshift=4em,yshift=3.3em]res4.north) -- ([xshift=3.5em,yshift=3.3em]res4.north);
+\draw[->,standard] (res2.north) -- ([yshift=0.5em]res2.north) -- ([xshift=5em,yshift=0.5em]res2.north) -- ([xshift=5em,yshift=-2.2em]res2.north) -- ([xshift=6.5em,yshift=-2.2em]res2.north);
+\node [rectangle,inner sep=0.7em,rounded corners=1pt,very thick,dotted,draw=ugreen!70] [fit = (sa1) (res1) (ffn1) (res2)] (box0) {};
+\node [rectangle,inner sep=0.7em,rounded corners=1pt,very thick,dotted,draw=red!60] [fit = (sa2) (res3) (res5)] (box1) {};
+\node [ugreen,font=\scriptsize] (count) at ([xshift=-1.5em,yshift=-1em]encoder.south) {$6\times$};
+\node [red,font=\scriptsize] (count2) at ([xshift=10.8em,yshift=0em]decoder.south) {$\times 6$};
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter12/chapter12.tex
+++ b/Chapter12/chapter12.tex
@@ -21,10 +21,554 @@
 %	CHAPTER 12
 %----------------------------------------------------------------------------------------
-\chapter{神经机器翻译模型训练}
+\chapter{Models Based on Self-Attention}
+%----------------------------------------------------------------------------------------
+%    NEW SECTION  12.1
+%----------------------------------------------------------------------------------------
+\section{The Self-Attention Mechanism}
+\vspace{0.5em}
+\label{sec:12.1}
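+\parinterval How does self-attention differ from ordinary attention? First, recall how a recurrent neural network processes a word sequence. As shown in Figure~\ref{fig:12-36}, for a word sequence $\{ w_1,...,w_m \}$, processing the $m$-th word $w_m$ (the green box) requires the information from the previous time step (i.e., the result of processing $w_{m-1}$), which in turn depends on $w_{m-2}$, and so on. In other words, relating $w_m$ to $w_1$ takes $m-1$ information-passing steps. For long sequences, such long paths cause information to be lost along the way, and this strictly sequential modeling also makes processing slow.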
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-dependencies-between-words-in-a-recurrent-neural-network}
+\caption{Dependencies between words in a recurrent neural network}
+\label{fig:12-36}
+\end{figure}
+%----------------------------------------------
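+\parinterval Is it possible to abandon this sequential information passing and model the relations between words at different positions directly, that is, to shorten the information-passing distance to 1? The {\small\sffamily\bfseries{self-attention mechanism}}\index{Self-Attention} addresses exactly this problem\upcite{DBLP:journals/corr/LinFSYXZB17}. Figure~\ref{fig:12-37} gives an example of modeling a sequence with self-attention. For the word $w_m$, self-attention directly relates it to the preceding $m-1$ words, so the distance between $w_m$ and every other word in the sequence is 1. This handles long-distance dependencies well, and because the connections between words are computed independently of one another, it also greatly increases the parallelism of the model.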
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-dependencies-between-words-of-attention}
+\caption{Dependencies between words under self-attention}
+\label{fig:12-37}
+\end{figure}
+%----------------------------------------------
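+\parinterval Self-attention can also be viewed as a sequence representation model. For example, for each target position $j$, attention generates a corresponding source-sentence representation of the form: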
+\begin{eqnarray}
+\vectorn{\emph{C}}_j = \sum_i \alpha_{i,j}\vectorn{\emph{h}}_i
+\label{eq:12-4201}
+\end{eqnarray}
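+\noindent where $\vectorn{\emph{h}}_i$ is the representation of position $i$ of the source sentence, and $\alpha_{i,j}$ is the attention weight of target position $j$ on $\vectorn{\emph{h}}_i$. Self-attention, however, is not limited to relating sentences in two languages; it can also represent a monolingual sentence. Taking the source sentence as an example, self-attention treats the representation $\vectorn{\emph{h}}_i$ of each position as the $\mathrm{query}$, and treats the representations of all positions as the $\mathrm{key}$ and the $\mathrm{value}$. The model computes how well the current position matches every position, i.e., the attention weights introduced for the attention mechanism, and uses them to take a weighted sum of the $\mathrm{value}$ at each position. The result can be seen as an abstract representation of the current position within the sentence. This process can be stacked several times to form a multi-layer attention model that yields deeper representations of each position of the input sequence.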
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-example-of-self-attention-mechanism-calculation}
+\caption{An example of the self-attention computation}
+\label{fig:12-38}
+\end{figure}
+%----------------------------------------------
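+\parinterval As an example, Figure~\ref{fig:12-38} shows a Chinese sentence of five words. Let $h$(你) denote the current representation of the word ``你'', where $h(\cdot)$ is a function that returns the representation (a vector) of the position of the given word. If ``你'' is taken as the target, the $\mathrm{query}$ is $h$(你), and the $\mathrm{key}$ and $\mathrm{value}$ are the representations of all positions in the figure, namely \{$h$(你), $h$(什么), $h$(也), $h$(没), $h$(学)\}. Self-attention first computes the relevance between the $\mathrm{query}$ and each $\mathrm{key}$; here $\alpha_i$ denotes the relevance between $h$(你) and the representation of position $i$. Then $\alpha_i$ is used as a weight in a weighted sum of the $\mathrm{value}$ at the different positions, which gives the new representation $\tilde{h}$(你):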
+\begin{eqnarray}
+\tilde{h} (\textrm{你} ) & = & \alpha_1 {h} (\textrm{你} ) + \alpha_2 {h} (\textrm{什么}) + \alpha_3 {h} (\textrm{也} ) + \nonumber \\
+                         &   & \alpha_4 {h} (\textrm{没} ) +\alpha_5 {h} (\textrm{学} )  
+\label{eq:12-42}
+\end{eqnarray}
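+\parinterval The other words in the sentence can be processed in the same way. Note that self-attention does not rely on a memory mechanism like that of a recurrent neural network to access history. The information exchanged between all words of the sequence is handled by one and the same operation (the relevance between $\mathrm{query}$ and $\mathrm{key}$). As a result, $\tilde{h} (\textrm{你})$ contains not only information about the word ``你'' but also information about the other words of the sequence; that is, the representation of every position contains information from the other positions. From this perspective, $\tilde{h} (\textrm{你})$ is no longer a representation of the word ``你'' alone, but a representation of global information at the position of ``你''.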
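+\parinterval The text above does not fix how the relevance $\alpha_i$ is computed. As a sketch only, assuming the scaled dot-product scoring that is commonly used with the Transformer (the symbol $d$ for the dimension of $h(\cdot)$ and the name $w_i$ for the $i$-th word are notation introduced here for illustration), the weights could be obtained by normalizing the query-key scores with Softmax:
+\begin{eqnarray}
+\alpha_i & = & \frac{\exp \big( h(\textrm{你}) \cdot h(w_i) / \sqrt{d} \big)}{\sum_{i'=1}^{5} \exp \big( h(\textrm{你}) \cdot h(w_{i'}) / \sqrt{d} \big)} \nonumber
+\end{eqnarray}
+\noindent Under this assumption every $\alpha_i$ is positive and $\sum_{i=1}^{5} \alpha_i = 1$, so $\tilde{h}(\textrm{你})$ is a weighted average of the five position representations.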
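+\parinterval The process of generating \{ $\tilde{h}(\vectorn{\emph{w}}_i)$ \} is usually called {\small\sffamily\bfseries{feature extraction}}\index{Feature Extraction}, and a model that implements it is called a feature extractor. Recurrent neural networks and self-attention models are both typical feature extractors. Feature extraction is a key step of a neural machine translation system, and the following sections show that the self-attention model is a feature extractor well suited to machine translation.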
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
 %----------------------------------------------------------------------------------------
+\sectionnewpage
+\section{The Transformer Architecture}
+This section introduces the origin and overall architecture of the Transformer model.
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
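+\subsection{Why the Transformer Is Needed}
+\parinterval First, recall the recurrent neural networks introduced in {\chapterten}. Powerful as they are, they have drawbacks. The most prominent is that every recurrent unit depends on its predecessor: the computation at the current time step depends on the result of the previous time step. This property allows the ``history'' of the sequence to be passed along, but it also reduces the efficiency of the model. In natural language processing in particular, sequences tend to be long, and both the plain RNN and the more complex LSTM need many recurrent steps to capture a long-distance dependency between words. Because so many recurrent units are involved, passing information between two distant words becomes complicated.
+\parinterval To address these problems, researchers at Google proposed a new model, the Transformer\upcite{NIPS2017_7181}. Unlike traditional models such as recurrent neural networks, the Transformer uses only self-attention and standard feed-forward neural networks, and does not depend on any recurrent unit or convolution. The advantage of self-attention is that it directly models the relation between any two units of a sequence, which makes problems such as long-distance dependencies easier to handle. In addition, self-attention is well suited to parallelization on GPUs, so the model trains faster. Table~\ref{tab:12-11} compares the complexity of the RNN, CNN and Transformer models.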
+%----------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{ Comparison of RNN, CNN and Transformer\upcite{NIPS2017_7181} ($n$ is the sequence length, $d$ is the hidden size, and $k$ is the convolution kernel size) }
+\label{tab:12-11}
+\begin{tabular}{l | l l l}
+\rule{0pt}{20pt} Layer Type & \begin{tabular}[l]{@{}l@{}}Complexity\\ per Layer\end{tabular} & \begin{tabular}[l]{@{}l@{}}Sequential\\ Operations\end{tabular} & \begin{tabular}[l]{@{}l@{}}Maximum\\ Path Length\end{tabular} \\ \hline
+\rule{0pt}{13pt}Self-Attention &$O(n^2\cdot d)$	&$O(1)$	&$O(1)$       \\
+\rule{0pt}{13pt}Recurrent &$O(n \cdot d^2)$		&$O(n)$	&$O(n)$ 	\\
+\rule{0pt}{13pt}Convolutional  &$O(k\cdot n \cdot d^2)$	&$O(1)$	&$O(\mathrm{log}_k(n))$
+\end{tabular}
+\end{table}
+%----------------------------------------------
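+\parinterval To make the per-layer complexities concrete, consider a hypothetical setting with sequence length $n=50$ and hidden size $d=512$ (values chosen here purely for illustration, with constant factors ignored):
+\begin{eqnarray}
+n^2 \cdot d & = & 50^2 \times 512 = 1{,}280{,}000 \nonumber \\
+n \cdot d^2 & = & 50 \times 512^2 = 13{,}107{,}200 \nonumber
+\end{eqnarray}
+\noindent The ratio of the two is $d/n \approx 10$, so in this setting a self-attention layer needs roughly a tenth of the operations of a recurrent layer. More generally, self-attention has the lower per-layer cost whenever $n < d$, which is the usual case for sentence-level translation.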
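+\parinterval After it was proposed, the Transformer quickly swept through the field of natural language processing. The Transformer can in fact be regarded as a representation model, so it is widely used in other areas of natural language processing and has even appeared in image and speech processing. For example, popular pre-trained models such as BERT are based on the Transformer. Table~\ref{tab:12-12} shows the performance of the Transformer on the WMT English-German and English-French translation tasks. It reaches better translation quality than the other models with less computation (FLOPs)\footnote{Here FLOPs stands for floating-point operations, i.e., the total number of floating-point operations used in training. It should not be confused with FLOPS (floating-point operations per second), which measures hardware speed.}.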
+%----------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{ Performance of different translation models\upcite{NIPS2017_7181}}
+\label{tab:12-12}
+\begin{tabular}{l l l l}
+\multicolumn{1}{l|}{\multirow{2}{*}{\#}} & \multicolumn{2}{c}{BLEU} & \multirow{2}{*}{\parbox{6em}{Training Cost (FLOPs)}} \\
+\multicolumn{1}{l|}{}                    & EN-DE  & EN-FR  &                                       \\ \hline
+\multicolumn{1}{l|}{GNMT+RL}             & 24.6            & 39.92           & 1.4$\times 10^{20}$                   \\
+\multicolumn{1}{l|}{ConvS2S}             & 25.16           & 40.46           & 1.5$\times 10^{20}$                   \\
+\multicolumn{1}{l|}{MoE}                 & 26.03           & 40.56           & 1.2$\times 10^{20}$                   \\
+\multicolumn{1}{l|}{Transformer (Big)}    & {\small\sffamily\bfseries{28.4}}   & {\small\sffamily\bfseries{41.8}}   & 2.3$\times 10^{19}$                   \\
+\end{tabular}
+\end{table}
+%----------------------------------------------
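+\parinterval Note that the Transformer is not simply the same thing as self-attention. The Transformer model also includes many other effective techniques, such as multi-head attention and a new learning-rate schedule for training. Together, these components make up the actual Transformer. The following sections look at how self-attention and the Transformer work.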
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{Overall Structure}
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-transformer}
+\caption{ The Transformer architecture}
+\label{fig:12-39}
+\end{figure}
+%----------------------------------------------
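+\parinterval Figure~\ref{fig:12-39} shows the classic Transformer architecture. The encoder consists of several layers (the green dashed box marks one layer). The input of each layer is a sequence of vectors and the output is a sequence of vectors of the same size; the role of a Transformer layer is to abstract the input further and produce a new representation. A layer here is not a single neural network structure, however. It is composed of several different modules, including: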
+\begin{itemize}
+\vspace{0.5em}
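+\item {\small\sffamily\bfseries{Self-attention sub-layer}}\index{Self-attention Sub-layer}: uses self-attention to build a new representation of the input sequence;
+\vspace{0.5em}
+\item {\small\sffamily\bfseries{Feed-forward sub-layer}}\index{Feed-forward Sub-layer}: uses a fully connected feed-forward neural network to transform the input vector sequence further;
+\vspace{0.5em}
+\item {\small\sffamily\bfseries{Residual connection}}\index{Residual Connection} (marked ``Add''): both the self-attention sub-layer and the feed-forward sub-layer have an extra connection that goes directly from input to output, i.e., a shortcut across the sub-layer. Residual connections make information flow through deep networks more effectively;
+\vspace{0.5em}
+\item {\small\sffamily\bfseries{Layer normalization}}\index{Layer Normalization}: before the self-attention sub-layer and the feed-forward sub-layer produce their final output, the output vectors are layer-normalized to regularize the range of their values, which makes later processing easier.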
+\vspace{0.5em}
+\end{itemize}
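+\parinterval These operations make up one Transformer layer. The modules are executed in the order: Self-Attention $\to$ Residual Connection $\to$ Layer Normalization $\to$ Feed Forward Network $\to$ Residual Connection $\to$ Layer Normalization. The encoder can contain several such layers; for example, one can build a six-layer encoder in which every layer performs the operations above. The output of the top layer is the result of the whole encoding and is passed to the decoder.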
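+\parinterval The execution order above can be written compactly. As a sketch, let $\vectorn{\emph{x}}$ be the input of a layer, and let $\textrm{SelfAtt}(\cdot)$, $\textrm{FFN}(\cdot)$ and $\textrm{LN}(\cdot)$ denote the self-attention sub-layer, the feed-forward sub-layer and layer normalization (these names are notation assumed here, not defined elsewhere in this chapter):
+\begin{eqnarray}
+\vectorn{\emph{x}}' & = & \textrm{LN}\big(\vectorn{\emph{x}} + \textrm{SelfAtt}(\vectorn{\emph{x}})\big) \nonumber \\
+\vectorn{\emph{y}} & = & \textrm{LN}\big(\vectorn{\emph{x}}' + \textrm{FFN}(\vectorn{\emph{x}}')\big) \nonumber
+\end{eqnarray}
+\noindent Here the ``$+$'' inside $\textrm{LN}(\cdot)$ is the residual connection, and $\vectorn{\emph{y}}$ is the output of the layer, which becomes the input of the next layer.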
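+\parinterval The decoder is structured much like the encoder. It also consists of several layers, each of which contains all the components of an encoder layer: a self-attention sub-layer, a feed-forward sub-layer, residual connections and layer normalization. In addition, to capture source-language information, the decoder introduces an extra {\small\sffamily\bfseries{encoder-decoder attention sub-layer}}\index{Encoder-decoder Attention Sub-layer}. This sub-layer helps the model use the representation of the source sentence when generating the representation of each target position. The encoder-decoder attention sub-layer uses the same attention computation as the self-attention sub-layer, so the two have the same structure; only the definitions of $\mathrm{query}$, $\mathrm{key}$ and $\mathrm{value}$ differ. On the decoder side, the $\mathrm{query}$, $\mathrm{key}$ and $\mathrm{value}$ of the self-attention sub-layer are identical: all are the representations of the decoder positions. In the encoder-decoder attention sub-layer, the $\mathrm{query}$ is the representation of each decoder position, while the $\mathrm{key}$ and $\mathrm{value}$ are identical and equal to the representations of the encoder positions. Figure~\ref{fig:12-40} shows how the inputs of the two attention sub-layers differ.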
+%----------------------------------------------
+\begin{figure}[htp]
+    \centering
+   \input{./Chapter12/Figures/figure-self-att-vs-enco-deco-att}
+    \caption{ Inputs of the attention model (self-attention sub-layer vs. encoder-decoder attention sub-layer)}
+    \label{fig:12-40}
+\end{figure}
+%----------------------------------------------
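+\parinterval The difference between the two sub-layers can also be stated in symbols. As a sketch, write $\textrm{Att}(\vectorn{\emph{Q}},\vectorn{\emph{K}},\vectorn{\emph{V}})$ for the attention operation, $\vectorn{\emph{S}}$ for the representations of the decoder positions and $\vectorn{\emph{H}}$ for the representations of the encoder positions (all three symbols are notation assumed here for illustration):
+\begin{eqnarray}
+\textrm{self-attention (decoder)}: & & \textrm{Att}(\vectorn{\emph{S}},\vectorn{\emph{S}},\vectorn{\emph{S}}) \nonumber \\
+\textrm{encoder-decoder attention}: & & \textrm{Att}(\vectorn{\emph{S}},\vectorn{\emph{H}},\vectorn{\emph{H}}) \nonumber
+\end{eqnarray}
+\noindent In both cases the output contains one vector per decoder position, because the number of outputs is determined by the $\mathrm{query}$.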
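+\parinterval Both the encoder and the decoder take a word sequence as input. The encoder's input is there to be represented, so that the decoder can access all the information of the source sentence through the encoder. The decoder's input is used to generate the target sentence; in essence this works like a language model, which outputs the $n$-th word given the first $n-1$ words. Besides the word embeddings of the input sequence, the Transformer also introduces position embeddings that encode the position of each word. The reason is that self-attention does not represent position explicitly and therefore cannot take word order into account. Adding position information to the input lets self-attention perceive the position of each word indirectly, which keeps the sequence representation sound. Finally, the output of the whole model is produced by a Softmax layer, exactly as in the output layer of a recurrent neural network (Section~\ref{sec:10.3.2}).
+\parinterval Before going into detail, Figure~\ref{fig:12-39} can be used to get a first idea of how the Transformer translates. First, the Transformer takes the {\small\bfnew{word embeddings}}\index{Word Embedding} of the source sentence ``我\ 很\ 好'', combined with the {\small\bfnew{position encoding}}\index{Position Embedding} (position embedding), as input. The encoder then abstracts the source sentence layer by layer, producing a source representation rich in contextual information, and passes it to the decoder. Each decoder layer uses its self-attention sub-layer to process the decoder-side input representation, and then uses the encoder-decoder attention sub-layer to incorporate the representation of the source sentence. In this way the target words are generated one at a time. The input at each decoder position is the current word (e.g., ``I'') and the output at that position is the next word (e.g., ``am''); this design is exactly the same as in a standard neural language model.
+\parinterval At this point several questions probably remain. What is position encoding? How exactly is self-attention computed in the Transformer, and what does its structure look like? What is Add \& LayerNorm? These questions are answered one by one below.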
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
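+\section{Position Encoding}
+\parinterval When a recurrent neural network extracts information from a sequence, the computation at each time step depends on the output of the previous one. This gives the model an inherent notion of order, which matches the sequential nature of language. When self-attention processes the source and target sequences, by contrast, it directly relates the current position to any other position in the sequence and ignores the order of the words. For the two sentences with different meanings in Figure~\ref{fig:12-41}, for example, the representation $\tilde{h}$(机票) obtained by self-attention is the same.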
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-calculation-of-context-vector-c}
+\caption{Computing $\tilde{\vectorn{\emph{h}}}$, a more abstract representation of ``机票''}
+\label{fig:12-41}
+\end{figure}
+%----------------------------------------------
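+\parinterval To solve this problem, the Transformer adds a position encoding to the original word-vector input to represent the order of the words. Figure~\ref{fig:12-42} shows where the position encoding sits in the Transformer architecture; it is an important factor in the success of the Transformer.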
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-transformer-input-and-position-encoding}
+\caption{Transformer输入与位置编码}
+\label{fig:12-42}
+\end{figure}
+%----------------------------------------------
+\parinterval 位置编码的计算方式有很多种，Transformer使用不同频率的正余弦函数：
+\begin{eqnarray}
+\textrm{PE}(pos,2i) = \textrm{sin} (\frac{pos}{10000^{2i/d_{model}}})
+\label{eq:12-43}
+\end{eqnarray}
+\begin{eqnarray}
+\textrm{PE}(pos,2i+1) = \textrm{cos} (\frac{pos}{10000^{2i/d_{model}}})
+\label{eq:12-44}
+\end{eqnarray}
+\noindent 式中PE($\cdot$)表示位置编码的函数，$pos$表示单词的位置，$i$代表位置编码向量中的第几维，$d_{model}$是Transformer的一个基础参数，表示每个位置的隐层大小。因为，正余弦函数的编码各占一半，因此当位置编码的维度为512 时，$i$ 的范围是0-255。 在Transformer中，位置编码的维度和词嵌入向量的维度相同（均为$d_{model}$），模型通过将二者相加作为模型输入，如图\ref{fig:12-43}所示。
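+\parinterval As an illustration, Equations \ref{eq:12-43} and \ref{eq:12-44} can be sketched in a few lines of plain Python. This is a minimal sketch rather than a reference implementation: the function names \texttt{positional\_encoding} and \texttt{add\_position} are hypothetical, and $d_{model}$ is assumed to be even.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for i in range(d_model // 2):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** (2 * i / d_model))
            row[2 * i] = math.sin(angle)      # PE(pos, 2i)
            row[2 * i + 1] = math.cos(angle)  # PE(pos, 2i+1)
        table.append(row)
    return table

def add_position(word_embeddings, pe_table):
    """Element-wise sum of word embeddings and position encodings."""
    return [[w + p for w, p in zip(w_row, p_row)]
            for w_row, p_row in zip(word_embeddings, pe_table)]

pe = positional_encoding(max_len=4, d_model=8)
```

+\noindent Every entry of the table lies in $[-1,1]$ regardless of the sequence length, and with $d_{model}=512$ the inner loop runs over $i=0,\ldots,255$.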
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-a-combination-of-position-encoding-and-word-encoding}
+\caption{位置编码与词编码的组合}
+\label{fig:12-43}
+\end{figure}
+%----------------------------------------------
+\parinterval 那么为什么通过这种计算方式可以很好的表示位置信息？有几方面原因。首先，正余弦函数是具有上下界的周期函数，用正余弦函数可将长度不同的序列的位置编码的范围都固定到[-1,1]，这样在与词的编码进行相加时，不至于产生太大差距。另外位置编码的不同维度对应不同的正余弦曲线，这为多维的表示空间赋予一定意义。最后，根据三角函数的性质：
+\begin{eqnarray}
+\textrm{sin}(\alpha + \beta) &=& \textrm{sin}\alpha \cdot \textrm{cos} \beta + \textrm{cos} \alpha \cdot \textrm{sin} \beta \nonumber  \\
+\textrm{cos}(\alpha + \beta) &=&  \textrm{cos} \alpha  \cdot \textrm{cos} \beta - \textrm{sin} \alpha \cdot \textrm{sin} \beta
+\label{eq:12-45}
+\end{eqnarray}
+\parinterval 可以得到“$pos+k$”的位置编码为：
+\begin{eqnarray}
+\textrm{PE}(pos+k,2i) &=& \textrm{PE}(pos,2i) \times \textrm{PE}(k,2i+1) + \nonumber \\
+                      & & \textrm{PE}(pos,2i+1) \times \textrm{PE}(k,2i)\\
+\textrm{PE}(pos+k ,2i+1) &=& \textrm{PE}(pos,2i+1) \times \textrm{PE}(k,2i+1) - \nonumber \\
+                         & & \textrm{PE}(pos,2i) \times \textrm{PE}(k,2i)
+\label{eq:12-46}
+\end{eqnarray}
+\noindent 即对于任意固定的偏移量$k$，$\textrm{PE}(pos+k)$能被表示成$\textrm{PE}(pos)$的线性函数，换句话说，位置编码可以表示词之间的距离。在实践中发现，位置编码对Transformer系统的性能有很大影响。对其进行改进也会带来进一步的性能提升\upcite{Shaw2018SelfAttentionWR}。
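+\parinterval One way to make this linear relation explicit is to write it in matrix form. As a sketch, let $\omega_i = 1/10000^{2i/d_{model}}$, a shorthand introduced only for this illustration. Applying Equation \ref{eq:12-45} with $\alpha = pos \cdot \omega_i$ and $\beta = k \cdot \omega_i$ gives:
+\begin{eqnarray}
+\left( \begin{array}{c} \textrm{PE}(pos+k,2i) \\ \textrm{PE}(pos+k,2i+1) \end{array} \right)
+&=&
+\left( \begin{array}{cc} \textrm{cos}(k\omega_i) & \textrm{sin}(k\omega_i) \\ -\textrm{sin}(k\omega_i) & \textrm{cos}(k\omega_i) \end{array} \right)
+\left( \begin{array}{c} \textrm{PE}(pos,2i) \\ \textrm{PE}(pos,2i+1) \end{array} \right) \nonumber
+\end{eqnarray}
+\noindent The $2 \times 2$ matrix depends only on the offset $k$ and not on $pos$, so moving by a fixed offset amounts to applying the same linear map at every position.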
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+\section{基于点乘的多头注意力机制}
+\parinterval Transformer模型摒弃了循环单元和卷积等结构，完全基于注意力机制来构造模型，其中包含着大量的注意力计算。比如，可以通过自注意力机制对源语言和目标语言序列进行信息提取，并通过编码-解码注意力对双语句对之间的关系进行提取。图\ref{fig:12-44}中红色方框部分是Transformer中使用自注意力机制的模块。而这些模块都是由基于点乘的多头注意力机制实现的。
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-position-of-self-attention-mechanism-in-the-model}
+\caption{自注意力机制在模型中的位置}
+\label{fig:12-44}
+\end{figure}
+%----------------------------------------------
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{点乘注意力}
+\parinterval 在\ref{sec:12.1.3}节中已经介绍，自注意力机制中至关重要的是获取相关性系数，也就是在融合不同位置的表示向量时各位置的权重。不同于\ref{sec:12.1.3}节介绍的注意力机制的相关性系数计算方式，Transformer模型采用了一种基于点乘的方法来计算相关性系数。这种方法也称为{\small\bfnew{点乘注意力}}\index{点乘注意力}（Scaled Dot-Product Attention）\index{Scaled Dot-Product Attention}机制。它的运算并行度高，同时并不消耗太多的存储空间。
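+\parinterval Specifically, the attention computation involves three important quantities: Query, Key and Value. Below they are denoted by $\vectorn{\emph{Q}}$, $\vectorn{\emph{K}}$ and $\vectorn{\emph{V}}$, where $\vectorn{\emph{Q}}$ and $\vectorn{\emph{K}}$ have dimension $L\times d_k$ and $\vectorn{\emph{V}}$ has dimension $L\times d_v$. Here $L$ is the sequence length, and $d_k$ and $d_v$ are the sizes of each Key and each Value, usually set to $d_k=d_v=d_{model}$.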
+\parinterval 在自注意力机制中，$\vectorn{\emph{Q}}$、$\vectorn{\emph{K}}$、$\vectorn{\emph{V}}$都是相同的，对应着源语言或目标语言的表示。而在编码-解码注意力机制中，由于要对双语之间的信息进行建模，因此，将目标语每个位置的表示视为编码-解码注意力机制的$\vectorn{\emph{Q}}$，源语言句子的表示视为$\vectorn{\emph{K}}$ 和$\vectorn{\emph{V}}$。
+\parinterval 在得到$\vectorn{\emph{Q}}$，$\vectorn{\emph{K}}$和$\vectorn{\emph{V}}$后，便可以进行注意力机制的运算，这个过程可以被形式化为：
+\begin{eqnarray}
+\textrm{Attention}(\vectorn{\emph{Q}},\vectorn{\emph{K}},\vectorn{\emph{V}}) = \textrm{Softmax}
+ ( \frac{\vectorn{\emph{Q}}\vectorn{\emph{K}}^{T}} {\sqrt{d_k}} + \vectorn{\emph{Mask}} ) \vectorn{\emph{V}}
+\label{eq:12-47}
+\end{eqnarray}
+\noindent 首先，通过对$\vectorn{\emph{Q}}$和$\vectorn{\emph{K}}$的转置进行点乘操作，计算得到一个维度大小为$L \times L$的相关性矩阵，即$\vectorn{\emph{Q}}\vectorn{\emph{K}}^{T}$，它表示一个序列上任意两个位置的相关性。再通过系数1/$\sqrt{d_k}$进行放缩操作，放缩可以尽量减少相关性矩阵的方差，具体体现在运算过程中实数矩阵中的数值不会过大，有利于模型训练。
+\parinterval 在此基础上，通过对相关性矩阵累加一个掩码矩阵，来屏蔽掉矩阵中的无用信息。比如，在编码端对句子的补齐，在解码端则屏蔽掉未来信息，这一部分内容将在下一小节进行详细介绍。随后，使用Softmax函数对相关性矩阵在行的维度上进行归一化操作，这可以理解为对第$i$行进行归一化，结果对应了$\vectorn{\emph{V}}$中不同位置上向量的注意力权重。对于$\mathrm{value}$的加权求和，可以直接用相关性系数和$\vectorn{\emph{V}}$进行矩阵乘法得到，即$\textrm{Softmax}
+ ( \frac{\vectorn{\emph{Q}}\vectorn{\emph{K}}^{T}} {\sqrt{d_k}} + \vectorn{\emph{Mask}} )$和$\vectorn{\emph{V}}$进行矩阵乘。最终得到自注意力的输出，它和输入的$\vectorn{\emph{V}}$的大小是一模一样的。图\ref{fig:12-45}展示了点乘注意力计算的全过程。
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-point-product-attention-model}
+\caption{Scaled dot-product attention model}
+\label{fig:12-45}
+\end{figure}
+%----------------------------------------------
+\parinterval 下面举个简单的例子介绍点乘注意力的具体计算过程。如图\ref{fig:12-46}所示，用黄色、蓝色和橙色的矩阵分别表示$\vectorn{\emph{Q}}$、$\vectorn{\emph{K}}$和$\vectorn{\emph{V}}$。$\vectorn{\emph{Q}}$、$\vectorn{\emph{K}}$ 和$\vectorn{\emph{V}}$中的每一个小格都对应一个单词在模型中的表示（即一个向量）。首先，通过点乘、放缩、掩码等操作得到相关性矩阵，即粉色部分。其次，将得到的中间结果矩阵（粉色）的每一行使用Softmax激活函数进行归一化操作，得到最终的权重矩阵，也就是图中的红色矩阵。红色矩阵中的每一行都对应一个注意力分布。最后，按行对$\vectorn{\emph{V}}$进行加权求和，便得到了每个单词通过点乘注意力机制计算得到的表示。这里面，主要的计算消耗是两次矩阵乘法，即$\vectorn{\emph{Q}}$与$\vectorn{\emph{K}}^{T}$的乘法、相关性矩阵和$\vectorn{\emph{V}}$的乘法。这两个操作都可以在GPU上高效地完成，因此可以一次性计算出序列中所有单词之间的注意力权重，并完成所有位置表示的加权求和过程，这样大大提高了模型的计算速度。
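+\parinterval The computation in Equation \ref{eq:12-47} can be sketched in plain Python as follows. This is only an illustrative sketch: matrices are lists of rows, the helper names (\texttt{matmul}, \texttt{transpose}, \texttt{softmax}, \texttt{attention}) are hypothetical, and a real system would run the two matrix products on a GPU.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v, mask=None):
    """Softmax(Q K^T / sqrt(d_k) + Mask) V for L x d_k and L x d_v inputs."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))              # L x L relevance matrix
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]
    if mask is not None:                          # additive mask: 0 or a large negative value
        scores = [[s + m for s, m in zip(s_row, m_row)]
                  for s_row, m_row in zip(scores, mask)]
    weights = [softmax(row) for row in scores]    # each row is one attention distribution
    return matmul(weights, v), weights

# Toy self-attention: Q, K and V are the same 3 x 2 matrix.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = attention(x, x, x)
```

+\noindent The output has the same shape as $\vectorn{\emph{V}}$, and each row of \texttt{weights} sums to 1.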
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-process-of-5}
+\caption{式\ref{eq:12-47}的执行过程示例}
+\label{fig:12-46}
+\end{figure}
+%----------------------------------------------
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{多头注意力}
+\parinterval Transformer中使用的另一项重要技术是{\small\sffamily\bfseries{多头注意力}}\index{多头注意力}（Multi-head Attention）\index{Multi-head Attention}。“多头”可以理解成将原来的$\vectorn{\emph{Q}}$、$\vectorn{\emph{K}}$、$\vectorn{\emph{V}}$按照隐层维度平均切分成多份。假设切分$h$份，那么最终会得到$\vectorn{\emph{Q}} = \{ \vectorn{\emph{q}}_1, \vectorn{\emph{q}}_2,...,\vectorn{\emph{q}}_h \}$，$\vectorn{\emph{K}}=\{ \vectorn{\emph{k}}_1,\vectorn{\emph{k}}_2,...,\vectorn{\emph{k}}_h \}$，$\vectorn{\emph{V}}=\{ \vectorn{\emph{v}}_1, \vectorn{\emph{v}}_2,...,\vectorn{\emph{v}}_h \}$。多头注意力机制就是用每一个切分得到的$\vectorn{\emph{q}}$，$\vectorn{\emph{k}}$，$\vectorn{\emph{v}}$独立的进行注意力计算。即第$i$个头的注意力计算结果$\vectorn{\emph{head}}_i = \textrm{Attention}(\vectorn{\emph{q}}_i,\vectorn{\emph{k}}_i, \vectorn{\emph{v}}_i)$。
+\parinterval 下面根据图\ref{fig:12-48}详细介绍多头注意力的计算过程：
+\begin{itemize}
+\vspace{0.5em}
+\item	首先将$\vectorn{\emph{Q}}$、$\vectorn{\emph{K}}$、$\vectorn{\emph{V}}$分别通过线性变换的方式映射为$h$个子集（机器翻译任务中，$h$一般为8）。即$\vectorn{\emph{q}}_i = \vectorn{\emph{Q}}\vectorn{\emph{W}}_i^Q $、$\vectorn{\emph{k}}_i = \vectorn{\emph{K}}\vectorn{\emph{W}}_i^K $、$\vectorn{\emph{v}}_i = \vectorn{\emph{V}}\vectorn{\emph{W}}_i^V $，其中$i$表示第$i$个头， $\vectorn{\emph{W}}_i^Q  \in \mathbb{R}^{d_{model} \times d_k}$,  $\vectorn{\emph{W}}_i^K  \in \mathbb{R}^{d_{model} \times d_k}$,  $\vectorn{\emph{W}}_i^V  \in \mathbb{R}^{d_{model} \times d_v}$是参数矩阵; $d_k=d_v=d_{model} / h$，对于不同的头采用不同的变换矩阵，这里$d_{model}$是Transformer的一个参数，表示每个隐层向量的维度；
+\vspace{0.5em}
+\item 其次对每个头分别执行点乘注意力操作，并得到每个头的注意力操作的输出$\vectorn{\emph{head}}_i$；
+\vspace{0.5em}
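+\item Finally, the attention outputs of the $h$ heads are concatenated (Concat) along the last dimension $d_v$, which again gives an output of size $h d_v$ at each position. This output is then right-multiplied by a weight matrix $\vectorn{\emph{W}}^o$, a linear transformation that fuses the information computed by the different heads and maps the multi-head attention output to the hidden size of the model (i.e., $d_{model}$). Here the parameter matrix is $\vectorn{\emph{W}}^o \in \mathbb{R}^{h d_v \times d_{model}}$.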
+\vspace{0.5em}
+\end{itemize}
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-multi-head-attention-model}
+\caption{多头注意力模型}
+\label{fig:12-48}
+\end{figure}
+%----------------------------------------------
+\parinterval 多头机制具体的计算公式如下：
+\begin{eqnarray}
+\textrm{MultiHead}(\vectorn{\emph{Q}}, \vectorn{\emph{K}} , \vectorn{\emph{V}})& = & \textrm{Concat} (\vectorn{\emph{head}}_1, ... , \vectorn{\emph{head}}_h ) \vectorn{\emph{W}}^o \label{eq:12-48} \\
+\vectorn{\emph{head}}_i & = &\textrm{Attention} (\vectorn{\emph{Q}}\vectorn{\emph{W}}_i^Q , \vectorn{\emph{K}}\vectorn{\emph{W}}_i^K  , \vectorn{\emph{V}}\vectorn{\emph{W}}_i^V )
+\label{eq:12-49}
+\end{eqnarray}
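+\parinterval The benefit of the multi-head mechanism is that it lets the model learn in different representation subspaces. Many experiments have found that heads in different subspaces capture different information. For example, when a Transformer processes natural language, some heads capture syntactic information while others capture lexical information.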
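+\parinterval Equations \ref{eq:12-48} and \ref{eq:12-49} can be sketched in plain Python as below. The sketch rests on several assumptions: the parameter matrices are filled with random numbers as stand-ins for learned weights, no mask is applied, and all helper names are hypothetical.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention without a mask."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul([softmax(row) for row in scores], v)

def random_matrix(rows, cols, rng):
    """Stand-in for a learned parameter matrix (hypothetical initialization)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head(q, k, v, w_q, w_k, w_v, w_o):
    """Concat(head_1, ..., head_h) W^o, head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    heads = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    # Concatenate along the last dimension: L x (h * d_v).
    concat = [sum((head[pos] for head in heads), []) for pos in range(len(q))]
    return matmul(concat, w_o)                   # L x d_model

rng = random.Random(0)
L, d_model, h = 3, 8, 2
d_k = d_v = d_model // h
x = random_matrix(L, d_model, rng)               # self-attention: Q = K = V = x
w_q = [random_matrix(d_model, d_k, rng) for _ in range(h)]
w_k = [random_matrix(d_model, d_k, rng) for _ in range(h)]
w_v = [random_matrix(d_model, d_v, rng) for _ in range(h)]
w_o = random_matrix(h * d_v, d_model, rng)
y = multi_head(x, x, x, w_q, w_k, w_v, w_o)
```

+\noindent Each head works on vectors of size $d_{model}/h$, so the total cost stays close to that of a single head operating on the full $d_{model}$ dimensions.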
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{掩码操作}
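+\parinterval Equation \ref{eq:12-47} includes a Mask term. Its purpose is to mask out certain values so that irrelevant positions do not affect the computation. In the Transformer, the Mask is mainly applied when computing the relevance coefficients in the attention mechanism: a Mask matrix is added to the relevance matrix. This matrix takes the value negative infinity, -inf, at the positions to be masked (in practice, a very large negative number such as -1e9) and 0 elsewhere. After Softmax normalization, the weights at the masked positions are therefore approximately 0. In other words, useless information receives a weight of 0 and cannot influence the result. The Transformer uses two kinds of Mask: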
+\begin{itemize}
+\vspace{0.5em}
+\item Padding Mask。在批量处理多个样本时（训练或解码），由于要对源语言和目标语言的输入进行批次化处理，而每个批次内序列的长度不一样，为了方便对批次内序列进行矩阵表示，需要进行对齐操作，即在较短的序列后面填充0来占位（padding操作）。而这些填充的位置没有意义，不参与注意力机制的计算，因此，需要进行Mask操作，屏蔽其影响。
+\vspace{0.5em}
+\item Future Mask。对于解码器来说，由于在预测的时候是自左向右进行的，即第$t$时刻解码器的输出只能依赖于$t$时刻之前的输出。且为了保证训练解码一致，避免在训练过程中观测到目标语端每个位置未来的信息，因此需要对未来信息进行屏蔽。具体的做法是：构造一个上三角值全为-inf的Mask矩阵，也就是说，在解码端计算中，在当前位置，通过Future Mask把序列之后的信息屏蔽掉了，避免了$t$时刻之后的位置对当前的计算产生影响。图\ref{fig:12-47}给出了一个具体的实例。
+%----------------------------------------------
+% 图3.10
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-mask-instance-for-future-positions-in-transformer}
+\caption{Transformer中对于未来位置进行的屏蔽的Mask实例}
+\label{fig:12-47}
+\end{figure}
+%----------------------------------------------
+\vspace{0.5em}
+\end{itemize}
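+\parinterval The two kinds of Mask described above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the masks are additive, a large negative constant stands in for -inf, and the function names are hypothetical.

```python
NEG_INF = -1e9  # large negative stand-in for -inf (an assumption of this sketch)

def padding_mask(lengths, max_len):
    """One max_len x max_len additive mask per sentence; padded key positions get NEG_INF."""
    masks = []
    for n in lengths:
        row = [0.0 if j < n else NEG_INF for j in range(max_len)]
        masks.append([list(row) for _ in range(max_len)])
    return masks

def future_mask(length):
    """Strict upper triangle is NEG_INF, so position t only sees positions <= t."""
    return [[0.0 if j <= i else NEG_INF for j in range(length)]
            for i in range(length)]

def combine(mask_a, mask_b):
    """A position is blocked if either mask blocks it."""
    return [[min(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]

# A batch of two sentences with lengths 3 and 2, padded to length 3.
pad = padding_mask(lengths=[3, 2], max_len=3)
fut = future_mask(3)
decoder_mask = combine(pad[1], fut)   # decoder self-attention for the shorter sentence
```

+\noindent On the decoder side both masks apply at once, which is why the sketch combines them by taking the element-wise minimum.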
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+\section{残差网络和层正则化}
+\parinterval Transformer编码器、解码器分别由多层网络组成（通常为6层），每层网络又包含多个子层（自注意力网络、前馈神经网络）。因此Transformer实际上是一个很深的网络结构。再加上前面介绍的点乘注意力机制，包含很多线性和非线性变换；另外，注意力函数Attention($\cdot$)的计算也涉及多层网络，整个网络的信息传递非常复杂。从反向传播的角度来看，每次回传的梯度都会经过若干步骤，容易产生梯度爆炸或者消失。
+\parinterval 解决这个问题的一种办法就是使用{\small\sffamily\bfseries{残差连接}}\index{残差连接}\upcite{DBLP:journals/corr/HeZRS15}。残差连接是一种用来训练深层网络的技术，其结构如图\ref{fig:12-49}，即在子层之前通过增加直接连接的方式，将底层信息直接传递给上层。
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-residual-network-structure}
+\caption{残差网络结构}
+\label{fig:12-49}
+\end{figure}
+%----------------------------------------------
+\parinterval 残差连接从广义上讲也叫{\small\bfnew{短连接}}\index{短连接}（Short-cut Connection）\index{Short-cut Connection}，指的是这种短距离的连接。它的思想很简单，就是把层和层之间的距离拉近。如图\ref{fig:12-49}所示，子层1通过残差连接跳过了子层2，直接和子层3进行信息传递。使信息传递变得更高效，有效解决了深层网络训练过程中容易出现的梯度消失/爆炸问题，使得深层网络的训练更加容易。其计算公式为：
+\begin{eqnarray}
+x_{l+1} = x_l + \mathcal{F} (x_l)
+\label{eq:12-50}
+\end{eqnarray}
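+\noindent where $x_l$ is the input to layer $l$ and $\mathcal{F} (x_l)$ is the sublayer operation. If $l=2$, Equation \ref{eq:12-50} says that the input to layer 3 equals the output of layer 2 plus the input to layer 2. The red boxes in Figure \ref{fig:12-50} show where the residual connections are located in the Transformer.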
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-position-of-difference-and-layer-regularization-in-the-model}
+\caption{残差和层正则化在模型中的位置}
+\label{fig:12-50}
+\end{figure}
+%----------------------------------------------
+\parinterval 在Transformer的训练过程中，由于引入了残差操作，将前面所有层的输出加到一起。这样会导致不同层（或子层）的结果之间的差异性很大，造成训练过程不稳定、训练时间较长。为了避免这种情况，在每层中加入了层正则化操作\upcite{Ba2016LayerN}。层正则化的计算公式如下：
+\begin{eqnarray}
+\textrm{LN}(x) = g \cdot \frac{x- \mu} {\sigma} + b
+\label{eq:12-51}
+\end{eqnarray}
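+\noindent This formula shifts and scales the sample using the mean $\mu$ and the standard deviation $\sigma$, normalizing the data to a distribution with mean 0 and variance 1. $g$ and $b$ are learnable parameters.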
+\parinterval 在Transformer中经常使用的层正则化操作有两种结构，分别是{\small\bfnew{后正则化}}\index{后正则化}（Post-norm）\index{Post-norm}和{\small\bfnew{前正则化}}\index{前正则化}（Pre-norm）\index{Pre-norm}，结构如图\ref{fig:12-51}所示。后正则化中先进行残差连接再进行层正则化，而前正则化则是在子层输入之前进行层正则化操作。在很多实践中已经发现，前正则化的方式更有利于信息传递，因此适合训练深层的Transformer模型\upcite{WangLearning}。
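+\parinterval The difference between the two structures can be sketched in plain Python. The sketch makes several simplifying assumptions: $g$ and $b$ are scalars rather than learned vectors, a small constant \texttt{eps} (not part of Equation \ref{eq:12-51}) guards against division by zero, and \texttt{toy\_sublayer} is a hypothetical stand-in for self-attention or the feed-forward network.

```python
import math

def layer_norm(x, g=1.0, b=0.0, eps=1e-6):
    """LN(x) = g * (x - mu) / sigma + b over one hidden vector."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    sigma = math.sqrt(var + eps)          # eps is an assumption of this sketch
    return [g * (v - mu) / sigma + b for v in x]

def post_norm(x, sublayer):
    """Post-norm: residual connection first, then layer normalization."""
    return layer_norm([a + f for a, f in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    """Pre-norm: normalize the sublayer input, then add the residual."""
    return [a + f for a, f in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    """Hypothetical stand-in for a real sublayer."""
    return [2.0 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
y_post = post_norm(x, toy_sublayer)
y_pre = pre_norm(x, toy_sublayer)
```

+\noindent In the pre-norm variant the input $x$ reaches the output without passing through the normalization, which is one way to see why it eases information flow in deep stacks.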
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-different-regularization-methods}
+\caption{不同正则化方式 }
+\label{fig:12-51}
+\end{figure}
+%----------------------------------------------
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+\section{前馈全连接网络子层}
+\parinterval 在Transformer的结构中，每一个编码层或者解码层中都包含一个前馈神经网络，它在模型中的位置如图\ref{fig:12-52}中红色方框所示。
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-position-of-feedforward-neural-network-in-the-model}
+\caption{前馈神经网络在模型中的位置}
+\label{fig:12-52}
+\end{figure}
+%----------------------------------------------
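+\parinterval The Transformer uses a fully connected network. Its main role is to map the representation produced by the attention operation into a new space that is better suited to the subsequent nonlinear transformations. Experiments show that removing the fully connected network hurts model performance. The Transformer's fully connected feed-forward network consists of two linear transformations and one nonlinear transformation (the ReLU activation function: ReLU$(x)=\textrm{max}(0,x)$). The feed-forward parameters are not shared across layers. The network is computed as: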
+\begin{eqnarray}
+\textrm{FFN}(\vectorn{\emph{x}}) = \textrm{max} (0,\vectorn{\emph{x}}\vectorn{\emph{W}}_1 + \vectorn{\emph{b}}_1)\vectorn{\emph{W}}_2 + \vectorn{\emph{b}}_2
+\label{eq:12-52}
+\end{eqnarray}
+\noindent 其中，$\vectorn{\emph{W}}_1$、$\vectorn{\emph{W}}_2$、$\vectorn{\emph{b}}_1$和$\vectorn{\emph{b}}_2$为模型的参数。通常情况下，前馈神经网络的隐层维度要比注意力部分的隐层维度大，而且研究人员发现这种设置对Transformer是至关重要的。 比如，注意力部分的隐层维度为512，前馈神经网络部分的隐层维度为2048。当然，继续增大前馈神经网络的隐层大小，比如设为4096，甚至8192，还可以带来性能的增益，但是前馈部分的存储消耗较大，需要更大规模GPU 设备的支持。因此在具体实现时，往往需要在翻译准确性和存储/速度之间找到一个平衡。
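+\parinterval Equation \ref{eq:12-52} can be sketched in plain Python with toy sizes. The weights below are made-up numbers chosen so that the result is easy to check by hand, and the helper names are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position (row) of x."""
    hidden = [[max(0.0, h + bias) for h, bias in zip(row, b1)]
              for row in matmul(x, w1)]          # L x d_ff, after ReLU
    return [[o + bias for o, bias in zip(row, b2)]
            for row in matmul(hidden, w2)]       # L x d_model

# Toy sizes: d_model = 2 and d_ff = 3 (the text uses 512 and 2048).
x = [[1.0, -1.0]]
w1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.5, 0.5]
y = ffn(x, w1, b1, w2, b2)   # [[1.5, 0.5]]
```

+\noindent The first product gives $(1,-1,-2)$, ReLU keeps only the first component, and the second transformation plus bias yields $(1.5, 0.5)$.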
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+\section{训练}
+\parinterval 与前面介绍的神经机器翻译模型的训练一样，Transformer的训练流程为：首先对模型进行初始化，然后在编码器输入包含结束符的源语言单词序列。前面已经介绍过，解码端每个位置单词的预测都要依赖已经生成的序列。在解码端输入包含起始符号的目标语序列，通过起始符号预测目标语的第一个单词，用真实的目标语的第一个单词去预测第二个单词，以此类推，然后用真实的目标语序列和预测的结果比较，计算它的损失。Transformer使用了{\small\bfnew{交叉熵损失}}\index{交叉熵损失}（Cross Entropy Loss）\index{Cross Entropy Loss}函数，损失越小说明模型的预测越接近真实输出。然后利用反向传播来调整模型中的参数。由于Transformer 将任意时刻输入信息之间的距离拉近为1，摒弃了RNN中每一个时刻的计算都要基于前一时刻的计算这种具有时序性的训练方式，因此Transformer中训练的不同位置可以并行化训练，大大提高了训练效率。
+%----------------------------------------------
+%\begin{figure}[htp]
+%\centering
+%\input{./Chapter12/Figures/figure-structure-of-the-network-during-transformer-training}
+%\caption{Transformer训练时网络的结构}
+%\label{fig:12-53}
+%\end{figure}
+%----------------------------------------------
+\parinterval 需要注意的是，Transformer也包含很多工程方面的技巧。首先，在训练优化器方面，需要注意以下几点：
+\begin{itemize}
+\vspace{0.5em}
+\item	Transformer使用Adam优化器优化参数，并设置$\beta_1=0.9$，$\beta_2=0.98$，$\epsilon=10^{-9}$。
+\item Transformer在学习率中同样应用了学习率{\small\bfnew{预热}}\index{预热}（Warmup）\index{Warmup}策略，其计算公式如下：
+\begin{eqnarray}
+lrate = d_{model}^{-0.5} \cdot \textrm{min} (step^{-0.5} , step \cdot warmup\_steps^{-1.5})
+\label{eq:12-53}
+\end{eqnarray}
+\vspace{0.5em}
+其中，$step$表示更新的次数（或步数）。通常设置网络更新的前4000步为预热阶段即$warmup\_steps=4000$。Transformer的学习率曲线如图\ref{fig:12-54}所示。在训练初期，学习率从一个较小的初始值逐渐增大（线性增长），当到达一定的步数，学习率再逐渐减小。这样做可以减缓在训练初期的不稳定现象，同时在模型达到相对稳定之后，通过逐渐减小的学习率让模型进行更细致的调整。这种学习率的调整方法是Transformer的一个很大的工程贡献。
+\vspace{0.5em}
+\end{itemize}
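+\parinterval The learning rate schedule of Equation \ref{eq:12-53} fits in a single Python function. This is a minimal sketch; the function name \texttt{transformer\_lrate} is hypothetical, and the defaults $d_{model}=512$ and $warmup\_steps=4000$ follow the settings mentioned in the text.

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), for step >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Linear growth during warmup, then decay proportional to step^-0.5.
for step in (1, 1000, 4000, 16000):
    print(step, transformer_lrate(step))
```

+\noindent With these defaults the two branches of the minimum meet at step 4000, where the learning rate peaks at roughly $7 \times 10^{-4}$.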
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-lrate-of-transformer}
+\caption{Transformer模型的学习率曲线}
+\label{fig:12-54}
+\end{figure}
+%----------------------------------------------
+\parinterval 另外，Transformer为了提高模型训练的效率和性能，还进行了以下几方面的操作：
+\begin{itemize}
+\vspace{0.5em}
+\item {\small\bfnew{小批量训练}}\index{小批量训练}（Mini-batch Training）\index{Mini-batch Training}：每次使用一定数量的样本进行训练，即每次从样本中选择一小部分数据进行训练。这种方法的收敛较快，同时易于提高设备的利用率。批次大小通常设置为2048/4096（token数即每个批次中的单词个数）。每一个批次中的句子并不是随机选择的，模型通常会根据句子长度进行排序，选取长度相近的句子组成一个批次。这样做可以减少padding数量，提高训练效率，如图\ref{fig:12-55}。
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-comparison-of-the-number-of-padding-in-batch}
+\caption{batch中padding数量对比（白色部分为padding）}
+\label{fig:12-55}
+\end{figure}
+%----------------------------------------------
+\vspace{0.5em}
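+\item {\small\bfnew{Dropout}}\index{Dropout}: because the Transformer network is fairly complex, it may fit the training data too closely and then predict poorly on unseen data. This phenomenon is called {\small\sffamily\bfseries{overfitting}}\index{过拟合} (Over Fitting)\index{Over Fitting}. To avoid it, the Transformer adds Dropout\upcite{JMLR:v15:srivastava14a}. Dropout is used in four places: word embedding and position encoding, residual connections, the attention operation, and the feed-forward network. The Dropout rate is usually set to $0.1$.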
+\vspace{0.5em}
+\item {\small\bfnew{标签平滑}}\index{标签平滑}（Label Smoothing）\index{Label Smoothing}：在计算损失的过程中，需要用预测概率去拟合真实概率。在分类任务中，往往使用One-hot向量代表真实概率，即真实答案位置那一维对应的概率为1，其余维为0，而拟合这种概率分布会造成两个问题：1)无法保证模型的泛化能力，容易造成过拟合；2) 1和0概率鼓励所属类别和其他类别之间的差距尽可能加大，会造成模型过于相信预测的类别。因此Transformer里引入标签平滑\upcite{Szegedy_2016_CVPR}来缓解这种现象，简单的说就是给正确答案以外的类别分配一定的概率，而不是采用非0即1的概率。这样，可以学习一个比较平滑的概率分布，从而提升泛化能力。
+\vspace{0.5em}
+\end{itemize}
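+\parinterval As an illustration of the last item, label smoothing can be sketched in plain Python. The details are assumptions of this sketch rather than the exact recipe of any particular system: the smoothing mass \texttt{alpha} is set to 0.1 and is spread uniformly over the incorrect classes, and the function names are hypothetical.

```python
import math

def smooth_labels(target, vocab_size, alpha=0.1):
    """Move probability mass alpha from the gold class to the other classes, uniformly."""
    off_value = alpha / (vocab_size - 1)
    return [1.0 - alpha if i == target else off_value for i in range(vocab_size)]

def cross_entropy(reference, predicted):
    """-sum_i reference_i * log(predicted_i); zero-probability reference entries are skipped."""
    return -sum(r * math.log(p) for r, p in zip(reference, predicted) if r > 0.0)

predicted = [0.7, 0.2, 0.1]          # model distribution over a 3-word vocabulary
one_hot = [1.0, 0.0, 0.0]
smoothed = smooth_labels(target=0, vocab_size=3, alpha=0.1)
loss_hard = cross_entropy(one_hot, predicted)
loss_smooth = cross_entropy(smoothed, predicted)
```

+\noindent With the smoothed reference, the loss also depends on the probabilities assigned to the incorrect classes, so the model is no longer rewarded for pushing them all the way to 0.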
+\parinterval 不同的Transformer可以适应不同的任务，常见的Transformer模型有Transformer Base、Transformer Big和Transformer Deep\upcite{NIPS2017_7181,WangLearning}，具体设置如下：
+\begin{itemize}
+\vspace{0.5em}
+\item  Transformer Base：标准的Transformer结构，解码器编码器均包含6层，隐层维度为512，前馈神经网络维度为2048，多头注意力机制为8头，Dropout设为0.1。
+\vspace{0.5em}
+\item  Transformer Big：为了提升网络的容量，使用更宽的网络。在Base的基础上增大隐层维度至1024，前馈神经网络的维度变为4096，多头注意力机制为16头，Dropout设为0.3。
+\vspace{0.5em}
+\item Transformer Deep：加深编码器网络层数可以进一步提升网络的性能，它的参数设置与Transformer Base基本一致，但是层数增加到48层，同时使用Pre-Norm作为层正则化的结构。
+\vspace{0.5em}
+\end{itemize}
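+\parinterval Table \ref{tab:12-13} compares the three models on the WMT'16 data. Transformer Base scores lower in BLEU than the other two models, but it has the fewest parameters. Transformer Deep performs better overall than Transformer Big.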
+%----------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{三种Transformer模型的对比}
+\label{tab:12-13}
+\begin{tabular}{l | l l l}
+\multirow{2}{*}{系统}   & \multicolumn{2}{c}{BLEU[\%]} & \# of \\
+                      & EN-DE  & EN-FR  &    params                              \\ \hline
+Transformer Base      & 27.3            & 38.1            & 65$\times 10^{6}$                \\
+Transformer Big       & 28.4            & 41.8            & 213$\times 10^{6}$               \\
+Transformer Deep(48层) & 30.2            & 43.1            & 194$\times 10^{6}$              \\
+\end{tabular}
+\end{table}
+%----------------------------------------------
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+\section{推断}
+\parinterval Transformer解码器生成目标语的过程和前面介绍的循环网络翻译模型类似，都是从左往右生成，且下一个单词的预测依赖已经生成的上一个单词。其具体推断过程如图\ref{fig:12-56}所示，其中$\vectorn{\emph{C}}_i$是编码-解码注意力的结果，解码器首先根据“<eos>”和$\vectorn{\emph{C}}_1$生成第一个单词“how”，然后根据“how”和$\vectorn{\emph{C}}_2$生成第二个单词“are”，以此类推，当解码器生成“<eos>”时结束推断。
+\parinterval 但是，Transformer在推断阶段无法对所有位置进行并行化操作，因为对于每一个目标语单词都需要对前面所有单词进行注意力操作，因此它推断速度非常慢。可以采用的加速手段有：低精度\upcite{DBLP:journals/corr/CourbariauxB16}、Cache（缓存需要重复计算的变量）\upcite{DBLP:journals/corr/abs-1805-00631}、共享注意力网络等\upcite{Xiao2019SharingAW}。关于模型的推断技术将会在{\chapterfourteen}进一步深入介绍。
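+\parinterval The left-to-right inference loop described above can be sketched as follows. This is only a sketch of greedy decoding under stated assumptions: \texttt{predict\_next} is a hypothetical callback standing in for a full decoder pass, \texttt{toy\_model} simply replays a fixed translation, and the start symbol is a parameter that defaults to the ``<eos>'' used in the text.

```python
def greedy_decode(predict_next, start="<eos>", end="<eos>", max_len=20):
    """Generate one word at a time; every step conditions on all words produced so far."""
    prefix = [start]
    while len(prefix) <= max_len:
        word = predict_next(prefix)   # hypothetical model call: prefix -> most probable next word
        if word == end:
            break
        prefix.append(word)
    return prefix[1:]                 # drop the start symbol

def toy_model(prefix):
    """Hypothetical stand-in for the decoder: replays a fixed translation."""
    reference = ["how", "are", "you", "<eos>"]
    return reference[len(prefix) - 1]

print(greedy_decode(toy_model))       # prints ['how', 'are', 'you']
```

+\noindent Because each call receives the whole prefix, the steps cannot run in parallel. Caching the intermediate results of earlier positions avoids recomputing them at every step.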
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter12/Figures/figure-decode-of-transformer}
+\caption{Transformer推断过程示例}
+\label{fig:12-56}
+\end{figure}
+%----------------------------------------------
+%----------------------------------------------------------------------------------------
+%    NEW SECTION  12.3
+%----------------------------------------------------------------------------------------
+\section{小结及深入阅读}
-\section{}
+\parinterval
--- a/Chapter2/Figures/figure-self-information-function.tex
+++ b/Chapter2/Figures/figure-self-information-function.tex
@@ -14,7 +14,7 @@
  domain=0.01:1,
  enlarge x limits=true,
  enlarge y limits={upper},
-  legend style={draw=none},
+  legend style={draw=none,thick},
  xmin=0,
  xmax=1,
  ymin=0,

--- a/Chapter2/chapter2.tex
+++ b/Chapter2/chapter2.tex
@@ -51,7 +51,7 @@
 \parinterval {\small\bfnew{概率}}\index{概率}（Probability）\index{Probability}是度量随机事件呈现其每个可能状态的可能性的数值，本质上它是一个测度函数\upcite{mao-prob-book-2011,kolmogorov2018foundations}。概率的大小表征了随机事件在一次试验中发生的可能性大小。用$\funp{P}(\cdot )$表示一个随机事件的可能性，即事件发生的概率。比如$\funp{P}(\textrm{太阳从东方升起})$表示“太阳从东方升起”的可能性，同理，$\funp{P}(A=B)$ 表示的就是“$A=B$”这件事的可能性。
-\parinterval 在实际问题中，往往需要得到随机变量的概率值。但是，真实的概率值可能是无法准确知道的，这时就需要对概率进行{\small\sffamily\bfseries{估计}}\index{估计}，得到的结果是概率的{\small\sffamily\bfseries{估计值}}\index{估计值}（Estimate）\index{Estimate}。概率值的估计是概率论和统计学中的经典问题，有十分多样的方法可以选择。比如，一个很简单的方法是利用相对频次作为概率的估计值。如果$\{x_1,x_2,\dots,x_n \}$ 是一个试验的样本空间，在相同情况下重复试验$N$次，观察到样本$x_i (1\leq{i}\leq{n})$的次数为$n (x_i )$，那么$x_i$在这$N$次试验中的相对频率是$\frac{n(x_i )}{N}$。 当$N$越来越大时，相对概率也就越来越接近真实概率$\funp{P}(x_i)$，即$\lim_{N \to \infty}\frac{n(x_i )}{N}=\funp{P}(x_i)$。 实际上，很多概率模型都等同于相对频次估计，比如，对于一个服从多项式分布的变量的极大似然估计就可以用相对频次估计实现。
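+\parinterval In practical problems, one often needs the probability values of a random variable. The true probability, however, may be impossible to know exactly, in which case it has to be {\small\sffamily\bfseries{estimated}}\index{估计}. The result is an {\small\sffamily\bfseries{estimate}}\index{估计值} (Estimate)\index{Estimate} of the probability. Probability estimation is a classic problem in probability theory and statistics, and many methods are available. A very simple one is to use the relative frequency as the estimate. If $\{x_1,x_2,\dots,x_n \}$ is the sample space of an experiment, and in $N$ repetitions under identical conditions the sample $x_i (1\leq{i}\leq{n})$ is observed $n (x_i )$ times, then the relative frequency of $x_i$ in these $N$ trials is $\frac{n(x_i )}{N}$. As $N$ grows, the relative frequency gets closer and closer to the true probability $\funp{P}(x_i)$, i.e., $\lim_{N \to \infty}\frac{n(x_i )}{N}=\funp{P}(x_i)$. In fact, many probabilistic models are equivalent to relative frequency estimation. For example, the maximum likelihood estimate of a multinomially distributed variable can be obtained by relative frequency estimation.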
 \parinterval 概率函数是用函数形式给出离散变量每个取值发生的概率，其实就是将变量的概率分布转化为数学表达形式。如果把$A$看做一个离散变量，$a$看做变量$A$的一个取值，那么$\funp{P}(A)$被称作变量$A$的概率函数，$\funp{P}(A=a)$被称作$A = a$的概率值，简记为$\funp{P}(a)$。例如，在相同条件下掷一个骰子50次，用$A$表示投骰子出现的点数这个离散变量，$a_i$表示点数的取值，$\funp{P}_i$表示$A=a_i$的概率值。表\ref{tab:2-1}为$A$的概率分布，给出了$A$的所有取值及其概率。
@@ -68,7 +68,7 @@
 \end{table}
 %--------------------------------------------------------------------
-\parinterval 除此之外，概率函数$\funp{P}(\cdot)$还具有非负性、归一性等特点。非负性是指，所有的概率函数$\funp{P}(\cdot)$都必须是大于等于0的数值，概率函数中不可能出现负数，即$\forall{x},\funp{P}{(x)}\geq{0}$。归一性，又称规范性，简单的说就是所有可能发生的事件的概率总和为1，即$\sum_{x}\funp{P}{(x)}={1}$。
+\parinterval 除此之外，概率函数$\funp{P}(\cdot)$还具有非负性、归一性等特点。非负性是指，所有的概率函数$\funp{P}(\cdot)$的数值都必须大于等于0，概率函数中不可能出现负数，即$\forall{x},\funp{P}{(x)}\geq{0}$。归一性，又称规范性，简单来说就是所有可能发生的事件的概率总和为1，即$\sum_{x}\funp{P}{(x)}={1}$。
 \parinterval 对于离散变量$A$，$\funp{P}(A=a)$是个确定的值，可以表示事件$A=a$的可能性大小；而对于连续变量，求在某个定点处的概率是无意义的，只能求其落在某个取值区间内的概率。因此，用{\small\sffamily\bfseries{概率分布函数}}\index{概率分布函数}$F(x)$和{\small\sffamily\bfseries{概率密度函数}}\index{概率密度函数}$f(x)$来统一描述随机变量取值的分布情况（如图\ref{fig:2-1}）。概率分布函数$F(x)$表示取值小于等于某个值的概率，是概率的累加（或积分）形式。假设$A$是一个随机变量，$a$是任意实数，将函数$F(a)=\funp{P}\{A\leq a\}$定义为$A$的分布函数。通过分布函数，可以清晰地表示任何随机变量的概率分布情况。
@@ -148,17 +148,16 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \label{eq:2-5}
 \end{eqnarray}
-\parinterval 推广到$n$个事件，可以得到了{\small\bfnew{链式法则}}\index{链式法则}（Chain Rule\index{Chain Rule}）的公式：
+\parinterval 推广到$n$个事件，可以得到{\small\bfnew{链式法则}}\index{链式法则}（Chain Rule\index{Chain Rule}）的公式：
 \begin{eqnarray}
-\funp{P}(x_1,x_2, \ldots ,x_n)=\funp{P}(x_1) \prod_{i=2}^n \funp{P}(x_i \mid x_1,x_2, \ldots ,x_{i-1})
+\funp{P}(x_1,x_2, \ldots ,x_n)=\funp{P}(x_1) \prod_{i=2}^n \funp{P}(x_i \mid x_1, \ldots ,x_{i-1})
 \label{eq:2-6}
 \end{eqnarray}
-\parinterval 链式法则经常被用于对事件序列的建模。比如，事件A依赖于事件B，事件B依赖于事件C，应用链式法有：
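+\parinterval The chain rule is often used to model sequences of events. For example, when event $A$ and event $C$ are conditionally independent given event $B$, the joint probability of $A$, $B$ and $C$ can be written as: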
 \begin{eqnarray}
-\funp{P}(A,B,C) & = & \funp{P}(A \mid B,C)\funp{P}(B \mid C)\funp{P}(C) \nonumber \\
+\funp{P}(A,B,C) & = & \funp{P}(A)\funp{P}(B \mid A)\funp{P}(C \mid A,B) \nonumber \\
-                & = & \funp{P}(A \mid B)\funp{P}(B \mid C)\funp{P}(C)
+                & = & \funp{P}(A)\funp{P}(B \mid A)\funp{P}(C \mid B)
 \label{eq:chain-rule-example}
 \end{eqnarray}
@@ -222,7 +221,7 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 %    NEW SUBSUB-SECTION
 %----------------------------------------------------------------------------------------
-\subsubsection{1.信息熵}
+\subsubsection{1. 信息熵}
 \parinterval {\small\sffamily\bfseries{熵}}\index{熵}（Entropy）\index{Entropy}是热力学中的一个概念，同时也是对系统无序性的一种度量标准。在自然语言处理领域也会使用到信息熵这一概念，比如描述文字的信息量大小。一条信息的信息量可以被看作是这条信息的不确定性。如果需要确认一件非常不确定甚至于一无所知的事情，那么需要理解大量的相关信息才能进行确认；同样的，如果对某件事已经非常确定，那么就不需要太多的信息就可以把它搞清楚。如下就是两个例子，
@@ -259,15 +258,15 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \label{eq:2-14}
 \end{eqnarray}
-\parinterval 一个分布的信息熵也就是从该分布中得到的一个事件的期望信息量。比如，$a$、$b$、$c$、$d$四支球队，四支队伍夺冠的概率分别是$\funp{P}_1$、$\funp{P}_2$、$\funp{P}_3$、$\funp{P}_4$，某个人对比赛不感兴趣但是又想知道哪只球队夺冠，通过使用二分法2次就确定哪支球队夺冠了。但假设这四只球队中$c$的实力可以碾压其他球队，那么猜1次就可以确定。所以对于前面这种情况，哪只球队夺冠的信息量较高，信息熵也相对较高；对于后面这种情况，因为结果是容易猜到的，信息量和信息熵也就相对较低。因此可以得知：分布越尖锐熵越低；分布越均匀熵越高。
+\parinterval 一个分布的信息熵也就是从该分布中得到的一个事件的期望信息量。比如，$a$、$b$、$c$、$d$四支球队，四支队伍夺冠的概率分别是$\funp{P}_1$、$\funp{P}_2$、$\funp{P}_3$、$\funp{P}_4$，某个人对比赛不感兴趣但是又想知道哪只球队夺冠，通过使用二分法2次就确定哪支球队夺冠了。但假设这四只球队中$c$的实力可以碾压其他球队，那么猜1次就可以确定。所以对于前面这种情况，哪只球队夺冠的信息量较高，信息熵也相对较高；对于后面这种情况，因为结果是容易猜到的，信息量和信息熵也就相对较低。因此可以得知：分布越尖锐熵越低，分布越均匀熵越高。
 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
 %----------------------------------------------------------------------------------------
-\subsubsection{2.KL距离}
+\subsubsection{2. KL距离}
-\parinterval 如果同一个随机变量$X$上有两个概率分布$\funp{P}(x)$和$\funp{Q}(x)$，那么可以使用{\small\bfnew{Kullback-Leibler距离}}\index{Kullback-Leibler距离}或{\small\bfnew{KL距离}}\index{KL距离}（KL Distance\index{KL Distance}）来衡量这两个分布的不同（也称作KL 散度），这种度量就是{\small\bfnew{相对熵}}\index{相对熵}（Relative Entropy）\index{Relative Entropy}。其公式如下：
+\parinterval 如果同一个随机变量$X$上有两个概率分布$\funp{P}(x)$和$\funp{Q}(x)$，那么可以使用{\small\bfnew{Kullback-Leibler距离}}\index{Kullback-Leibler距离}或{\small\bfnew{KL距离}}\index{KL距离}（KL Distance\index{KL Distance}）来衡量这两个分布的不同（也称作KL 散度）。这种度量就是{\small\bfnew{相对熵}}\index{相对熵}（Relative Entropy）\index{Relative Entropy}，其公式如下：
 \begin{eqnarray}
 \funp{D}_{\textrm{KL}}(\funp{P}\parallel \funp{Q}) & = & \sum_{x \in X} [ \funp{P}(x)\log \frac{\funp{P}(x) }{ \funp{Q}(x) } ]  \nonumber \\
                                                                                       & = & \sum_{x \in X }[ \funp{P}(x)(\log \funp{P}(x)-\log \funp{Q}(x))]
@@ -288,7 +287,7 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 %    NEW SUBSUB-SECTION
 %----------------------------------------------------------------------------------------
-\subsubsection{3.交叉熵}
+\subsubsection{3. 交叉熵}
 \parinterval {\small\bfnew{交叉熵}}\index{交叉熵}（Cross-entropy）\index{Cross-entropy}是一个与KL距离密切相关的概念，它的公式是：
 \begin{eqnarray}
@@ -305,7 +304,7 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \sectionnewpage
 \section{掷骰子游戏}
-\parinterval 在阐述统计建模方法前，先看一个有趣的实例（图\ref{fig:2-5}）。掷骰子，一个生活中比较常见的游戏，掷一个骰子，玩家猜一个数字，猜中就算赢，按照常识来说，随便选一个数字，获胜的概率是一样的，即所有选择的获胜概率都是$1/6$。因此这个游戏玩家很难获胜，除非运气很好。假设进行一次游戏，玩家随意选了一个数字，比如是1。当投掷30次骰子（如图\ref{fig:2-5}），发现运气不错，命中7次，好于预期（$7/30 > 1/6$）。
+\parinterval 在阐述统计建模方法前，先看一个有趣的实例（图\ref{fig:2-5}）。掷骰子，一个生活中比较常见的游戏，掷一个骰子，玩家猜一个数字，猜中就算赢。按照常识来说，随便选一个数字，获胜的概率是一样的，即所有选择的获胜概率都是$1/6$。因此这个游戏玩家很难获胜，除非运气很好。假设进行一次游戏，玩家随意选了一个数字，比如是1。当投掷30次骰子（如图\ref{fig:2-5}），发现运气不错，命中7次，好于预期（$7/30 > 1/6$）。
 \vspace{-0.5em}
 %----------------------------------------------
@@ -318,7 +317,7 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \end{figure}
 %-------------------------------------------
-\parinterval 此时玩家的胜利似乎只能来源于运气。不过，这里的假设“随便选一个数字，获胜的概率是一样的”本身就是一个概率模型，它对骰子的六个面的出现做了均匀分布假设：
+\parinterval 此时玩家的胜利似乎只能来源于运气。不过，这里的假设“随便选一个数字，获胜的概率是一样的”本身就是一个概率模型，它对骰子六个面的出现做了均匀分布假设：
 \begin{eqnarray}
 \funp{P}(\text{1})=\funp{P}(\text{2})= \ldots =\funp{P}(\text{5})=\funp{P}(\text{6})=1/6
 \label{eq:2-17}
@@ -336,13 +335,13 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \label{eq:2-18}
 \end{eqnarray}
-\noindent 这里，$\theta_1 \sim \theta_5$可以被看作是模型的参数，因此这个模型的自由度是5。对于这样的模型，参数确定了，模型也就确定了。但是一个新的问题出现了，在定义骰子每个面的概率后，如何求出具体的概率值呢？一种常用的方法是，从大量实例中学习模型参数，这个方法也是常说的{\small\bfnew{参数估计}}\index{参数估计}（Parameter Estimation）\index{Parameter Estimation}。可以将这个不均匀的骰子先实验性地掷很多次，这可以被看作是独立同分布的若干次采样。比如投掷$X$ 次，发现1出现$X_1$ 次，2出现$X_2$ 次，以此类推，可以得到各个面出现的次数。假设掷骰子中每个面出现的概率符合多项式分布，那么通过简单的概率论知识可以知道每个面出现概率的极大似然估计为：
+\noindent 这里，$\theta_1 \sim \theta_5$可以被看作是模型的参数，因此这个模型的自由度是5。对于这样的模型，参数确定了，模型也就确定了。但是一个新的问题出现了，在定义骰子每个面的概率后，如何求出具体的概率值呢？一种常用的方法是，从大量实例中学习模型参数，这个方法也是常说的{\small\bfnew{参数估计}}\index{参数估计}（Parameter Estimation）\index{Parameter Estimation}。可以将这个不均匀的骰子先实验性地掷很多次，这可以被看作是独立同分布的若干次采样。比如投掷骰子$X$次，发现1出现$X_1$ 次，2出现$X_2$ 次，以此类推，可以得到各个面出现的次数。假设掷骰子中每个面出现的概率符合多项式分布，那么通过简单的概率论知识可以知道每个面出现概率的极大似然估计为：
 \begin{eqnarray}
 \funp{P}(i)=\frac {X_i}{X}
 \label{eq:2-19}
 \end{eqnarray}
-\parinterval 当$X$足够大的时，$\frac{X_i}{X}$可以无限逼近$\funp{P}(i)$的真实值，因此可以通过大量的实验推算出掷骰子各个面的概率的准确估计值。
+\parinterval 当$X$足够大时，$\frac{X_i}{X}$可以无限逼近$\funp{P}(i)$的真实值，因此可以通过大量的实验推算出掷骰子各个面的概率的准确估计值。
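+\parinterval The estimate in Equation \ref{eq:2-19} amounts to counting. A minimal Python sketch follows; the function name \texttt{relative\_frequency} is hypothetical and the list of rolls is made up for illustration, not taken from the figures in this chapter.

```python
from collections import Counter

def relative_frequency(rolls, faces=range(1, 7)):
    """Maximum likelihood estimate P(i) = X_i / X from observed rolls."""
    counts = Counter(rolls)
    total = len(rolls)
    return {face: counts[face] / total for face in faces}

rolls = [1, 4, 4, 6, 4, 2, 4, 3, 4, 5]   # hypothetical observations
estimate = relative_frequency(rolls)
```

+\noindent Here face 4 appears 5 times in 10 rolls, so its estimated probability is 0.5, and the six estimates sum to 1.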
 \parinterval 回归到原始的问题，如果在正式开始游戏前，预先掷骰子30次，得到如图\ref{fig:2-6}的结果。
@@ -430,13 +429,13 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \label{eq:2-20}
 \end{eqnarray}
-\noindent 其中，$V$为词汇表。本质上，这个方法和计算单词出现概率$\funp{P}(w_i)$的方法是一样的。但是这里的问题是：当$m$较大时，词串$w_1 w_2 \ldots w_m$可能非常低频，甚至在数据中没有出现过。这时，由于$\textrm{count}(w_1 w_2 \ldots w_m) \approx 0$，公式\ref{eq:seq-mle}的结果会不准确，甚至产生0概率的情况。这是观测低频事件时经常出现的问题。对于这个问题，另一种概思路是对多个联合出现的事件进行独立性假设，这里可以假设$w_1$、$w_2\ldots w_m$的出现是相互独立的，于是：
+\noindent 其中，$V$为词汇表。本质上，这个方法和计算单词出现概率$\funp{P}(w_i)$的方法是一样的。但是这里的问题是：当$m$较大时，词串$w_1 w_2 \ldots w_m$可能非常低频，甚至在数据中没有出现过。这时，由于$\textrm{count}(w_1 w_2 \ldots w_m) \approx 0$，公式\eqref{eq:seq-mle}的结果会不准确，甚至产生0概率的情况。这是观测低频事件时经常出现的问题。对于这个问题，另一种思路是对多个联合出现的事件进行独立性假设，这里可以假设$w_1$、$w_2\ldots w_m$的出现是相互独立的，于是：
 \begin{eqnarray}
 \funp{P}(w_1 w_2 \ldots w_m) & = & \funp{P}(w_1) \funp{P}(w_2) \ldots \funp{P}(w_m) \label{eq:seq-independ}
 \label{eq:2-21}
 \end{eqnarray}
-\noindent 这样，单词序列的出现的概率被转化为每个单词概率的乘积。由于单词的概率估计是相对准确的，因此整个序列的概率会比较合理。但是，这种方法的独立性假设也破坏了句子中单词之间的依赖关系，造成概率估计结果的偏差。那如何更加合理的计算一个单词序列的概率呢？下面即将介绍的$n$-gram语言建模方法可以很好地回答这个问题。
+\noindent 这样，单词序列的出现的概率被转化为每个单词概率的乘积。由于单词的概率估计是相对准确的，因此整个序列的概率会比较合理。但是，这种独立性假设也破坏了句子中单词之间的依赖关系，造成概率估计结果的偏差。那如何更加合理的计算一个单词序列的概率呢？下面介绍的$n$-gram语言建模方法可以很好地回答这个问题。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -469,9 +468,9 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \label{eq:2-22}
 \end{eqnarray}
-这样，$w_1 w_2 \ldots w_m$的生成可以被看作是逐个生成每个单词的过程，即首先生成$w_1$，然后根据$w_1$再生成$w_2$，然后根据$w_1 w_2$再生成$w_3$，以此类推，直到根据所有前$m-1$个词生成序列的最后一个单词$w_m$。这个模型把联合概率$\funp{P}(w_1 w_2 \ldots w_m)$分解为多个条件概率的乘积，虽然对生成序列的过程进行了分解，但是模型的复杂度和以前是一样的，比如，$\funp{P}(w_m|w_1 w_2 \ldots w_{m-1})$ 仍然不好计算。
+\parinterval 这样，$w_1 w_2 \ldots w_m$的生成可以被看作是逐个生成每个单词的过程，即首先生成$w_1$，然后根据$w_1$再生成$w_2$，然后根据$w_1 w_2$再生成$w_3$，以此类推，直到根据所有前$m-1$个词生成序列的最后一个单词$w_m$。这个模型把联合概率$\funp{P}(w_1 w_2 \ldots w_m)$分解为多个条件概率的乘积，虽然对生成序列的过程进行了分解，但是模型的复杂度和以前是一样的，比如，$\funp{P}(w_m|w_1 w_2 \ldots w_{m-1})$ 仍然不好计算。
-\parinterval 换一个角度看，$\funp{P}(w_m|w_1 w_2 \ldots w_{m-1})$体现了一种基于“历史”的单词生成模型，也就是把前面生成的所有单词作为“历史”，并参考这个“历史”生成当前单词。但是这个“历史”的长度和整个序列长度是相关的，也是一种长度变化的历史序列。为了化简问题，一种简单的想法是使用定长历史，比如，每次只考虑前面$n-1$个历史单词来生成当前单词。这就是$n$-gram语言模型，其中$n$-gram 表示$n$个连续的单词构成的单元，也被称作{\small\bfnew{n元语法单元}}\index{n元语法单元}。这个模型的数学描述如下：
+\parinterval 换一个角度看，$\funp{P}(w_m|w_1 w_2 \ldots w_{m-1})$体现了一种基于“历史”的单词生成模型，也就是把前面生成的所有单词作为“历史”，并参考这个“历史”生成当前单词。但是这个“历史”的长度和整个序列长度是相关的，也是一种长度变化的历史序列。为了化简问题，一种简单的想法是使用定长历史，比如，每次只考虑前面$n-1$个历史单词来生成当前单词。这就是$n$-gram语言模型，其中$n$-gram 表示$n$个连续单词构成的单元，也被称作{\small\bfnew{n元语法单元}}\index{n元语法单元}。这个模型的数学描述如下：
 \begin{eqnarray}
 \funp{P}(w_m|w_1 w_2 \ldots w_{m-1}) = \funp{P}(w_m|w_{m-n+1} \ldots w_{m-1})
 \label{eq:2-23}
@@ -482,7 +481,7 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 %------------------------------------------------------
 \begin{center}
 {\footnotesize
-\begin{tabular}{l|l|l l|l}
+\begin{tabular}{l|l|l |l|l}
 链式法则 & 1-gram & 2-gram & $ \ldots $ & $n$-gram\\
 \hline
 \rule{0pt}{10pt} $\funp{P}(w_1 w_2 \ldots w_m)$ = & $\funp{P}(w_1 w_2 \ldots w_m)$ = & $\funp{P}(w_1 w_2 \ldots w_m)$ = & $ \ldots $ & $\funp{P}(w_1 w_2 \ldots w_m)$ = \\
@@ -497,7 +496,7 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \end{center}
 %------------------------------------------------------
-\parinterval 可以看到，1-gram语言模型只是$n$-gram语言模型的一种特殊形式。基于独立性假设，1-gram假定当前单词出现与否与任何历史都无关，这种方法大大化简了求解句子概率的复杂度。比如，上一节中公式\ref{eq:seq-independ}就是一个1-gram语言模型。但是，句子中的单词并非完全相互独立的，这种独立性假设并不能完美地描述客观世界的问题。如果需要更精确地获取句子的概率，就需要使用更长的“历史”信息，比如，2-gram、3-gram、甚至更高阶的语言模型。
+\parinterval 可以看到，1-gram语言模型只是$n$-gram语言模型的一种特殊形式。基于独立性假设，1-gram假定当前单词出现与否与任何历史都无关，这种方法大大化简了求解句子概率的复杂度。比如，上一节中公式\eqref{eq:seq-independ}就是一个1-gram语言模型。但是，句子中的单词并非完全相互独立的，这种独立性假设并不能完美地描述客观世界的问题。如果需要更精确地获取句子的概率，就需要使用更长的“历史”信息，比如，2-gram、3-gram、甚至更高阶的语言模型。
 \parinterval $n$-gram的优点在于，它所使用的历史信息是有限的，即$n-1$个单词。这种性质也反映了经典的马尔可夫链的思想\upcite{liuke-markov-2004,resnick1992adventures}，有时也被称作马尔可夫假设或者马尔可夫属性。因此$n$-gram也可以被看作是变长序列上的一种马尔可夫模型，比如，2-gram语言模型对应着1阶马尔可夫模型，3-gram语言模型对应着2阶马尔可夫模型，以此类推。
@@ -512,7 +511,7 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \vspace{0.5em}
 \end{eqnarray}
-其中，$\textrm{count}(\cdot)$是在训练数据中统计频次的函数。
+\noindent 其中，$\textrm{count}(\cdot)$是在训练数据中统计频次的函数。
 \vspace{0.5em}
 \item {\small\bfnew{人工神经网络方法}}\index{人工神经网络方法}。构建一个人工神经网络估计$\funp{P}(w_m|w_{m-n+1}  \ldots  w_{m-1})$的值，比如，可以构建一个前馈神经网络来对$n$-gram进行建模。
@@ -523,8 +522,7 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \parinterval $n$-gram语言模型的使用非常简单。可以直接用它来对词序列出现的概率进行计算。比如，可以使用一个2-gram语言模型计算一个句子出现的概率，其中单词之间用斜杠分隔，如下：
 \begin{eqnarray}
- & &\funp{P}_{2-\textrm{gram}}{(\textrm{确实/现在/数据/很
+ & &\funp{P}_{2\textrm{-gram}}{(\textrm{确实/现在/数据/很/多})} \nonumber \\
-/多})} \nonumber \\
 &= & \funp{P}(\textrm{确实}) \times \funp{P}(\textrm{现在}|\textrm{确实})\times \funp{P}(\textrm{数据}|\textrm{现在}) \times \nonumber \\
 &  & \funp{P}(\textrm{很}|\textrm{数据})\times \funp{P}(\textrm{多}|\textrm{很})
 \label{eq:2-25}
@@ -538,17 +536,17 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \subsection{参数估计和平滑算法}
-对于$n$-gram语言模型，每个$\funp{P}(w_m|w_{m-n+1} \ldots w_{m-1})$都可以被看作是模型的{\small\bfnew{参数}}\index{参数}（Parameter\index{Parameter}）。而$n$-gram语言模型的一个核心任务是估计这些参数的值，即参数估计。通常，参数估计可以通过在数据上的统计得到。一种简单的方法是：给定一定数量的句子，统计每个$n$-gram 出现的频次，并利用公式\ref{eq:2-24}得到每个参数$\funp{P}(w_m|w_{m-n+1} \ldots w_{m-1})$的值。这个过程也被称作模型的{\small\bfnew{训练}}\index{训练}（Training\index{训练}）。对于自然语言处理任务来说，统计模型的训练是至关重要的。在本书后面的内容中也会看到，不同的问题可能需要不同的模型以及不同的模型训练方法，并且很多研究工作也都集中在优化模型训练的效果上。
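+\parinterval For an $n$-gram language model, each $\funp{P}(w_m|w_{m-n+1} \ldots w_{m-1})$ can be regarded as a {\small\bfnew{parameter}}\index{参数}\index{Parameter} of the model. A core task for an $n$-gram language model is to estimate the values of these parameters, i.e., parameter estimation. Parameters are usually estimated from statistics over data. A simple approach is: given a certain number of sentences, count how often each $n$-gram occurs, and use Equation \eqref{eq:2-24} to obtain the value of each parameter $\funp{P}(w_m|w_{m-n+1} \ldots w_{m-1})$. This process is also called {\small\bfnew{training}}\index{训练}\index{Training} the model. For natural language processing tasks, training statistical models is crucial. As later parts of this book show, different problems may call for different models and different training methods, and much research focuses on improving the effectiveness of model training.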
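The counting procedure described above can be sketched in a few lines of Python for the 2-gram case. This is only an illustrative sketch: the function name `train_bigram_mle` and the toy corpus are assumptions, not part of the book, and the history count is taken only over positions that are followed by another word.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Estimate P(w_m | w_{m-1}) by relative frequency (2-gram case)."""
    history_counts = Counter()   # count(w_{m-1}), only where a next word exists
    bigram_counts = Counter()    # count(w_{m-1} w_m)
    for words in sentences:
        for prev, cur in zip(words, words[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    # Each parameter is count(w_{m-1} w_m) / count(w_{m-1}).
    return {bg: c / history_counts[bg[0]] for bg, c in bigram_counts.items()}

# Hypothetical toy corpus, already segmented into words.
corpus = [
    ["确实", "现在", "数据", "很", "多"],
    ["现在", "数据", "很", "少"],
]
params = train_bigram_mle(corpus)
```

Any 2-gram that never occurs in the corpus simply has no entry in `params`, which is exactly the zero-probability problem that smoothing is meant to address.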
-\parinterval 回到$n$-gram语言模型上。前面所使用的参数估计方法并不完美，因为它无法很好的处理低频或者未见现象。比如，在式\ref{eq:2-25}所示的例子中，如果语料中从没有“确实”和“现在”两个词连续出现的情况，即$\textrm{count}(\textrm{确实}\ \textrm{现在})=0$。 那么使用2-gram 计算句子“确实/现在/数据/很多”的概率时，会出现如下情况：
+\parinterval 回到$n$-gram语言模型上。前面所使用的参数估计方法并不完美，因为它无法很好地处理低频或者未见现象。比如，在式\eqref{eq:2-25}所示的例子中，如果语料中从没有“确实”和“现在”两个词连续出现的情况，即$\textrm{count}(\textrm{确实}/\textrm{现在})=0$。 那么使用2-gram 计算句子“确实/现在/数据/很/多”的概率时，会出现如下情况：
 \begin{eqnarray}
-\funp{P}(\textrm{现在}|\textrm{确实}) & =  & \frac{\textrm{count}(\textrm{确实}\ \textrm{现在})}{\textrm{count}(\textrm{确实})} \nonumber \\
+\funp{P}(\textrm{现在}|\textrm{确实}) & =  & \frac{\textrm{count}(\textrm{确实}/\textrm{现在})}{\textrm{count}(\textrm{确实})} \nonumber \\
                                                                     & =  & \frac{0}{\textrm{count}(\textrm{确实})} \nonumber \\
                                                                     & =  & 0
 \label{eq:2-26}
 \end{eqnarray}
-\parinterval 显然，这个结果是不合理的。因为即使语料中没有 “确实”和“现在”两个词连续出现，这种搭配也是客观存在的。这时简单的用极大似然估计得到概率却是0，导致整个句子出现的概率为0。 更常见的问题是那些根本没有出现在词表中的词，称为{\small\sffamily\bfseries{未登录词}}\index{未登录词}（Out-of-vocabulary Word，OOV Word）\index{Out-of-vocabulary Word，OOV Word}，比如一些生僻词，可能模型训练阶段从来没有看到过，这时模型仍然会给出0 概率。图\ref{fig:2-11}展示了一个真实语料库中词语出现频次的分布，可以看到绝大多数词都是低频词。
+\parinterval 显然，这个结果是不合理的。因为即使语料中没有 “确实”和“现在”两个词连续出现，这种搭配也是客观存在的。这时简单地用极大似然估计得到概率却是0，导致整个句子出现的概率为0。 更常见的问题是那些根本没有出现在词表中的词，称为{\small\sffamily\bfseries{未登录词}}\index{未登录词}（Out-of-vocabulary Word，OOV Word）\index{Out-of-vocabulary Word，OOV Word}，比如一些生僻词，可能模型训练阶段从来没有看到过，这时模型仍然会给出0 概率。图\ref{fig:2-11}展示了一个真实语料库中词语出现频次的分布，可以看到绝大多数词都是低频词。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -567,14 +565,14 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 %    NEW SUBSUB-SECTION
 %----------------------------------------------------------------------------------------
-\subsubsection{1.加法平滑方法}
+\subsubsection{1. 加法平滑方法}
-\parinterval {\small\bfnew{加法平滑}}\index{加法平滑}（Additive Smoothing）\index{Additive Smoothing}是一种简单的平滑技术。本小节首先介绍这一方法，希望通过它了解平滑算法的思想。通常情况下，系统研发者会利用采集到的语料库来模拟真实的全部语料库。当然，没有一个语料库能覆盖所有的语言现象。假设有一个语料库$C$，其中从未出现“确实\ 现在”这样的2-gram，现在要计算一个句子$S$ =“确实/现在/物价/很高”的概率。当计算“确实\ 现在”的概率时，$\funp{P}(S) = 0$，导致整个句子的概率为0。
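+\parinterval {\small\bfnew{Additive smoothing}}\index{加法平滑}\index{Additive Smoothing} is a simple smoothing technique. This subsection introduces it first, as a way to convey the idea behind smoothing algorithms. In practice, system developers use a collected corpus as a stand-in for the full body of real language data. Of course, no corpus covers every linguistic phenomenon. Suppose a corpus $C$ never contains the 2-gram “确实/现在”, and the probability of the sentence $S$ = “确实/现在/物价/很/高” is to be computed. Because $\funp{P}(\textrm{现在}|\textrm{确实}) = 0$, the probability of the whole sentence is $\funp{P}(S) = 0$.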
-\parinterval 加法平滑方法假设每个$n$-gram出现的次数比实际统计次数多$\theta$次，$0 \le \theta\le 1$。这样，计算概率的时候分子部分不会为0。重新计算$\funp{P}(\textrm{现在}|\textrm{确实})$，可以得到：
+\parinterval 加法平滑方法假设每个$n$-gram出现的次数比实际统计次数多$\theta$次，$0 < \theta\le 1$。这样，计算概率的时候分子部分不会为0。重新计算$\funp{P}(\textrm{现在}|\textrm{确实})$，可以得到：
 \begin{eqnarray}
-\funp{P}(\textrm{现在}|\textrm{确实}) & =  & \frac{\theta + \textrm{count}(\textrm{确实\ 现在})}{\sum_{w}^{|V|}(\theta + \textrm{count}(\textrm{确实\ }w))} \nonumber \\
+\funp{P}(\textrm{现在}|\textrm{确实}) & =  & \frac{\theta + \textrm{count}(\textrm{确实/现在})}{\sum_{w}^{|V|}(\theta + \textrm{count}(\textrm{确实/}w))} \nonumber \\
-                                                             & =  & \frac{\theta + \textrm{count}(\textrm{确实\ 现在})}{\theta{|V|} + \textrm{count}(\textrm{确实})}
+                                                             & =  & \frac{\theta + \textrm{count}(\textrm{确实/现在})}{\theta{|V|} + \textrm{count}(\textrm{确实})}
 \label{eq:2-27}
 \end{eqnarray}
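As a minimal sketch of Equation \eqref{eq:2-27}, the smoothed probability can be computed directly from the two counts and the vocabulary size. The function name `additive_prob` and all the numbers below are hypothetical and chosen only for illustration.

```python
def additive_prob(bigram_count, history_count, vocab_size, theta=1.0):
    """P(w | h) = (theta + count(h w)) / (theta * |V| + count(h))."""
    if not 0 < theta <= 1:
        raise ValueError("theta is expected in (0, 1]")
    return (theta + bigram_count) / (theta * vocab_size + history_count)

# Hypothetical numbers: count(确实) = 10, count(确实/现在) = 0, |V| = 20.
p_unseen = additive_prob(0, 10, 20, theta=1.0)   # (1 + 0) / (20 + 10)
p_seen = additive_prob(4, 10, 20, theta=1.0)     # (1 + 4) / (20 + 10)
```

The unseen 2-gram now receives a small non-zero probability, and the probabilities over the whole vocabulary still sum to 1 because $\theta|V|$ is added to the denominator.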
@@ -595,12 +593,12 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 %    NEW SUBSUB-SECTION
 %----------------------------------------------------------------------------------------
-\subsubsection{2.古德-图灵估计法}
+\subsubsection{2. 古德-图灵估计法}
 \vspace{-0.5em}
-\parinterval {\small\bfnew{古德-图灵估计法}}\index{古德-图灵估计法}（Good-Turing Estimate）\index{Good-Turing Estimate}是Alan Turing和他的助手Irving John Good开发的，作为他们在二战期间破解德国密码机Enigma所使用的方法的一部分，在1953 年Irving John Good将其发表。这一方法也是很多平滑算法的核心，其基本思路是：把非零的$n$元语法单元的概率降低匀给一些低概率$n$元语法单元，以减小最大似然估计与真实概率之间的偏离\upcite{good1953population,gale1995good}。
+\parinterval {\small\bfnew{古德-图灵估计法}}\index{古德-图灵估计法}（Good-Turing Estimate）\index{Good-Turing Estimate}是Alan Turing和他的助手Irving John Good开发的，作为他们在二战期间破解德国密码机Enigma所使用方法的一部分，在1953 年Irving John Good将其发表。这一方法也是很多平滑算法的核心，其基本思路是：把非零的$n$元语法单元的概率降低，匀给一些低概率$n$元语法单元，以减小最大似然估计与真实概率之间的偏离\upcite{good1953population,gale1995good}。
-\parinterval 假定在语料库中出现$r$次的$n$-gram有$n_r$个，特别的，出现0次的$n$-gram（即未登录词及词串）出现的次数为$n_0$个。语料库中全部单词的总个数为$N$，显然：
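+\parinterval Suppose that $n_r$ distinct $n$-grams occur exactly $r$ times in the corpus; in particular, $n_0$ $n$-grams occur 0 times (i.e., out-of-vocabulary words and unseen word strings). Let $N$ be the total number of $n$-gram occurrences in the corpus (the total number of words when $n=1$); clearly: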
 \begin{eqnarray}
 N = \sum_{r=1}^{\infty}{r\,n_r}
 \label{eq:2-28}
@@ -612,7 +610,7 @@ r^* = (r + 1)\frac{n_{r + 1}}{n_r}
 \label{eq:2-29}
 \end{eqnarray}
-\parinterval 基于这个公式，就可以估计所有0次$n$-gram的频次$n_0 r^*=(r+1)n_1=n_1$。要把这个重新估计的统计数转化为概率，需要进行归一化处理：对于每个统计数为$r$的事件，其概率为：
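+\parinterval Based on this formula, the total adjusted count of all $n$-grams that occur 0 times can be estimated by setting $r=0$: $n_0 r^*=(r+1)n_1=n_1$. To turn these re-estimated counts into probabilities, they must be normalized. For each event with count $r$, the probability is: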
 \begin{eqnarray}
 \funp{P}_r=\frac{r^*}{N}
 \label{eq:2-30}
@@ -626,7 +624,7 @@ N & = & \sum_{r=0}^{\infty}{r^{*}n_r} \nonumber \\
 \label{eq:2-31}
 \end{eqnarray}
-也就是说，$N$仍然为这个整个样本分布最初的计数。所有出现事件（即$r > 0$）的概率之和为：
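+\parinterval In other words, the $N$ used in Equation \eqref{eq:2-31} is still the original total count of the whole sample distribution. The probabilities of all observed events (i.e., $r > 0$) sum to: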
 \begin{eqnarray}
 \funp{P}(r>0) & = & \sum_{r>0}{\funp{P}_r} \nonumber \\
                & = & 1 - \frac{n_1}{N} \nonumber \\
@@ -636,7 +634,7 @@ N & = & \sum_{r=0}^{\infty}{r^{*}n_r} \nonumber \\
 \noindent 其中$n_1/N$就是分配给所有出现为0次事件的概率。古德-图灵方法最终通过出现1次的$n$-gram估计了出现为0次的事件概率，达到了平滑的效果。
-\parinterval 这里使用一个例子来说明这个方法是如何对事件出现的可能性进行平滑的。仍然考虑在加法平滑法中统计单词的例子，根据古德-图灵方法进行修正如表\ref{tab:2-2}所示。
+\parinterval 下面通过一个例子来说明这个方法是如何对事件出现的可能性进行平滑的。仍然考虑在加法平滑法中统计单词的例子，根据古德-图灵方法进行修正如表\ref{tab:2-2}所示。
 %------------------------------------------------------
 \begin{table}[htp]{
@@ -658,23 +656,23 @@ N & = & \sum_{r=0}^{\infty}{r^{*}n_r} \nonumber \\
 %------------------------------------------------------
 %\vspace{-1.5em}
-\parinterval 当$r$很大的时候经常会出现$n_{r+1}=0$的情况，而且这时$n_r$也会有噪音存在。通常，简单的古德-图灵方法可能无法很好的处理这种复杂的情况，不过古德-图灵方法仍然是其他一些平滑方法的基础。
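+\parinterval However, when $r$ is large, $n_{r+1}=0$ occurs frequently, and the adjusted count $r^*$ in Equation \eqref{eq:2-29} then becomes 0. The Good-Turing method on its own usually cannot handle this situation well, although it remains the basis of several other smoothing methods.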
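The Good-Turing re-estimation can be sketched as follows. The function name `good_turing` and the toy counts are assumptions made for illustration; the code only applies Equations \eqref{eq:2-28} and \eqref{eq:2-29} and the $n_1/N$ mass for unseen events, without any of the refinements used in practice.

```python
from collections import Counter

def good_turing(counts):
    """Return (r_star, p_zero) from a mapping item -> observed count r.

    r_star[r] = (r + 1) * n_{r+1} / n_r
    p_zero    = n_1 / N   (probability mass given to unseen events)
    """
    n = Counter(counts.values())                    # n[r]: items seen r times
    total = sum(r * n_r for r, n_r in n.items())    # N = sum_r r * n_r
    r_star = {r: (r + 1) * n.get(r + 1, 0) / n_r for r, n_r in n.items()}
    return r_star, n.get(1, 0) / total

# Hypothetical counts: n_1 = 3, n_2 = 2, n_3 = 1, so N = 10.
toy_counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
r_star, p_zero = good_turing(toy_counts)
```

In this toy data the largest count $r=3$ has $n_4=0$, so its adjusted count drops to 0, which illustrates the weakness for large $r$ mentioned above.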
 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
 %----------------------------------------------------------------------------------------
-\subsubsection{3.Kneser-Ney平滑方法}
+\subsubsection{3. Kneser-Ney平滑方法}
 \parinterval Kneser-Ney平滑方法是由Reinhard Kneser和Hermann Ney于1995年提出的用于计算$n$元语法概率分布的方法\upcite{kneser1995improved,chen1999empirical}，并被广泛认为是最有效的平滑方法之一。这种平滑方法改进了Absolute Discounting\upcite{ney1991smoothing,ney1994structuring}中与高阶分布相结合的低阶分布的计算方法，使不同阶分布得到充分的利用。这种算法也综合利用了其他多种平滑算法的思想。
 \parinterval 首先介绍一下Absolute Discounting平滑算法，公式如下所示：
 \begin{eqnarray}
-\funp{P}_{\textrm{AbsDiscount}}(w_i | w_{i-1}) = \frac{c(w_{i-1},w_i )-d}{c(w_{i-1})} + \lambda(w_{i-1})\funp{P}(w_{i})
+\funp{P}_{\textrm{AbsDiscount}}(w_i | w_{i-1}) = \frac{c(w_{i-1} w_i )-d}{c(w_{i-1})} + \lambda(w_{i-1})\funp{P}(w_{i})
 \label{eq:2-33}
 \end{eqnarray}
-\noindent 其中$d$表示被裁剪的值，$\lambda$是一个正则化常数，$c(\cdot)$是count$(\cdot)$的缩写。可以看到第一项是经过减值调整过的2-gram的概率值，第二项则相当于一个带权重$\lambda$的1-gram的插值项。然而这种插值模型极易受到原始1-gram 模型的干扰。
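+\noindent where $d$ is the amount discounted from each count, $\lambda$ is a normalizing constant, and $c(\cdot)$ is shorthand for count$(\cdot)$. The first term is the discounted 2-gram probability, and the second term is a 1-gram interpolation term weighted by $\lambda$. Such an interpolated model, however, is easily dominated by the original 1-gram model $\funp{P}(w_{i})$.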
 \parinterval 假设这里使用2-gram和1-gram的插值模型预测下面句子中下划线处的词
@@ -684,64 +682,78 @@ I cannot see without my reading \underline{\ \ \ \ \ \ \ \ }
 \end{center}
 \vspace{0.0em}
-\noindent 直觉上应该会猜测这个地方的词应该是“glasses”，但是在训练语料库中“Francisco”出现的频率非常高。如果在预测时仍然使用的是标准的1-gram模型，那么系统会高概率选择“Francisco”填入下划线出，这个结果明显是不合理的。当使用的是混合的插值模型时，如果“reading Francisco”这种二元语法并没有出现在语料中，就会导致1-gram对结果的影响变大，使得仍然会做出与标准1-gram模型相同的结果，犯下相同的错误。
+\noindent 直觉上应该会猜测这个地方的词应该是“glasses”，但是在训练语料库中“Francisco”出现的频率非常高。如果在预测时仍然使用的是标准的1-gram模型，那么系统会高概率选择“Francisco”填入下划线处，这个结果显然是不合理的。当使用混合的插值模型时，如果“reading Francisco”这种二元语法并没有出现在语料中，就会导致1-gram对结果的影响变大，仍然会做出与标准1-gram模型相同的结果，犯下相同的错误。
 \parinterval 观察语料中的2-gram发现，“Francisco”的前一个词仅可能是“San”，不会出现“reading”。这个分析证实了，考虑前一个词的影响是有帮助的，比如仅在前一个词是“San”时，才给“Francisco”赋予一个较高的概率值。基于这种想法，改进原有的1-gram模型，创造一个新的1-gram模型$\funp{P}_{\textrm{continuation}}$，简写为$\funp{P}_{\textrm{cont}}$。这个模型可以通过考虑前一个词的影响评估当前词作为第二个词出现的可能性。
-\parinterval 为了评估$\funp{P}_{\textrm{cont}}$，统计使用当前词作为第二个词所出现2-gram的种类，2-gram法种类越多，这个词作为第二个词出现的可能性越高，呈正比：
+\parinterval 为了评估$\funp{P}_{\textrm{cont}}$，统计使用当前词作为第二个词所出现2-gram的种类，2-gram种类越多，这个词作为第二个词出现的可能性越高，呈正比：
 \begin{eqnarray}
-\funp{P}_{\textrm{cont}}(w_i) \varpropto |w_{i-1}: c(w_{i-1},w_i )>0|
+\funp{P}_{\textrm{cont}}(w_i) \varpropto |\{w_{i-1}: c(w_{i-1} w_i )>0\}|
 \label{eq:2-34}
 \end{eqnarray}
-通过全部的二元语法的种类做归一化可得到评估的公式：
+其中，公式\eqref{eq:2-34}右端表示求出在$w_i$之前出现过的$w_{i-1}$的数量。接下来通过对全部的二元语法单元的种类做归一化可得到评估公式：
 \begin{eqnarray}
-\funp{P}_{\textrm{cont}}(w_i) = \frac{|\{ w_{i-1}:c(w_{i-1},w_i )>0 \}|}{|\{ (w_{j-1}, w_j):c(w_{j-1},w_j )>0 \}|}
+\funp{P}_{\textrm{cont}}(w_i) = \frac{|\{ w_{i-1}:c(w_{i-1} w_i )>0 \}|}{|\{ (w_{j-1},w_j):c(w_{j-1} w_j )>0 \}|}
 \label{eq:2-35}
 \end{eqnarray}
-\parinterval 基于分母的变化还有另一种形式：
+\parinterval 分母中对二元语法单元种类的统计还可以写为另一种形式：
 \begin{eqnarray}
-\funp{P}_{\textrm{cont}}(w_i) = \frac{|\{ w_{i-1}:c(w_{i-1},w_i )>0 \}|}{\sum_{w^{\prime}_{i}}|\{ w_{i-1}^{\prime}:c(w_{i-1}^{\prime},w_i^{\prime} )>0 \}|}
+\funp{P}_{\textrm{cont}}(w_i) = \frac{|\{ w_{i-1}:c(w_{i-1} w_i )>0 \}|}{\sum_{w^{\prime}_{i}}|\{ w_{i-1}^{\prime}:c(w_{i-1}^{\prime} w_i^{\prime} )>0 \}|}
 \label{eq:2-36}
 \end{eqnarray}
-结合基础的Absolute discounting计算公式，从而得到了Kneser-Ney平滑方法的公式：
+\parinterval 结合基础的Absolute discounting计算公式，可以得到Kneser-Ney平滑方法的公式：
 \begin{eqnarray}
-\funp{P}_{\textrm{KN}}(w_i|w_{i-1}) = \frac{\max(c(w_{i-1},w_i )-d,0)}{c(w_{i-1})}+ \lambda(w_{i-1})\funp{P}_{\textrm{cont}}(w_i)
+\funp{P}_{\textrm{KN}}(w_i|w_{i-1}) = \frac{\max(c(w_{i-1} w_i )-d,0)}{c(w_{i-1})}+ \lambda(w_{i-1})\funp{P}_{\textrm{cont}}(w_i)
 \label{eq:2-37}
 \end{eqnarray}
 \noindent 其中：
 \begin{eqnarray}
-\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}|\{w:c(w_{i-1},w)>0\}|
+\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}|\{w_i:c(w_{i-1} w_i)>0\}|
 \label{eq:2-38}
 \end{eqnarray}
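 \noindent Here $\max(\cdot)$ guarantees that the numerator is not less than 0, the original 1-gram distribution is replaced by the distribution $\funp{P}_{\textrm{cont}}$, and $\lambda$ is the normalizing term.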
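The 2-gram Kneser-Ney formula in Equation \eqref{eq:2-37}, together with Equations \eqref{eq:2-35} and \eqref{eq:2-38}, can be sketched as below. The function name `kneser_ney_bigram`, the discount $d=0.75$, and the toy counts are all assumptions chosen for illustration, and the sketch assumes the history word has been seen at least once.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, d=0.75):
    """Return prob(cur, prev) for interpolated 2-gram Kneser-Ney smoothing."""
    history_count = Counter()          # c(w_{i-1})
    followers = defaultdict(set)       # {w_i : c(w_{i-1} w_i) > 0}
    predecessors = defaultdict(set)    # {w_{i-1} : c(w_{i-1} w_i) > 0}
    for (prev, cur), c in bigram_counts.items():
        history_count[prev] += c
        followers[prev].add(cur)
        predecessors[cur].add(prev)
    num_bigram_types = len(bigram_counts)

    def prob(cur, prev):
        # Continuation probability: share of 2-gram types ending in cur.
        p_cont = len(predecessors.get(cur, ())) / num_bigram_types
        # Normalizing weight: mass removed from the seen 2-grams of prev.
        lam = d / history_count[prev] * len(followers[prev])
        discounted = max(bigram_counts.get((prev, cur), 0) - d, 0) / history_count[prev]
        return discounted + lam * p_cont

    return prob

# Hypothetical counts: "Francisco" is frequent (5) but only follows "San";
# "glasses" is rarer (3) but follows three different words.
toy = {
    ("San", "Francisco"): 5,
    ("reading", "glasses"): 1,
    ("my", "reading"): 1,
    ("my", "glasses"): 1,
    ("new", "glasses"): 1,
}
p_kn = kneser_ney_bigram(toy)
p_glasses = p_kn("glasses", "reading")
p_francisco = p_kn("Francisco", "reading")
```

With these counts, “glasses” is preferred over “Francisco” after “reading” even though “Francisco” has the higher raw frequency, which is the behaviour the continuation probability is designed to produce.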
-\parinterval 为了更具普适性，不仅局限为2-gram和1-gram的插值模型，利用递归的方式可以得到更通用的Kneser-Ney平滑公式：
+\parinterval 为了更具普适性，不局限于2-gram和1-gram的插值模型，利用递归的方式可以得到更通用的Kneser-Ney平滑公式：
 \begin{eqnarray}
 \funp{P}_{\textrm{KN}}(w_i|w_{i-n+1}  \ldots w_{i-1}) & = & \frac{\max(c_{\textrm{KN}}(w_{i-n+1} \ldots w_{i})-d,0)}{c_{\textrm{KN}}(w_{i-n+1} \ldots w_{i-1})} + \nonumber \\
                                                   &   &  \lambda(w_{i-n+1} \ldots w_{i-1})\funp{P}_{\textrm{KN}}(w_i|w_{i-n+2} \ldots w_{i-1})
 \label{eq:2-39}
 \end{eqnarray}
 \begin{eqnarray}
-\lambda(w_{i-n+1} \ldots w_{i-1}) =  \frac{d}{c_{\textrm{KN}}(w_{i-n+1}^{i-1})}|\{w:c_{\textrm{KN}}(w_{i-n+1} \ldots w_{i-1},w)>0\}
+\lambda(w_{i-n+1} \ldots w_{i-1}) =  \frac{d}{c_{\textrm{KN}}(w_{i-n+1} \ldots w_{i-1})}|\{w_i:c_{\textrm{KN}}(w_{i-n+1} \ldots w_{i-1} w_i)>0\}|
 \label{eq:2-40}
 \end{eqnarray}
 \begin{eqnarray}
 c_{\textrm{KN}}(\cdot) = \left\{\begin{array}{ll}
-\textrm{count}(\cdot) & \textrm{for\ highest\ order}  \\
+\textrm{count}(\cdot) & \textrm{当计算最高阶模型时}  \\
-\textrm{catcount}(\cdot) & \textrm{for\ lower\ order}
+\textrm{catcount}(\cdot) & \textrm{当计算低阶模型时}
 \end{array}\right.
 \label{eq:2-41}
 \end{eqnarray}
-\noindent 其中catcount$(\cdot)$表示的是基于某个单个词作为第$n$个词的$n$-gram的种类数目。
+\noindent where catcount$(\cdot)$ is the number of distinct $n$-grams $w_{i-n+1} \ldots w_i$ in which the word $w_i$ appears as the $n$-th word.
 \parinterval Kneser-Ney平滑是很多语言模型工具的基础\upcite{heafield2011kenlm,stolcke2002srilm}。还有很多以此为基础衍生出来的算法，感兴趣的读者可以通过参考文献自行了解\upcite{parsing2009speech,ney1994structuring,chen1999empirical}。
 %----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{语言模型的评价}
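+\parinterval  When using a language model, one often needs to know its quality. {\small\sffamily\bfseries{Perplexity}}\index{困惑度} (PPL)\index{Perplexity} is a metric for measuring the quality of a language model. For a real word sequence $ w_1\dots w_m $, perplexity is defined as: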
+\begin{eqnarray}
+{\rm{PPL}}&=&\funp{P}{(w_1\dots w_m)}^{- \frac{1}{m}}
+\label{eq:5-65}
+\end{eqnarray}
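+\parinterval  In essence, PPL measures how well a language model predicts the likelihood of a sequence. If $ w_1\dots w_m $ is genuine natural language, a “perfect” model would yield $ \funp{P}(w_1\dots w_m)=1 $, which corresponds to the lowest possible perplexity, PPL=1, meaning the model predicts the occurrence of the word sequence perfectly. Real language models cannot reach PPL=1, of course; for example, on the well-known Penn Treebank (PTB) data even the best language models only reach a PPL of about 35. This shows how difficult natural language processing tasks are.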
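A minimal sketch of the perplexity definition is given below. It assumes the model's per-word conditional probabilities are already available as a list, so that their product is $\funp{P}(w_1\dots w_m)$; the function name `perplexity` is hypothetical, and the computation is done in log space to avoid numerical underflow on long sequences.

```python
import math

def perplexity(word_probs):
    """PPL = P(w_1 ... w_m)^(-1/m), computed in log space.

    word_probs holds the conditional probability the model assigns to each
    of the m words, so their product is P(w_1 ... w_m).
    """
    m = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / m)

ppl_perfect = perplexity([1.0, 1.0, 1.0])        # every word predicted with certainty
ppl_uniform = perplexity([0.25, 0.25, 0.25, 0.25])  # like guessing among 4 equally likely words
```

A model that assigns probability 0.25 to every word has a perplexity of 4, which matches the common reading of perplexity as an effective number of equally likely choices per word.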
+%----------------------------------------------------------------------------------------
 %    NEW SECTION
 %----------------------------------------------------------------------------------------
@@ -762,7 +774,7 @@ c_{\textrm{KN}}(\cdot) = \left\{\begin{array}{ll}
 \begin{itemize}
 \vspace{0.5em}
-\item 预测输入句子的可能性。比如，有如下两个句子
+\item 预测输入句子的可能性。比如，有如下两个句子：
 \vspace{0.8em}
 \hspace{10em} The boy caught the cat.
@@ -772,7 +784,7 @@ c_{\textrm{KN}}(\cdot) = \left\{\begin{array}{ll}
 \vspace{0.8em}
-可以利用语言模型对其进行打分，即计算句子的生成概率，之后把语言模型的得分作为判断句子合理性的依据。显然，在这个例子中，第一句的语言模型得分更高，因此句子也更加合理。
+\noindent 可以利用语言模型对其进行打分，即计算句子的生成概率，之后把语言模型的得分作为判断句子合理性的依据。显然，在这个例子中，第一句的语言模型得分更高，因此句子也更加合理。
 \vspace{0.5em}
 \item 预测可能生成的单词或者单词序列。比如，对于如下的例子
@@ -781,14 +793,14 @@ c_{\textrm{KN}}(\cdot) = \left\{\begin{array}{ll}
 \hspace{10em} The boy caught \ \ \underline{\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ }
 \vspace{0.8em}
-下划线的部分是缺失的内容，现在要将缺失的部分生成出来。理论上，所有可能的单词串都可以构成缺失部分的内容。这时可以使用语言模型得到所有可能词串构成句子的概率，之后找到概率最高的词串填入下划线处。
+\noindent 下划线的部分是缺失的内容，现在要将缺失的部分生成出来。理论上，所有可能的单词串都可以构成缺失部分的内容。这时可以使用语言模型得到所有可能词串构成句子的概率，之后找到概率最高的词串填入下划线处。
 \vspace{0.5em}
 \end{itemize}
 \parinterval 从词序列建模的角度看，这两类预测问题本质上是一样的。因为，它们都在使用语言模型对词序列进行概率评估。但是，从实现上看，词序列的生成问题更难。因为，它不仅要对所有可能的词序列进行打分，同时要“找到”最好的词序列。由于潜在的词序列不计其数，因此这个“找”最优词序列的过程并不简单。
-\parinterval 实际上，生成最优词序列的问题也对应着自然语言处理中的一大类问题\ \dash\ {\small\bfnew{序列生成}}\index{序列生成}（Sequence Generation）\index{Sequence Generation}。机器翻译就是一个非常典型的序列生成问题：在机器翻译任务中，需要根据源语言词序列生成与之相对应的目标语言词序列。但是语言模型本身并不能“制造”单词序列的。因此，严格地说，序列生成问题的本质并非让语言模型凭空“生成”序列，而是使用语言模型在所有候选的单词序列中“找出”最佳序列。这个过程对应着经典的{\small\bfnew{搜索问题}}\index{搜索问题}（Search Problem）\index{Search Problem}。下面将着重介绍序列生成背后的建模方法，以及在序列生成里常用的搜索技术。
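+\parinterval In fact, generating the best word sequence corresponds to a broad class of problems in natural language processing\ \dash\ {\small\bfnew{sequence generation}}\index{序列生成}\index{Sequence Generation}. Machine translation is a typical sequence generation problem: given a source-language word sequence, the task is to generate the corresponding target-language word sequence. A language model by itself, however, cannot “create” word sequences. Strictly speaking, then, sequence generation does not ask the language model to “generate” a sequence out of thin air, but to use it to “find” the best sequence among all candidate word sequences. This process corresponds to the classic {\small\bfnew{search problem}}\index{搜索问题}\index{Search Problem}. The following sections focus on the modeling behind sequence generation and on the search techniques commonly used for it.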
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -819,7 +831,7 @@ c_{\textrm{KN}}(\cdot) = \left\{\begin{array}{ll}
 \parinterval 在这种序列生成方式的基础上，实现搜索通常有两种方法\ \dash\ 深度优先遍历和宽度优先遍历\upcite{DBLP:books/mg/CormenLR89}。在深度优先遍历中，每次从词表中可重复地选择一个单词，然后从左至右地生成序列，直到<eos>被选择，此时一个完整的单词序列被生成出来。然后从<eos>回退到上一个单词，选择之前词表中未被选择到的候选单词代替<eos>，并继续挑选下一个单词直到<eos>被选到，如果上一个单词的所有可能都被枚举过，那么回退到上上一个单词继续枚举，直到回退到<sos>，这时候枚举结束。在宽度优先遍历中，每次不是只选择一个单词，而是枚举所有单词。
-有一个简单的例子。假设词表只含两个单词$\{a, b\}$，从<sos>开始枚举所有候选，有三种可能：
+\parinterval 有一个简单的例子。假设词表只含两个单词$\{a, b\}$，从<sos>开始枚举所有候选，有三种可能：
 \begin{eqnarray}
 \{\text{<sos>}\ a, \text{<sos>}\ b, \text{<sos>}\ \text{<eos>}\} \nonumber
 \end{eqnarray}
@@ -866,7 +878,7 @@ c_{\textrm{KN}}(\cdot) = \left\{\begin{array}{ll}
 }\end{table}
 %------------------------------------------------------
-\parinterval 那么是否有比枚举策略更高效的方法呢？答案是肯定的。一种直观的方法是将搜索的过程表示成树型结构，称为解空间树。它包含了搜索过程中可生成的全部序列。该树的根节点恒为<sos>，代表序列均从<sos> 开始。该树结构中非叶子节点的兄弟节点有$|V|$个，由词表和结束符号<eos>构成。从图\ref{fig:2-14}可以看到，对于一个最大长度为4的序列的搜索过程，生成某个单词序列的过程实际上就是访问解空间树中从根节点<sos> 开始一直到叶子节点<eos>结束的某条路径，而这条的路径上节点按顺序组成了一段独特的单词序列。此时对所有可能单词序列的枚举就变成了对解空间树的遍历。并且枚举的过程与语言模型打分的过程也是一致的，每枚举一个词$i$也就是在上图选择$w_i$一列的一个节点，语言模型就可以为当前的树节点$w_i$给出一个分值，即$\funp{P}(w_i | w_1 w_2 \ldots w_{i-1})$。对于$n$-gram语言模型，这个分值$\funp{P}(w_i | w_1 w_2 \ldots w_{i-1})=\funp{P}(w_i | w_{i-n+1} \ldots w_{i-1})$
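+\parinterval Is there a more efficient approach than exhaustive enumeration? The answer is yes. An intuitive method is to represent the search process as a tree, called the solution space tree, which contains every sequence that can be generated during the search. The root of the tree is always <sos>, meaning that every sequence starts from <sos>. The children of each non-leaf node form a group of $|V|+1$ sibling nodes, made up of the vocabulary words and the end symbol <eos>. As Figure \ref{fig:2-14} shows, for a search over sequences of maximum length 4, generating a word sequence amounts to following a path in the solution space tree from the root <sos> down to a leaf <eos>, and the nodes along this path, taken in order, form one distinct word sequence. Enumerating all possible word sequences thus becomes a traversal of the solution space tree. The enumeration also lines up with language model scoring: enumerating the $i$-th word means selecting one node in the $w_i$ column of the figure, and the language model can assign the current tree node $w_i$ a score, namely $\funp{P}(w_i | w_1 w_2 \ldots w_{i-1})$. For an $n$-gram language model, this score can be written as $\funp{P}(w_i | w_1 w_2 \ldots w_{i-1})=\funp{P}(w_i | w_{i-n+1} \ldots w_{i-1})$.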
 %----------------------------------------------
 \begin{figure}[htp]
@@ -884,7 +896,7 @@ c_{\textrm{KN}}(\cdot) = \left\{\begin{array}{ll}
 \label{eq:2-43}
 \end{eqnarray}
-通常，$\textrm{score}(\cdot)$也被称作{\small\bfnew{模型得分}}\index{模型得分}（Model Score\index{Model Score}）。如图\ref{fig:2-15}所示，可知红线所示单词序列“<sos>\ I\ agree\ <eos>”的模型得分为：
+\parinterval 通常，$\textrm{score}(\cdot)$也被称作{\small\bfnew{模型得分}}\index{模型得分}（Model Score\index{Model Score}）。如图\ref{fig:2-15}所示，可知红线所示单词序列“<sos>\ I\ agree\ <eos>”的模型得分为：
 \begin{eqnarray}
 &&\textrm{score(<sos>\ I\ agree\ <eos>)}   \nonumber \\
 & = & \log \funp{P}(\textrm{<sos>}) + \log \funp{P}(\textrm{I} | \textrm{<sos>}) + \log \funp{P}(\textrm{agree} | \textrm{<sos>\ I}) + \log \funp{P}(\textrm{<eos>}| \textrm{<sos>\ I\ agree})   \nonumber \\
@@ -910,13 +922,13 @@ c_{\textrm{KN}}(\cdot) = \left\{\begin{array}{ll}
 \subsection{经典搜索}
-人工智能领域有很多经典的搜索策略，这里将对无信息搜索和启发性搜索进行简要介绍。
+\parinterval 人工智能领域有很多经典的搜索策略，这里将对无信息搜索和启发性搜索进行简要介绍。
 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
 %----------------------------------------------------------------------------------------
-\subsubsection{1.无信息搜索}
+\subsubsection{1. 无信息搜索}
 \parinterval 在解空间树中，在每次对一个节点进行扩展的时候，可以借助语言模型计算当前节点的权重。因此很自然的一个想法是：使用权重信息可以帮助系统更快地找到合适的解。
@@ -950,7 +962,7 @@ c_{\textrm{KN}}(\cdot) = \left\{\begin{array}{ll}
 %    NEW SUBSUB-SECTION
 %----------------------------------------------------------------------------------------
-\subsubsection{2.启发式搜索}
+\subsubsection{2. 启发式搜索}
 \parinterval 在搜索问题中，一个单词序列的生成可以分为两部分：已生成部分和未生成部分。既然最终目标是使得一个完整的单词序列得分最高，那么关注未生成部分的得分也许能为搜索策略的改进提供思路。
@@ -974,7 +986,7 @@ c_{\textrm{KN}}(\cdot) = \left\{\begin{array}{ll}
 %    NEW SUBSUB-SECTION
 %----------------------------------------------------------------------------------------
-\subsubsection{1.贪婪搜索}
+\subsubsection{1. 贪婪搜索}
 \parinterval {\small\bfnew{贪婪搜索}}\index{贪婪搜索}（Greedy Search）\index{Greedy Search}基于一种思想：当一个问题可以拆分为多个子问题时，如果一直选择子问题的最优解就能得到原问题的最优解，那么就可以不必遍历原始的解空间，而是使用这种“贪婪”的策略进行搜索。基于这种思想，它每次都优先挑选得分最高的词进行扩展，这一点与改进过的深度优先搜索类似。但是它们的区别在于，贪婪搜索在搜索到一个完整的序列，也就是搜索到<eos>即停止，而改进的深度优先搜索会遍历整个解空间。因此贪婪搜索非常高效，其时间和空间复杂度仅为$O(m)$，这里$m$为单词序列的长度。
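Greedy search as described above can be sketched with a toy 2-gram model. The function name `greedy_search`, the `max_len` cut-off, and the probability table are hypothetical; the numbers are made up purely for illustration.

```python
def greedy_search(next_word_probs, max_len=10):
    """Pick the highest-probability next word at each step until <eos>."""
    sequence = ["<sos>"]
    while sequence[-1] != "<eos>" and len(sequence) < max_len:
        # 2-gram model: the candidates depend on the last word only.
        candidates = next_word_probs[sequence[-1]]
        sequence.append(max(candidates, key=candidates.get))
    return sequence

# Hypothetical 2-gram table P(next | prev).
table = {
    "<sos>": {"I": 0.6, "agree": 0.3, "<eos>": 0.1},
    "I": {"agree": 0.7, "I": 0.1, "<eos>": 0.2},
    "agree": {"<eos>": 0.8, "I": 0.1, "agree": 0.1},
}
result = greedy_search(table)
```

Only one word is kept per position, so the number of steps grows linearly with the sequence length, but nothing guarantees that the resulting sequence has the highest overall score.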
@@ -993,7 +1005,7 @@ c_{\textrm{KN}}(\cdot) = \left\{\begin{array}{ll}
 %    NEW SUBSUB-SECTION
 %----------------------------------------------------------------------------------------
-\subsubsection{2.束搜索}
+\subsubsection{2. 束搜索}
 \parinterval 贪婪搜索会产生质量比较差的解是由于当前单词的错误选择造成的。既然每次只挑选一个单词可能会产生错误，那么可以通过同时考虑更多候选单词来缓解这个问题，也就是对于一个位置，可以同时将其扩展到若干个节点。这样就扩大了搜索的范围，进而使得优质解被找到的概率增大。
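The idea of keeping several candidates per position can be sketched as a simple beam search. This is a simplified illustration rather than a production decoder: the function name `beam_search`, the parameter `beam_size`, and the toy table are assumptions, scores are sums of log-probabilities, and no length normalization is applied.

```python
import math

def beam_search(next_word_probs, beam_size=2, max_len=10):
    """Keep the beam_size best partial sequences (by log-probability) per step."""
    beams = [(0.0, ["<sos>"])]          # (score, sequence)
    finished = []
    for _ in range(max_len):
        expanded = []
        for score, seq in beams:
            for word, p in next_word_probs[seq[-1]].items():
                expanded.append((score + math.log(p), seq + [word]))
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = []
        for score, seq in expanded[:beam_size]:
            # Sequences ending in <eos> are complete and leave the beam.
            (finished if seq[-1] == "<eos>" else beams).append((score, seq))
        if not beams:
            break
    return max(finished, key=lambda item: item[0])[1] if finished else None

# Hypothetical 2-gram table in which the locally best first word is a trap.
toy_table = {
    "<sos>": {"a": 0.6, "b": 0.4},
    "a": {"<eos>": 0.4, "a": 0.3, "b": 0.3},
    "b": {"<eos>": 0.9, "a": 0.05, "b": 0.05},
}
best_beam2 = beam_search(toy_table, beam_size=2)
best_beam1 = beam_search(toy_table, beam_size=1)
```

With `beam_size=1` the procedure reduces to greedy search and returns the sequence with probability $0.6 \times 0.4 = 0.24$, while `beam_size=2` keeps the second candidate alive and finds the sequence with probability $0.4 \times 0.9 = 0.36$.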
@@ -1021,20 +1033,20 @@ c_{\textrm{KN}}(\cdot) = \left\{\begin{array}{ll}
 \sectionnewpage
 \section{小结及深入阅读} \label{sec2:summary}
-\parinterval 本章重点介绍了如何对自然语言处理问题进行统计建模，并从数据中自动学习统计模型的参数，最终使用学习到的模型对新的问题进行处理。之后，将这种思想应用到语言建模任务中，该任务与机器翻译有着紧密的联系。通过系统化的建模，可以发现：经过适当的假设和化简，统计模型可以很好的描述复杂的自然语言处理问题。进一步，本章对面向语言模型预测的搜索方法进行了介绍。相关概念和方法也会在后续章节的内容中被广泛使用。
+\parinterval 本章重点介绍了如何对自然语言处理问题进行统计建模，并从数据中自动学习统计模型的参数，最终使用学习到的模型对新的问题进行处理。之后，将这种思想应用到语言建模任务中，该任务与机器翻译有着紧密的联系。通过系统化的建模，可以发现：经过适当的假设和化简，统计模型可以很好地描述复杂的自然语言处理问题。进一步，本章对面向语言模型预测的搜索方法进行了介绍。相关概念和方法也会在后续章节的内容中被广泛使用。
 \parinterval 此外，有几方面内容，读者可以继续深入了解：
 \begin{adjustwidth}{1em}{}
 \begin{itemize}
 \vspace{0.5em}
-\item 在$n$-gram语言模型中，由于语料中往往存在大量的低频词以及未登录词，模型会产生不合理的概率预测结果。因此本章介绍了三种平滑方法，以解决上述问题。实际上，平滑方法是语言建模中的重要研究方向。除了上述三种方法之外，还有Jelinek–Mercer平滑\upcite{jelinek1980interpolated}、Katz 平滑\upcite{katz1987estimation}以及Witten–Bell平滑等等\upcite{bell1990text,witten1991the}。 相关工作也对这些平滑方法进行了详细对比\upcite{chen1999empirical,goodman2001a}。
+\item 在$n$-gram语言模型中，由于语料中往往存在大量的低频词以及未登录词，模型会产生不合理的概率预测结果。因此本章介绍了三种平滑方法，以解决上述问题。实际上，平滑方法是语言建模中的重要研究方向。除了上述三种方法之外，还有Jelinek–Mercer平滑\upcite{jelinek1980interpolated}、Katz 平滑\upcite{katz1987estimation}以及Witten–Bell平滑等等\upcite{bell1990text,witten1991the}。相关工作也对这些平滑方法进行了详细对比\upcite{chen1999empirical,goodman2001a}。
 \vspace{0.5em}
 \item 除了平滑方法，也有很多工作对$n$-gram语言模型进行改进。比如，对于形态学丰富的语言，可以考虑对单词的形态学变化进行建模。这类语言模型在一些机器翻译系统中也体现出了很好的潜力\upcite{kirchhoff2005improved,sarikaya2007joint,koehn2007factored}。此外，如何使用超大规模数据进行语言模型训练也是备受关注的研究方向。比如，有研究者探索了对超大语言模型进行压缩和存储的方法\upcite{federico2007efficient,federico2006how,heafield2011kenlm}。另一个有趣的方向是，利用随机存储算法对大规模语言模型进行有效存储\upcite{talbot2007smoothed,talbot2007randomised}，比如，在语言模型中使用Bloom\ Filter等随机存储的数据结构。
 \vspace{0.5em}
 \item 本章更多地关注了语言模型的基本问题和求解思路，但是基于$n$-gram的方法并不是语言建模的唯一方法。从现在自然语言处理的前沿看，端到端的深度学习方法在很多任务中都取得了领先的性能。语言模型同样可以使用这些方法\upcite{jing2019a}，而且在近些年取得了巨大成功。例如，最早提出的前馈神经语言模型\upcite{bengio2003a}和后来的基于循环单元的语言模型\upcite{mikolov2010recurrent}、基于长短期记忆单元的语言模型\upcite{sundermeyer2012lstm}以及现在非常流行的Transformer\upcite{vaswani2017attention}。 关于神经语言模型的内容，会在{\chapternine}进行进一步介绍。
 \vspace{0.5em}
-\item 最后，本章结合语言模型的序列生成任务对搜索技术进行了介绍。类似地，机器翻译任务也需要从大量的翻译候选中快速寻找最优译文。因此在机器翻译任务中也使用了搜索方法，这个过程通常被称作{\small\bfnew{解码}}\index{解码}（Decoding）\index{Decoding}。例如，有研究者在基于词的翻译模型中尝试使用启发式搜索\upcite{DBLP:conf/acl/OchUN01,DBLP:conf/acl/WangW97,tillmann1997a}以及贪婪搜索方法\upcite{germann2001fast}\upcite{germann2003greedy}，也有研究者研究基于短语的栈解码方法\upcite{Koehn2007Moses,DBLP:conf/amta/Koehn04}。此外，解码方法还包括有限状态机解码\upcite{bangalore2001a}\upcite{DBLP:journals/mt/BangaloreR02}以及基于语言学约束的解码\upcite{venugopal2007an,zollmann2007the,liu2006tree,galley2006scalable,chiang2005a}。相关内容将在{\chaptereight} 和{\chapterfourteen} 进行介绍。
+\item 最后，本章结合语言模型的序列生成任务对搜索技术进行了介绍。类似地，机器翻译任务也需要从大量的翻译候选中快速寻找最优译文。因此在机器翻译任务中也使用了搜索方法，这个过程通常被称作{\small\bfnew{解码}}\index{解码}（Decoding）\index{Decoding}。例如，有研究者在基于词的翻译模型中尝试使用启发式搜索\upcite{DBLP:conf/acl/OchUN01,DBLP:conf/acl/WangW97,tillmann1997a}以及贪婪搜索方法\upcite{germann2001fast}\upcite{germann2003greedy}，也有研究者探索基于短语的栈解码方法\upcite{Koehn2007Moses,DBLP:conf/amta/Koehn04}。此外，解码方法还包括有限状态机解码\upcite{bangalore2001a}\upcite{DBLP:journals/mt/BangaloreR02}以及基于语言学约束的解码\upcite{venugopal2007an,zollmann2007the,liu2006tree,galley2006scalable,chiang2005a}。相关内容将在{\chaptereight}和{\chapterfourteen}进行介绍。
 \vspace{0.5em}
 \end{itemize}
 \end{adjustwidth}
--- a/Chapter3/Figures/figure-a-simple-pre-processing-process.tex
+++ b/Chapter3/Figures/figure-a-simple-pre-processing-process.tex
@@ -13,7 +13,7 @@
    \node [ugreen] (input) at (0,0) {猫喜欢吃鱼};
    \node [draw,thick,anchor=west,ublue] (preprocessing) at ([xshift=1em]input.east) {分词系统};
    \node [ugreen,anchor=west] (mtinput) at ([xshift=1em]preprocessing.east) {猫/喜欢/吃/鱼};
-    \node [draw,thick,anchor=west,ublue] (smt) at ([xshift=1em]mtinput.east) {MT系统};
+    \node [draw,thick,anchor=west,ublue] (smt) at ([xshift=1em]mtinput.east) {机器翻译系统};
    \node [anchor=west] (mtoutput) at ([xshift=1em]smt.east) {...};
    \draw [->,thick,ublue] ([xshift=0.1em]input.east) -- ([xshift=-0.2em]preprocessing.west);
    \draw [->,thick,ublue] ([xshift=0.2em]preprocessing.east) -- ([xshift=-0.1em]mtinput.west);

--- a/Chapter3/Figures/figure-crf-to-deal-with-sequence-problems.tex
+++ b/Chapter3/Figures/figure-crf-to-deal-with-sequence-problems.tex
@@ -5,11 +5,11 @@
 		\node[anchor=west,hide](y2)at([xshift=2em]y1.east){$y_2$};
 		\node[anchor=west,hide](y3)at([xshift=2em]y2.east){$y_3$};
 		\node[anchor=west,line width=1pt,inner sep=2pt,minimum size=2em](dots)at([xshift=2em]y3.east){$\cdots$};
-		\node[anchor=west,hide](yn-1)at([xshift=2em]dots.east){$y_{n-1}$};
+		\node[anchor=west,hide](yn-1)at([xshift=2em]dots.east){$y_{m-1}$};
-		\node[anchor=west,hide](yn)at([xshift=2em]yn-1.east){$y_n$};
+		\node[anchor=west,hide](yn)at([xshift=2em]yn-1.east){$y_m$};
-		\node[anchor=north,draw,line width=1pt,inner sep=2pt,fill=red!30,minimum height=2em,minimum width=12em](see)at ([yshift=-3em,xshift=2em]y3.south){$\mathbf{X}=(x_1,x_2,\ldots,x_{n-1},x_n)$};
+		\node[anchor=north,draw,line width=1pt,inner sep=2pt,fill=red!30,minimum height=2em,minimum width=12em](see)at ([yshift=-3em,xshift=2em]y3.south){${X}=(x_1,x_2,\ldots,x_{m-1},x_m)$};
-		\node[anchor=south,font=\footnotesize] at ([yshift=1em,xshift=2em]y3.north){待预测的隐藏状态序列};
+		\node[anchor=south,font=\footnotesize] at ([yshift=1em,xshift=2em]y3.north){待预测的隐含状态序列};
 		\node[anchor=north,font=\footnotesize] at ([yshift=-1em]see.south){可见状态序列};
 		\draw[line width=1pt] (y1.east) -- (y2.west);

--- a/Chapter3/Figures/figure-evaluation-of-probability-for-grammar.tex
+++ b/Chapter3/Figures/figure-evaluation-of-probability-for-grammar.tex
@@ -56,18 +56,18 @@
 \node[rectangle,draw=ublue, inner sep=0.2em] [fit = (treebanklabel) (t1n1) (t2w1) (t2wn)] (treebank) {};
 \end{pgfonlayer}
-\node [anchor=north west] (math1) at ([xshift=2em]treebank.north east) {P(VP $\to$ VV NN)};
+\node [anchor=north west] (math1) at ([xshift=2em]treebank.north east) {$\funp{P}$(VP $\to$ VV NN)};
-\node [anchor=north west] (math1part2) at ([xshift=-1em,yshift=0.2em]math1.south west) {$=\frac{\textrm{``VP''和``VV NN''同时出现的次数=1}}{\textrm{``VP''出现的次数}=4}$};
+\node [anchor=north west] (math1part2) at ([xshift=-1em,yshift=0.2em]math1.south west) {$=\frac{\textrm{VP和VV NN同时出现的次数=1}}{\textrm{VP出现的次数}=4}$};
 \node [anchor=north west] (math1part3) at ([yshift=0.2em]math1part2.south west){$=\frac{1}{4}$};
-\node [anchor=north west] (math2) at ([yshift=-6em]math1.north west) {P(NP $\to$ NN)};
+\node [anchor=north west] (math2) at ([yshift=-6em]math1.north west) {$\funp{P}$(NP $\to$ NN)};
-\node [anchor=north west] (math2part2) at ([xshift=-1em,yshift=0.2em]math2.south west) {$=\frac{\textrm{``NP''和``NN''同时出现的次数=2}}{\textrm{``NP''出现的次数}=3}$};
+\node [anchor=north west] (math2part2) at ([xshift=-1em,yshift=0.2em]math2.south west) {$=\frac{\textrm{NP和NN同时出现的次数=2}}{\textrm{NP出现的次数}=3}$};
 \node [anchor=north west] (math2part3) at ([yshift=0.2em]math2part2.south west){$=\frac{2}{3}$};
-\node [anchor=north west] (math3) at ([yshift=-6em]math2.north west) {P(IP $\to$ NP NP)};
+\node [anchor=north west] (math3) at ([yshift=-6em]math2.north west) {$\funp{P}$(IP $\to$ NP NP)};
-\node [anchor=north west] (math3part2) at ([xshift=-1em,yshift=0.2em]math3.south west) {$=\frac{\textrm{``IP''和``NP NP''同时出现的次数=0}}{\textrm{``IP''出现的次数}=3}$};
+\node [anchor=north west] (math3part2) at ([xshift=-1em,yshift=0.2em]math3.south west) {$=\frac{\textrm{IP和NP NP同时出现的次数=0}}{\textrm{IP出现的次数}=3}$};
 \node [anchor=north west] (math3part3) at ([yshift=0.2em]math3part2.south west){$=\frac{0}{3}$};
 \begin{pgfonlayer}{background}

--- a/Chapter3/Figures/figure-example-of-hmm.tex
+++ b/Chapter3/Figures/figure-example-of-hmm.tex
@@ -2,15 +2,15 @@
 	\tikzstyle{unit} = [draw,circle,line width=0.8pt,align=center,fill=green!30,minimum size=1em]
 		\node[minimum width=3em,minimum height=1.8em] (o) at (0,0){};
-		\node[anchor=north,inner sep=1pt,font=\footnotesize] (state_A) at ([xshift=-0em,yshift=-1em]o.south){隐藏状态A};
+		\node[anchor=north,inner sep=1pt,font=\footnotesize] (state_A) at ([xshift=-0em,yshift=-1em]o.south){隐含状态$A$};
-		\node[anchor=north,inner sep=1pt,font=\footnotesize] (state_B) at ([yshift=-1.6em]state_A.south){隐藏状态B};
+		\node[anchor=north,inner sep=1pt,font=\footnotesize] (state_B) at ([yshift=-1.6em]state_A.south){隐含状态$B$};
-		\node[anchor=north,inner sep=1pt,font=\footnotesize] (state_C) at ([yshift=-1.6em]state_B.south){隐藏状态C};
+		\node[anchor=north,inner sep=1pt,font=\footnotesize] (state_C) at ([yshift=-1.6em]state_B.south){隐含状态$C$};
-		\node[anchor=north,inner sep=1pt,font=\footnotesize] (state_D) at ([yshift=-1.6em]state_C.south){隐藏状态D};
+		\node[anchor=north,inner sep=1pt,font=\footnotesize] (state_D) at ([yshift=-1.6em]state_C.south){隐含状态$D$};
-		\node[anchor=west,inner sep=1pt,font=\footnotesize] (c1) at ([yshift=0.2em,xshift=2em]o.east){T};
+		\node[anchor=west,inner sep=1pt,font=\footnotesize] (c1) at ([yshift=0.2em,xshift=2em]o.east){$T$};
-		\node[anchor=west,inner sep=1pt,font=\footnotesize] (c2) at ([xshift=5em]c1.east){F};
+		\node[anchor=west,inner sep=1pt,font=\footnotesize] (c2) at ([xshift=5em]c1.east){$F$};
-		\node[anchor=west,inner sep=1pt,font=\footnotesize] (c3) at ([xshift=5em]c2.east){F};
+		\node[anchor=west,inner sep=1pt,font=\footnotesize] (c3) at ([xshift=5em]c2.east){$F$};
-		\node[anchor=west,inner sep=1pt,font=\footnotesize] (c4) at ([xshift=5em]c3.east){T};
+		\node[anchor=west,inner sep=1pt,font=\footnotesize] (c4) at ([xshift=5em]c3.east){$T$};
 		\node[anchor=south,font=\scriptsize] (cl1) at (c1.north) {时刻1};
 		\node[anchor=south,font=\scriptsize] (cl2) at (c2.north) {时刻2};
 		\node[anchor=south,font=\scriptsize] (cl3) at (c3.north) {时刻3};

--- a/Chapter3/Figures/figure-examples-of-chinese-word-segmentation-based-on-1-gram-model.tex
+++ b/Chapter3/Figures/figure-examples-of-chinese-word-segmentation-based-on-1-gram-model.tex
@@ -19,7 +19,7 @@
 \end{pgfonlayer}
 }
-\node [anchor=west,ugreen] (P) at ([xshift=5.2em,yshift=-0.8em]corpus.east){\large{\funp{P}($\cdot$)}};
+\node [anchor=west,ugreen] (P) at ([xshift=5.2em,yshift=-0.8em]corpus.east){\large{$\funp{P}(\cdot)$}};
 \node [anchor=south] (modellabel) at (P.north) {{\color{ublue} {\scriptsize \textbf{统计模型}}}};
 \begin{pgfonlayer}{background}
@@ -41,9 +41,9 @@
 {\footnotesize
 {
-\node [anchor=west] (label1) at (0,6em) {实际上，通过学习我们得到了一个分词模型\funp{P}($\cdot$)，给定任意的分词结果};
+\node [anchor=west] (label1) at (0,6em) {实际上，通过学习我们得到了一个分词模型$\funp{P}(\cdot)$，给定任意的分词结果};
-\node [anchor=north west] (label1part2) at ([yshift=0.5em]label1.south west) {$W=w_1 w_2...w_n$，都能通过\funp{P}($W$)=$\funp{P}(w_1) \cdot \funp{P}(w_2) \cdot ... \cdot \funp{P}(w_n)$ 计算这种分词的\hspace{0.13em} };
+\node [anchor=north west] (label1part2) at ([yshift=0.5em]label1.south west) {$W=w_1 w_2...w_n$，都能通过$\funp{P}(W)=\funp{P}(w_1) \cdot \funp{P}(w_2) \cdot ... \cdot \funp{P}(w_n)$ 计算这种分\hspace{0.13em} };
-\node [anchor=north west] (label1part3) at ([yshift=0.5em]label1part2.south west) {概率值};
+\node [anchor=north west] (label1part3) at ([yshift=0.5em]label1part2.south west) {词的概率值};
 }
 \begin{pgfonlayer}{background}
@@ -96,13 +96,13 @@
 \node [anchor=north west,minimum height=1.6em] (data33) at ([yshift=0.3em]data23.south west) {};
 {
-\node [anchor=north west] (data41) at (data31.south west) {确实/现在/数据/很多};
+\node [anchor=north west] (data41) at (data31.south west) {确实/现在/数据/很/多};
 }
 {
 \node [anchor=north west] (data42) at (data32.south west) {$\funp{P}(\textrm{确实}) \cdot \funp{P}(\textrm{现在}) \cdot \funp{P}(\textrm{数据}) \cdot $};
 }
 {
-\node [anchor=north west] (data43) at ([yshift=-0.2em,xshift=-2em]data33.south west) {\color{red}{\textbf{输出}}};
+\node [anchor=north west] (data43) at ([yshift=-0.4em,xshift=-1.4em]data33.south west) {\color{red}{\textbf{输出}}};
 \draw [->,red,thick] (data43.west)--([xshift=-1em]data43.west);
 }
 {

--- a/Chapter3/Figures/figure-mt-system-as-a-black-box.tex
+++ b/Chapter3/Figures/figure-mt-system-as-a-black-box.tex
@@ -35,7 +35,7 @@
 }
 {
-\node[rectangle,fill=ublue,inner sep=2pt] [fit = (mtinputlabel) (mtoutputlabel) (inputmarking) (outputmarking)] {{\color{white} \textbf{\Large{MT 系统}}}};
+\node[rectangle,fill=ublue,inner sep=2pt] [fit = (mtinputlabel) (mtoutputlabel) (inputmarking) (outputmarking)] {{\color{white} \textbf{\Large{机器翻译系统}}}};
 }

--- a/Chapter3/Figures/figure-mt=language-analysis+translation-engine.tex
+++ b/Chapter3/Figures/figure-mt=language-analysis+translation-engine.tex
@@ -25,7 +25,7 @@
 }
 \end{scope}
-\node [anchor=west,draw,thick,inner sep=3pt,ublue] (mtengine) at ([xshift=1.05in]input.east) {{\scriptsize MT系统}};
+\node [anchor=west,draw,thick,inner sep=3pt,ublue] (mtengine) at ([xshift=1.0in]input.east) {{\scriptsize 机器翻译系统}};
 \begin{scope}[scale=0.8,xshift=3.0in,yshift=-0.87in,level distance=20pt,sibling distance=-3pt,grow'=up]
 {\scriptsize
@@ -49,8 +49,8 @@
 \draw[->,thick] ([xshift=-6pt]output.west) -- ([xshift=2pt]output.west);
 {
-\draw[->,thick] ([xshift=-12pt]mtengine.west) -- ([xshift=-2pt]mtengine.west);
+\draw[->,thick] ([xshift=-10pt]mtengine.west) -- ([xshift=-2pt]mtengine.west);
-\draw[->,thick] ([xshift=2pt]mtengine.east) -- ([xshift=12pt]mtengine.east);
+\draw[->,thick] ([xshift=2pt]mtengine.east) -- ([xshift=10pt]mtengine.east);
 }
 {

--- a/Chapter3/Figures/figure-perspectives-of-expert-ordinary-and-syntactic-parser.tex
+++ b/Chapter3/Figures/figure-perspectives-of-expert-ordinary-and-syntactic-parser.tex
@@ -72,9 +72,9 @@
 \\ 
-语言学家: & 不对 & 对 & 不对  \\ 
+语言学家： & 不对 & 对 & 不对  \\ 
-我们: & 似乎对了 & 比较肯定 & 不太可能 \\ 
+我们： & 似乎对了 & 比较肯定 & 不太可能 \\ 
-分析器: & $\textrm{P}=0.2$ & $\textrm{P}=0.6$ & $\textrm{P}=0.1$
+分析器： & $\funp{P}=0.2$ & $\funp{P}=0.6$ & $\funp{P}=0.1$
 \end{tabular}
 %---------------------------------------------------------------------

--- a/Chapter3/Figures/figure-probability-values-corresponding-to-different-derivations.tex
+++ b/Chapter3/Figures/figure-probability-values-corresponding-to-different-derivations.tex
@@ -76,11 +76,11 @@
 \node [] (d2) at (0em,-10em) {$d_2$};
 \node [] (d3) at (8.5em,-10em) {$d_2$};
-\node [anchor=east] (d1p) at ([xshift=0.4em]d1.west) {$\textrm{P}($};
+\node [anchor=east] (d1p) at ([xshift=0.4em]d1.west) {$\funp{P}($};
 \node [anchor=west] (d1p2) at ([xshift=-0.4em]d1.east) {$)=0.0123$};
-\node [anchor=east] (d2p) at ([xshift=0.4em]d2.west) {$\textrm{P}($};
+\node [anchor=east] (d2p) at ([xshift=0.4em]d2.west) {$\funp{P}($};
 \node [anchor=west] (d2p2) at ([xshift=-0.4em]d2.east) {$)=0.4031$};
-\node [anchor=east] (d3p) at ([xshift=0.4em]d3.west) {$\textrm{P}($};
+\node [anchor=east] (d3p) at ([xshift=0.4em]d3.west) {$\funp{P}($};
 \node [anchor=west] (d3p2) at ([xshift=-0.4em]d3.east) {$)=0.0056$};
 \end{tikzpicture}

--- a/Chapter3/Figures/figure-process-of-statistical-syntax-analysis.tex
+++ b/Chapter3/Figures/figure-process-of-statistical-syntax-analysis.tex
@@ -46,7 +46,7 @@
 \end{pgfonlayer}
 }
-\node [anchor=west,ugreen] (P) at ([xshift=5.95em,yshift=-0.8em]corpus.east){\large{P($\cdot$)}};
+\node [anchor=west,ugreen] (P) at ([xshift=5.95em,yshift=-0.8em]corpus.east){\large{$\funp{P}(\cdot)$}};
 \node [anchor=south] (modellabel) at (P.north) {{\color{ublue} {\scriptsize \textbf{统计分析模型}}}};
 \begin{pgfonlayer}{background}

--- a/Chapter3/Figures/figure-transition-prob-and-launch-prob-in-coin-toss-game.tex
+++ b/Chapter3/Figures/figure-transition-prob-and-launch-prob-in-coin-toss-game.tex
@@ -2,13 +2,13 @@
 	\begin{scope}
 	\node[minimum width=3em,minimum height=1.5em] (o) at (0,0){};
-	\node[anchor=west,inner sep=0pt] (ca) at ([yshift=0.2em,xshift=1.4em]o.east){\scriptsize\bfnew{硬币A}};
+	\node[anchor=west,inner sep=0pt] (ca) at ([yshift=0.2em,xshift=1.4em]o.east){\scriptsize\bfnew{硬币$\boldsymbol A$}};
-	\node[anchor=west,inner sep=0pt] (cb) at ([xshift=1.4em]ca.east){\scriptsize\bfnew{硬币B}};
+	\node[anchor=west,inner sep=0pt] (cb) at ([xshift=1.4em]ca.east){\scriptsize\bfnew{硬币$\boldsymbol B$}};
-	\node[anchor=west,inner sep=0pt] (cc) at ([xshift=1.4em]cb.east){\scriptsize\bfnew{硬币C}};
+	\node[anchor=west,inner sep=0pt] (cc) at ([xshift=1.4em]cb.east){\scriptsize\bfnew{硬币$\boldsymbol C$}};
-	\node[anchor=north,inner sep=0pt] (ra) at ([yshift=-0.6em,xshift=-0.4em]o.south){\scriptsize\bfnew{硬币A}};
+	\node[anchor=north,inner sep=0pt] (ra) at ([yshift=-0.6em,xshift=-0.4em]o.south){\scriptsize\bfnew{硬币$\boldsymbol A$}};
-	\node[anchor=north,inner sep=0pt] (rb) at ([yshift=-1.4em]ra.south){\scriptsize\bfnew{硬币B}};
+	\node[anchor=north,inner sep=0pt] (rb) at ([yshift=-1.4em]ra.south){\scriptsize\bfnew{硬币$\boldsymbol B$}};
-	\node[anchor=north,inner sep=0pt] (rc) at ([yshift=-1.4em]rb.south){\scriptsize\bfnew{硬币C}};
+	\node[anchor=north,inner sep=0pt] (rc) at ([yshift=-1.4em]rb.south){\scriptsize\bfnew{硬币$\boldsymbol C$}};
 	\node[anchor=north,inner sep=0pt] (n11) at ([yshift=-0.9em]ca.south){\small{$\frac{1}{3}$}};
 	\node[anchor=north,inner sep=0pt] (n21) at ([yshift=-1em]n11.south){\small{$\frac{1}{3}$}};
@@ -38,9 +38,9 @@
 	\node[anchor=west,inner sep=0pt] (ca) at ([yshift=0.2em,xshift=1.4em]o.east){\scriptsize\bfnew{正面}};
 	\node[anchor=west,inner sep=0pt] (cb) at ([xshift=1.4em]ca.east){\scriptsize\bfnew{反面}};
-	\node[anchor=north,inner sep=0pt] (ra) at ([yshift=-0.6em,xshift=-0.4em]o.south){\scriptsize\bfnew{硬币A}};
+	\node[anchor=north,inner sep=0pt] (ra) at ([yshift=-0.6em,xshift=-0.4em]o.south){\scriptsize\bfnew{硬币$\boldsymbol A$}};
-	\node[anchor=north,inner sep=0pt] (rb) at ([yshift=-1.5em]ra.south){\scriptsize\bfnew{硬币B}};
+	\node[anchor=north,inner sep=0pt] (rb) at ([yshift=-1.5em]ra.south){\scriptsize\bfnew{硬币$\boldsymbol B$}};
-	\node[anchor=north,inner sep=0pt] (rc) at ([yshift=-1.5em]rb.south){\scriptsize\bfnew{硬币C}};
+	\node[anchor=north,inner sep=0pt] (rc) at ([yshift=-1.5em]rb.south){\scriptsize\bfnew{硬币$\boldsymbol C$}};
 	\node[anchor=north,inner sep=0pt] (n11) at ([yshift=-1.2em]ca.south){\footnotesize{$0.3$}};
 	\node[anchor=north,inner sep=0pt] (n21) at ([yshift=-1.7em]n11.south){\footnotesize{$0.5$}};
@@ -52,11 +52,11 @@
 	\draw[thick] (o.north west) -- (o.south east);
 	\node[anchor=south west] at ([yshift=-1em,xshift=-1.4em]o.45){\tiny{可见}};
-	\node[anchor=north east] at ([yshift=1em,xshift=1em]o.-135){\tiny{隐藏}};
+	\node[anchor=north east] at ([yshift=1em,xshift=1em]o.-135){\tiny{隐含}};
 	\begin{pgfonlayer}{background}
        	\node [rectangle,inner sep=0.5em,rounded corners=2pt,fill=red!10] [fit = (o)(n32)(rc)(cb) ] (box1) {};
    	\end{pgfonlayer}
-   \node[anchor=south] at (box1.north){\scriptsize{发射概率$\funp{P}$(可见状态|隐藏状态)}};
+   \node[anchor=south] at (box1.north){\scriptsize{发射概率$\funp{P}$(可见状态|隐含状态)}};
 	\end{scope}
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter3/Figures/figure-word-segmentation-based-on-statistics.tex
+++ b/Chapter3/Figures/figure-word-segmentation-based-on-statistics.tex
@@ -29,7 +29,7 @@
 }
 {
-\node [anchor=west,ugreen] (P) at ([xshift=5.2em,yshift=-0.8em]corpus.east){\large{\funp{P}($\cdot$)}};
+\node [anchor=west,ugreen] (P) at ([xshift=5.2em,yshift=-0.8em]corpus.east){{$\funp{P}(\cdot)$}};
 \node [anchor=south] (modellabel) at (P.north) {{\color{ublue} {\scriptsize \textbf{统计模型}}}};
 }
@@ -59,16 +59,16 @@
 }
 {
 \node [anchor=north west] (seg4) at ([xshift=-1.0em,yshift=0.4em]seg3.south west) {...};
-\node [anchor=east,ugreen] (p1seg1) at ([xshift=0.5em]seg1.west) {P(};
+\node [anchor=east,ugreen] (p1seg1) at ([xshift=0.5em]seg1.west) {$\funp{P}($};
 \node [anchor=west,ugreen] (p2seg1) at ([xshift=-0.5em]seg1.east) {)=0.1};
-\node [anchor=east,ugreen] (p1seg2) at ([xshift=0.5em]seg2.west) {P(};
+\node [anchor=east,ugreen] (p1seg2) at ([xshift=0.5em]seg2.west) {$\funp{P}($};
 \node [anchor=west,ugreen] (p2seg2) at ([xshift=-0.5em]seg2.east) {)=0.6};
-\node [anchor=east,ugreen] (p1seg3) at ([xshift=0.5em]seg3.west) {P(};
+\node [anchor=east,ugreen] (p1seg3) at ([xshift=0.5em]seg3.west) {$\funp{P}($};
 \node [anchor=west,ugreen] (p2seg3) at ([xshift=-0.5em]seg3.east) {)=0.2};
 }
 {
-\node [anchor=east,draw,dashed,red,thick,minimum width=13em,minimum height=1.4em] (final) at (p2seg2.east) {};
+\node [anchor=east,draw,dashed,red,thick,minimum width=13.2em,minimum height=1.4em] (final) at (p2seg2.east) {};
 \node [anchor=west,red] (finallabel) at ([xshift=3.1em]sentlabel.east) {输出概率最大的结果};
 %\node [anchor=north east,red] (finallabel2) at ([yshift=0.5em]finallabel.south east) {的结果};
 \draw [->,thick,red] ([xshift=0.0em,yshift=-0.5em]final.north east) ..controls +(east:0.2) and +(south:1.0).. ([xshift=2.0em]finallabel.south);

--- a/Chapter3/chapter3.tex
+++ b/Chapter3/chapter3.tex
--- a/Chapter4/Figures/The process of statistical hypothesis testing.tex
+++ b/Chapter4/Figures/The process of statistical hypothesis testing.tex
+\usetikzlibrary{shapes.geometric}
+\begin{tikzpicture}
+	\node[font=\footnotesize] (overall) at (0,0){\small\bfnew{总 \ \ \ \ 体}};
+	\node[anchor=north,font=\footnotesize] (hypo) at ([yshift=-3em]overall.south){某种假设};
+	\coordinate (A) at ([yshift=-3.5em,xshift=-1.6em]overall);
+	\coordinate (B) at ([yshift=0.3em,xshift=0.3em]A);
+	\draw[] (B)  .. controls +(east:1.3em) and +(west:0.3em) .. ([xshift=1.5em,yshift=2em]B) .. controls +(east:0.3em) and +(west:1.3em) .. ([xshift=3em]B);
+	\draw[<->] ([yshift=2.4em]A) -- (A) -- ([xshift=3.7em]A);
+	\begin{pgfonlayer}{background}
+        	\node [draw,thick,rectangle,inner sep=0.5em,rounded corners=2pt,fill=red!15,drop shadow] [fit = (overall)(hypo)] (box1) {};
+    \end{pgfonlayer}
+	\node[draw,fill=yellow!15,thick,anchor=west,font=\footnotesize,align=center,drop shadow](sample) at ([xshift=4em]box1.east){样本\\观察结果};
+	\node[anchor=west,draw,diamond,fill=ugreen!15,drop shadow,aspect=2,font=\scriptsize,align=center,inner sep=1pt,thick] (judge) at ([xshift=3em]sample.east){小概率事件\\发生？};
+	\node[draw,fill=blue!10,thick,drop shadow,anchor=west,font=\footnotesize,align=center](refuse) at ([xshift=6em]judge.north){拒绝原假设};
+	\node[draw,fill=blue!10,thick,drop shadow,anchor=west,font=\footnotesize,align=center](accept) at ([xshift=6em]judge.south){接受原假设};
+	\draw[->,thick] (box1.east) -- node[above,font=\scriptsize]{抽样}(sample.west);
+	\draw[->,thick] (sample.east) -- node[above,font=\scriptsize]{检验}(judge.west);
+	\draw[->,thick] (judge.north) -- node[above,font=\scriptsize]{是}(refuse.west);
+	\draw[->,thick] (judge.south) -- node[below,font=\scriptsize]{否}(accept.west);
+\end{tikzpicture}
--- a/Chapter4/Figures/absolute-match-word-alignment-1.tex
+++ b/Chapter4/Figures/absolute-match-word-alignment-1.tex
 \begin{tikzpicture}[scale=0.5]
-	\tikzstyle{cand} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=green!30]
+	\tikzstyle{cand} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=green!15]
-	\tikzstyle{ref} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=red!30]
+	\tikzstyle{ref} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=red!15]
-		\node[align=center,minimum width=2.4em,minimum height=1.6em,minimum width=6em] (n11) at (0,0){\small\bfnew{Candidate :}};
+		\node[align=center,minimum width=2.4em,minimum height=1.6em,minimum width=6em] (n11) at (0,0){\small{机器译文：}};
 		\node[cand,anchor=west] (n12) at ([xshift=0.0em]n11.east){Can};
 		\node[cand,anchor=west] (n13) at ([xshift=1em]n12.east){I};
 		\node[cand,anchor=west] (n14) at ([xshift=1em]n13.east){have};
-		\node[cand,anchor=west] (n15) at ([xshift=1em]n14.east){this};
+		\node[cand,anchor=west] (n15) at ([xshift=1em]n14.east){it};
 		\node[cand,anchor=west] (n16) at ([xshift=1em]n15.east){like};
 		\node[cand,anchor=west] (n17) at ([xshift=1em]n16.east){he};
-		\node[cand,anchor=west] (n18) at ([xshift=1em]n17.east){do};
+		\node[cand,anchor=west] (n18) at ([xshift=1em]n17.east){?};
-		\node[cand,anchor=west] (n19) at ([xshift=1em]n18.east){it};
-		\node[cand,anchor=west] (n20) at ([xshift=1em]n19.east){?};
-		\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n21) at ([yshift=-6em]n11.south){\small\bfnew{Reference :}};
+		\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n21) at ([yshift=-6em]n11.south){\small{参考答案：}};
 		\node[ref,anchor=west] (n22) at ([xshift=0.0em]n21.east){Can};
 		\node[ref,anchor=west] (n23) at ([xshift=1em]n22.east){I};
 		\node[ref,anchor=west] (n24) at ([xshift=1em]n23.east){eat};
 		\node[ref,anchor=west] (n25) at ([xshift=1em]n24.east){this};
 		\node[ref,anchor=west] (n26) at ([xshift=1em]n25.east){Can};
 		\node[ref,anchor=west] (n27) at ([xshift=1em]n26.east){like};
-		\node[ref,anchor=west] (n28) at ([xshift=1em]n27.east){he};
+		\node[ref,anchor=west] (n28) at ([xshift=1em]n27.east){him};
-		\node[ref,anchor=west] (n29) at ([xshift=1em]n28.east){did};
+		\node[ref,anchor=west] (n29) at ([xshift=1em]n28.east){?};
-		\node[ref,anchor=west] (n60) at ([xshift=1em]n29.east){?};
 		\draw[line width=1.6pt,blue!40] (n12.south) -- (n22.north);
 		\draw[line width=1.6pt,blue!40] (n13.south) -- (n23.north);
-		\draw[line width=1.6pt,blue!40] (n15.south) -- (n25.north);
 		\draw[line width=1.6pt,blue!40] (n16.south) -- (n27.north);
-		\draw[line width=1.6pt,blue!40] (n17.south) -- (n28.north);
+		\draw[line width=1.6pt,blue!40] (n18.south) -- (n29.north);
-		\draw[line width=1.6pt,blue!40] (n20.south) -- (n60.north);
 \end{tikzpicture}
--- a/Chapter4/Figures/absolute-match-word-alignment-2.tex
+++ b/Chapter4/Figures/absolute-match-word-alignment-2.tex
 \begin{tikzpicture}[scale=0.5]
-	\tikzstyle{cand} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=green!30]
+	\tikzstyle{cand} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=green!15]
-	\tikzstyle{ref} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=red!30]
+	\tikzstyle{ref} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=red!15]
-	\node[align=center,minimum width=2.4em,minimum height=1.6em,minimum width=6em] (n11) at (0,0){\small\bfnew{Candidate :}};
+		\node[align=center,minimum width=2.4em,minimum height=1.6em,minimum width=6em] (n11) at (0,0){\small{机器译文：}};
-	\node[cand,anchor=west] (n12) at ([xshift=0.0em]n11.east){Can};
+		\node[cand,anchor=west] (n12) at ([xshift=0.0em]n11.east){Can};
-	\node[cand,anchor=west] (n13) at ([xshift=1em]n12.east){I};
+		\node[cand,anchor=west] (n13) at ([xshift=1em]n12.east){I};
-	\node[cand,anchor=west] (n14) at ([xshift=1em]n13.east){have};
+		\node[cand,anchor=west] (n14) at ([xshift=1em]n13.east){have};
-	\node[cand,anchor=west] (n15) at ([xshift=1em]n14.east){this};
+		\node[cand,anchor=west] (n15) at ([xshift=1em]n14.east){it};
-	\node[cand,anchor=west] (n16) at ([xshift=1em]n15.east){like};
+		\node[cand,anchor=west] (n16) at ([xshift=1em]n15.east){like};
-	\node[cand,anchor=west] (n17) at ([xshift=1em]n16.east){he};
+		\node[cand,anchor=west] (n17) at ([xshift=1em]n16.east){he};
-	\node[cand,anchor=west] (n18) at ([xshift=1em]n17.east){do};
+		\node[cand,anchor=west] (n18) at ([xshift=1em]n17.east){?};
-	\node[cand,anchor=west] (n19) at ([xshift=1em]n18.east){it};
-	\node[cand,anchor=west] (n20) at ([xshift=1em]n19.east){?};
-	\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n21) at ([yshift=-6em]n11.south){\small\bfnew{Reference :}};
+		\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n21) at ([yshift=-6em]n11.south){\small{参考答案：}};
-	\node[ref,anchor=west] (n22) at ([xshift=0.0em]n21.east){Can};
+		\node[ref,anchor=west] (n22) at ([xshift=0.0em]n21.east){Can};
-	\node[ref,anchor=west] (n23) at ([xshift=1em]n22.east){I};
+		\node[ref,anchor=west] (n23) at ([xshift=1em]n22.east){I};
-	\node[ref,anchor=west] (n24) at ([xshift=1em]n23.east){eat};
+		\node[ref,anchor=west] (n24) at ([xshift=1em]n23.east){eat};
-	\node[ref,anchor=west] (n25) at ([xshift=1em]n24.east){this};
+		\node[ref,anchor=west] (n25) at ([xshift=1em]n24.east){this};
-	\node[ref,anchor=west] (n26) at ([xshift=1em]n25.east){Can};
+		\node[ref,anchor=west] (n26) at ([xshift=1em]n25.east){Can};
-	\node[ref,anchor=west] (n27) at ([xshift=1em]n26.east){like};
+		\node[ref,anchor=west] (n27) at ([xshift=1em]n26.east){like};
-	\node[ref,anchor=west] (n28) at ([xshift=1em]n27.east){he};
+		\node[ref,anchor=west] (n28) at ([xshift=1em]n27.east){him};
-	\node[ref,anchor=west] (n29) at ([xshift=1em]n28.east){did};
+		\node[ref,anchor=west] (n29) at ([xshift=1em]n28.east){?};
-	\node[ref,anchor=west] (n30) at ([xshift=1em]n29.east){?};
-	\draw[line width=1.6pt,blue!40] (n12.south) -- (n26.north);
+		\draw[line width=1.6pt,blue!40] (n12.south) -- (n26.north);
-	\draw[line width=1.6pt,blue!40] (n13.south) -- (n23.north);
+		\draw[line width=1.6pt,blue!40] (n13.south) -- (n23.north);
-	\draw[line width=1.6pt,blue!40] (n15.south) -- (n25.north);
+		\draw[line width=1.6pt,blue!40] (n16.south) -- (n27.north);
-	\draw[line width=1.6pt,blue!40] (n16.south) -- (n27.north);
+		\draw[line width=1.6pt,blue!40] (n18.south) -- (n29.north);
-	\draw[line width=1.6pt,blue!40] (n17.south) -- (n28.north);
-	\draw[line width=1.6pt,blue!40] (n20.south) -- (n30.north);
 \end{tikzpicture}
--- a/Chapter4/Figures/determine-final-word-alignment.tex
+++ b/Chapter4/Figures/determine-final-word-alignment.tex
 \definecolor{ugreen}{rgb}{0,0.5,0}
 \begin{tikzpicture}[scale=0.5]
-	\tikzstyle{cand} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=green!30]
+	\tikzstyle{cand} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=green!15]
-	\tikzstyle{ref} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=red!30]
+	\tikzstyle{ref} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=red!15]
-		\node[align=center,minimum width=2.4em,minimum height=1.6em,minimum width=6em] (n11) at (0,0){\small\bfnew{Candidate :}};
+		\node[align=center,minimum width=2.4em,minimum height=1.6em,minimum width=6em] (n11) at (0,0){\small{机器译文：}};
 		\node[cand,anchor=west] (n12) at ([xshift=0.0em]n11.east){Can};
 		\node[cand,anchor=west] (n13) at ([xshift=1em]n12.east){I};
 		\node[cand,anchor=west] (n14) at ([xshift=1em]n13.east){have};
-		\node[cand,anchor=west] (n15) at ([xshift=1em]n14.east){this};
+		\node[cand,anchor=west] (n15) at ([xshift=1em]n14.east){it};
 		\node[cand,anchor=west] (n16) at ([xshift=1em]n15.east){like};
 		\node[cand,anchor=west] (n17) at ([xshift=1em]n16.east){he};
-		\node[cand,anchor=west] (n18) at ([xshift=1em]n17.east){do};
+		\node[cand,anchor=west] (n18) at ([xshift=1em]n17.east){?};
-		\node[cand,anchor=west] (n19) at ([xshift=1em]n18.east){it};
-		\node[cand,anchor=west] (n20) at ([xshift=1em]n19.east){?};
+		\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n21) at ([yshift=-6em]n11.south){\small{参考答案：}};
-		\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n21) at ([yshift=-6em]n11.south){\small\bfnew{Reference :}};
 		\node[ref,anchor=west] (n22) at ([xshift=0.0em]n21.east){Can};
 		\node[ref,anchor=west] (n23) at ([xshift=1em]n22.east){I};
 		\node[ref,anchor=west] (n24) at ([xshift=1em]n23.east){eat};
 		\node[ref,anchor=west] (n25) at ([xshift=1em]n24.east){this};
 		\node[ref,anchor=west] (n26) at ([xshift=1em]n25.east){Can};
 		\node[ref,anchor=west] (n27) at ([xshift=1em]n26.east){like};
-		\node[ref,anchor=west] (n28) at ([xshift=1em]n27.east){he};
+		\node[ref,anchor=west] (n28) at ([xshift=1em]n27.east){him};
-		\node[ref,anchor=west] (n29) at ([xshift=1em]n28.east){did};
+		\node[ref,anchor=west] (n29) at ([xshift=1em]n28.east){?};
-		\node[ref,anchor=west] (n30) at ([xshift=1em]n29.east){?};
 		\draw[line width=1.6pt,blue!40] (n12.south) -- (n22.north);
 		\draw[line width=1.6pt,blue!40] (n13.south) -- (n23.north);
-		\draw[line width=1.6pt,blue!40] (n15.south) -- (n25.north);
 		\draw[line width=1.6pt,blue!40] (n16.south) -- (n27.north);
-		\draw[line width=1.6pt,blue!40] (n17.south) -- (n28.north);
+		\draw[line width=1.6pt,blue!40] (n18.south) -- (n29.north);
-		\draw[line width=1.6pt,blue!40] (n20.south) -- (n30.north);
+		\draw[line width=1.6pt,orange!40] (n17.south) -- (n28.north);
-		\draw[line width=1.6pt,orange!40] (n18.south) -- (n29.north);
 		\draw[line width=1.6pt,ugreen!40](n14.south) -- (n24.north);

--- a/Chapter4/Figures/match-words-with-stem.tex
+++ b/Chapter4/Figures/match-words-with-stem.tex
 \begin{tikzpicture}[scale=0.5]
-	\tikzstyle{cand} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=green!30]
+	\tikzstyle{cand} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=green!15]
-	\tikzstyle{ref} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=red!30]
+	\tikzstyle{ref} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=red!15]
-		\node[align=center,minimum width=2.4em,minimum height=1.6em,minimum width=6em] (n11) at (0,0){\small\bfnew{Candidate :}};
+		\node[align=center,minimum width=2.4em,minimum height=1.6em,minimum width=6em] (n11) at (0,0){\small{机器译文：}};
 		\node[cand,anchor=west] (n12) at ([xshift=0.0em]n11.east){Can};
 		\node[cand,anchor=west] (n13) at ([xshift=1em]n12.east){I};
 		\node[cand,anchor=west] (n14) at ([xshift=1em]n13.east){have};
-		\node[cand,anchor=west] (n15) at ([xshift=1em]n14.east){this};
+		\node[cand,anchor=west] (n15) at ([xshift=1em]n14.east){it};
 		\node[cand,anchor=west] (n16) at ([xshift=1em]n15.east){like};
 		\node[cand,anchor=west] (n17) at ([xshift=1em]n16.east){he};
-		\node[cand,anchor=west] (n18) at ([xshift=1em]n17.east){do};
+		\node[cand,anchor=west] (n18) at ([xshift=1em]n17.east){?};
-		\node[cand,anchor=west] (n19) at ([xshift=1em]n18.east){it};
-		\node[cand,anchor=west] (n20) at ([xshift=1em]n19.east){?};
-		\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n21) at ([yshift=-6em]n11.south){\small\bfnew{Reference :}};
+		\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n21) at ([yshift=-6em]n11.south){\small{参考答案：}};
 		\node[ref,anchor=west] (n22) at ([xshift=0.0em]n21.east){Can};
 		\node[ref,anchor=west] (n23) at ([xshift=1em]n22.east){I};
 		\node[ref,anchor=west] (n24) at ([xshift=1em]n23.east){eat};
 		\node[ref,anchor=west] (n25) at ([xshift=1em]n24.east){this};
 		\node[ref,anchor=west] (n26) at ([xshift=1em]n25.east){Can};
 		\node[ref,anchor=west] (n27) at ([xshift=1em]n26.east){like};
-		\node[ref,anchor=west] (n28) at ([xshift=1em]n27.east){he};
+		\node[ref,anchor=west] (n28) at ([xshift=1em]n27.east){him};
-		\node[ref,anchor=west] (n29) at ([xshift=1em]n28.east){did};
+		\node[ref,anchor=west] (n29) at ([xshift=1em]n28.east){?};
-		\node[ref,anchor=west] (n30) at ([xshift=1em]n29.east){?};
-		\node[align=center,minimum width=2.4em,minimum height=1.6em,minimum width=6em] (n31) at ([yshift=-5em]n21.south){\small\bfnew{Candidate :}};
+		\node[align=center,minimum width=2.4em,minimum height=1.6em,minimum width=6em] (n31) at ([yshift=-5em]n21.south){\small{机器译文：}};
 		\node[cand,anchor=west] (n32) at ([xshift=0.0em]n31.east){Can};
 		\node[cand,anchor=west] (n33) at ([xshift=1em]n32.east){I};
 		\node[cand,anchor=west] (n34) at ([xshift=1em]n33.east){have};
-		\node[cand,anchor=west] (n35) at ([xshift=1em]n34.east){this};
+		\node[cand,anchor=west] (n35) at ([xshift=1em]n34.east){it};
 		\node[cand,anchor=west] (n36) at ([xshift=1em]n35.east){like};
 		\node[cand,anchor=west] (n37) at ([xshift=1em]n36.east){he};
-		\node[cand,anchor=west] (n38) at ([xshift=1em]n37.east){do};
+		\node[cand,anchor=west] (n38) at ([xshift=1em]n37.east){?};
-		\node[cand,anchor=west] (n39) at ([xshift=1em]n38.east){it};
-		\node[cand,anchor=west] (n40) at ([xshift=1em]n39.east){?};
+		\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n41) at ([yshift=-6em]n31.south){\small{参考答案：}};
-		\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n41) at ([yshift=-6em]n31.south){\small\bfnew{Reference :}};
 		\node[ref,anchor=west] (n42) at ([xshift=0.0em]n41.east){Can};
 		\node[ref,anchor=west] (n43) at ([xshift=1em]n42.east){I};
 		\node[ref,anchor=west] (n44) at ([xshift=1em]n43.east){eat};
 		\node[ref,anchor=west] (n45) at ([xshift=1em]n44.east){this};
 		\node[ref,anchor=west] (n46) at ([xshift=1em]n45.east){Can};
 		\node[ref,anchor=west] (n47) at ([xshift=1em]n46.east){like};
-		\node[ref,anchor=west] (n48) at ([xshift=1em]n47.east){he};
+		\node[ref,anchor=west] (n48) at ([xshift=1em]n47.east){him};
-		\node[ref,anchor=west] (n49) at ([xshift=1em]n48.east){did};
+		\node[ref,anchor=west] (n49) at ([xshift=1em]n48.east){?};
-		\node[ref,anchor=west] (n50) at ([xshift=1em]n49.east){?};
 		\draw[line width=1.6pt,blue!40] (n12.south) -- (n22.north);
 		\draw[line width=1.6pt,blue!40] (n13.south) -- (n23.north);
-		\draw[line width=1.6pt,blue!40] (n15.south) -- (n25.north);
 		\draw[line width=1.6pt,blue!40] (n16.south) -- (n27.north);
-		\draw[line width=1.6pt,blue!40] (n17.south) -- (n28.north);
+		\draw[line width=1.6pt,blue!40] (n18.south) -- (n29.north);
-		\draw[line width=1.6pt,blue!40] (n20.south) -- (n30.north);
+		\draw[line width=2pt,orange!40] (n17.south) -- (n28.north);
-		\draw[line width=2pt,orange!40] (n18.south) -- (n29.north);
 		\draw[line width=1.6pt,blue!40] (n32.south) -- (n46.north);
 		\draw[line width=1.6pt,blue!40] (n33.south) -- (n43.north);
-		\draw[line width=1.6pt,blue!40] (n35.south) -- (n45.north);
 		\draw[line width=1.6pt,blue!40] (n36.south) -- (n47.north);
-		\draw[line width=1.6pt,blue!40] (n37.south) -- (n48.north);
+		\draw[line width=1.6pt,blue!40] (n38.south) -- (n49.north);
-		\draw[line width=1.6pt,blue!40] (n40.south) -- (n50.north);
+		\draw[line width=2pt,orange!40] (n37.south) -- (n48.north);
-		\draw[line width=2pt,orange!40] (n38.south) -- (n49.north);
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter4/Figures/schematic-diagram-of-word-level-quality-assessment-task.tex
+++ b/Chapter4/Figures/schematic-diagram-of-word-level-quality-assessment-task.tex
 \definecolor{ugreen}{rgb}{0,0.5,0}
 \begin{tikzpicture}[scale=0.6]
-	\tikzstyle{unit} = [draw,inner sep=3pt,font=\tiny,minimum height=1.2em]
+	\tikzstyle{unit} = [draw,inner sep=3pt,font=\tiny,minimum height=1.2em,drop shadow={shadow xshift=0.1em,shadow yshift=-0.15em}]
-	\tikzstyle{bad_tag} = [fill=red!15,inner sep=1pt,align=center,font=\tiny,text=red]
+	\tikzstyle{bad_tag} = [fill=red!15,inner sep=1pt,align=center,font=\tiny,text=red!80]
-	\tikzstyle{ok_tag} = [fill=ugreen!15,inner sep=1pt,align=center,font=\tiny,text=ugreen]
+	\tikzstyle{ok_tag} = [fill=ugreen!15,inner sep=1pt,align=center,font=\tiny,text=ugreen!80]
 	\coordinate (o) at (0, 0);
 	\node[anchor=west,inner sep=0pt,align=center,font=\scriptsize] (n1_1) at ([yshift=5.5em]o.east){\textbf{Source}};
-	\node[unit,anchor=west,fill=green!20](n1_2) at ([xshift=7.6em]n1_1.east){The};
+	\node[unit,anchor=west,fill=green!20](n1_2) at ([xshift=7.6em]n1_1.east){Draw};
-	\node[unit,anchor=west,fill=green!20](n1_3) at ([xshift=0.8em]n1_2.east){Sharpen};
+	\node[unit,anchor=west,fill=green!20](n1_3) at ([xshift=0.8em]n1_2.east){or};
-	\node[unit,anchor=west,fill=green!20](n1_4) at ([xshift=0.8em]n1_3.east){tool};
+	\node[unit,anchor=west,fill=green!20](n1_4) at ([xshift=0.8em]n1_3.east){select};
-	\node[unit,anchor=west,fill=green!20](n1_5) at ([xshift=0.8em]n1_4.east){sharpens};
+	\node[unit,anchor=west,fill=green!20](n1_5) at ([xshift=0.8em]n1_4.east){a};
-	\node[unit,anchor=west,fill=green!20](n1_6) at ([xshift=0.8em]n1_5.east){areas};
+	\node[unit,anchor=west,fill=green!20](n1_6) at ([xshift=0.8em]n1_5.east){line};
-	\node[unit,anchor=west,fill=green!20](n1_7) at ([xshift=0.8em]n1_6.east){in};
+	\node[unit,anchor=west,fill=green!20](n1_7) at ([xshift=0.8em]n1_6.east){.};
-	\node[unit,anchor=west,fill=green!20](n1_8) at ([xshift=0.8em]n1_7.east){an};
-	\node[unit,anchor=west,fill=green!20](n1_9) at ([xshift=0.8em]n1_8.east){image};
-	\node[unit,anchor=west,fill=green!20](n1_10) at ([xshift=0.8em]n1_9.east){.};
 	\node[anchor=west,inner sep=0pt,align=center,font=\scriptsize] (n2_1) at (o.east){\textbf{PE}};
-	\node[unit,anchor=west,fill=red!20](n2_2) at ([xshift=1em]n2_1.east){Mit};
+	\node[unit,anchor=west,fill=red!20](n2_2) at ([xshift=1em]n2_1.east){Zeichnen};
-	\node[unit,anchor=west,fill=red!20](n2_3) at ([xshift=0.8em]n2_2.east){dem};
+	\node[unit,anchor=west,fill=red!20](n2_3) at ([xshift=0.8em]n2_2.east){oder};
-	\node[unit,anchor=west,fill=red!20](n2_4) at ([xshift=0.8em]n2_3.east){Scharfzeichner};
+	\node[unit,anchor=west,fill=red!20](n2_4) at ([xshift=0.8em]n2_3.east){Sie};
-	\node[unit,anchor=west,fill=red!20](n2_5) at ([xshift=0.8em]n2_4.east){können};
+	\node[unit,anchor=west,fill=red!20](n2_5) at ([xshift=0.8em]n2_4.east){eine};
-	\node[unit,anchor=west,fill=red!20](n2_6) at ([xshift=0.8em]n2_5.east){Sie};
+	\node[unit,anchor=west,fill=red!20](n2_6) at ([xshift=0.8em]n2_5.east){Linie};
-	\node[unit,anchor=west,fill=red!20](n2_7) at ([xshift=0.8em]n2_6.east){einzelne};
+	\node[unit,anchor=west,fill=red!20](n2_7) at ([xshift=0.8em]n2_6.east){,};
-	\node[unit,anchor=west,fill=red!20](n2_8) at ([xshift=0.8em]n2_7.east){Bereiche};
+	\node[unit,anchor=west,fill=red!20](n2_8) at ([xshift=0.8em]n2_7.east){oder};
-	\node[unit,anchor=west,fill=red!20](n2_9) at ([xshift=0.8em]n2_8.east){in};
+	\node[unit,anchor=west,fill=red!20](n2_9) at ([xshift=0.8em]n2_8.east){wählen};
-	\node[unit,anchor=west,fill=red!20](n2_10) at ([xshift=0.8em]n2_9.east){einem};
+	\node[unit,anchor=west,fill=red!20](n2_10) at ([xshift=0.8em]n2_9.east){Sie};
-	\node[unit,anchor=west,fill=red!20](n2_11) at ([xshift=0.8em]n2_10.east){Bild};
+	\node[unit,anchor=west,fill=red!20](n2_11) at ([xshift=0.8em]n2_10.east){eine};
-	\node[unit,anchor=west,fill=red!20](n2_12) at ([xshift=0.8em]n2_11.east){scharfzeichnen};
+	\node[unit,anchor=west,fill=red!20](n2_12) at ([xshift=0.8em]n2_11.east){aus};
 	\node[unit,anchor=west,fill=red!20](n2_13) at ([xshift=0.8em]n2_12.east){.};
 	\node[anchor=west,inner sep=0pt,align=center,font=\scriptsize] (n3_1) at ([yshift=-5.5em]o.east){\textbf{MT}};
-	\node[unit,anchor=west,fill=blue!20](n3_2) at ([xshift=4.7em]n3_1.east){Der};
+	\node[unit,anchor=west,fill=blue!20](n3_2) at ([xshift=4.7em]n3_1.east){Zeichnen};
-	\node[unit,anchor=west,fill=blue!20](n3_3) at ([xshift=0.8em]n3_2.east){Schärfen-Werkezug};
+	\node[unit,anchor=west,fill=blue!20](n3_3) at ([xshift=0.8em]n3_2.east){oder};
-	\node[unit,anchor=west,fill=blue!20](n3_4) at ([xshift=0.8em]n3_3.east){Bereiche};
+	\node[unit,anchor=west,fill=blue!20](n3_4) at ([xshift=0.8em]n3_3.east){wählen};
-	\node[unit,anchor=west,fill=blue!20](n3_5) at ([xshift=0.8em]n3_4.east){in};
+	\node[unit,anchor=west,fill=blue!20](n3_5) at ([xshift=0.8em]n3_4.east){Sie};
-	\node[unit,anchor=west,fill=blue!20](n3_6) at ([xshift=0.8em]n3_5.east){einem};
+	\node[unit,anchor=west,fill=blue!20](n3_6) at ([xshift=0.8em]n3_5.east){eine};
-	\node[unit,anchor=west,fill=blue!20](n3_7) at ([xshift=0.8em]n3_6.east){Bild};
+	\node[unit,anchor=west,fill=blue!20](n3_7) at ([xshift=0.8em]n3_6.east){Linie};
-	\node[unit,anchor=west,fill=blue!20](n3_8) at ([xshift=0.8em]n3_7.east){Schärfer};
+	\node[unit,anchor=west,fill=blue!20](n3_8) at ([xshift=0.8em]n3_7.east){aus};
-	\node[unit,anchor=west,fill=blue!20](n3_9) at ([xshift=0.8em]n3_8.east){erscheint};
+	\node[unit,anchor=west,fill=blue!20](n3_9) at ([xshift=0.8em]n3_8.east){.};
-	\node[unit,anchor=west,fill=blue!20](n3_10) at ([xshift=0.8em]n3_9.east){.};
 	\node[bad_tag,anchor=south] at ([yshift=2pt]n1_2.north){BAD};
 	\node[bad_tag,anchor=south] at ([yshift=2pt]n1_3.north){BAD};
-	\node[bad_tag,anchor=south] at ([yshift=2pt]n1_4.north){BAD};
+	\node[ok_tag,anchor=south] at ([yshift=2pt]n1_4.north){OK};
 	\node[bad_tag,anchor=south] at ([yshift=2pt]n1_5.north){BAD};
-	\node[ok_tag,anchor=south] at ([yshift=2pt]n1_6.north){OK};
+	\node[bad_tag,anchor=south] at ([yshift=2pt]n1_6.north){BAD};
-	\node[ok_tag,anchor=south] at ([yshift=2pt]n1_7.north){OK};
+	\node[ok_tag,anchor=south] (tag1) at ([yshift=2pt]n1_7.north){OK};
-	\node[ok_tag,anchor=south] at ([yshift=2pt]n1_8.north){OK};
-	\node[ok_tag,anchor=south] at ([yshift=2pt]n1_9.north){OK};
-	\node[ok_tag,anchor=south] (tag1) at ([yshift=2pt]n1_10.north){OK};
-	\node[bad_tag,anchor=north] at ([yshift=-2pt]n3_2.south){BAD};
+	\node[ok_tag,anchor=north] at ([yshift=-3pt]n3_2.south){OK};
-	\node[bad_tag,anchor=north] at ([yshift=-2pt]n3_3.south){BAD};
+	\node[ok_tag,anchor=north] at ([yshift=-3pt]n3_3.south){OK};
-	\node[ok_tag,anchor=north] at ([yshift=-2pt]n3_4.south){OK};
+	\node[ok_tag,anchor=north] at ([yshift=-3pt]n3_4.south){OK};
-	\node[ok_tag,anchor=north] at ([yshift=-2pt]n3_5.south){OK};
+	\node[ok_tag,anchor=north] at ([yshift=-3pt]n3_5.south){OK};
-	\node[ok_tag,anchor=north] at ([yshift=-2pt]n3_6.south){OK};
+	\node[ok_tag,anchor=north] at ([yshift=-3pt]n3_6.south){OK};
-	\node[ok_tag,anchor=north] at ([yshift=-2pt]n3_7.south){OK};
+	\node[bad_tag,anchor=north] at ([yshift=-3pt]n3_7.south){BAD};
-	\node[bad_tag,anchor=north] at ([yshift=-2pt]n3_8.south){BAD};
+	\node[ok_tag,anchor=north] at ([yshift=-3pt]n3_8.south){OK};
-	\node[bad_tag,anchor=north] at ([yshift=-2pt]n3_9.south){BAD};
+	\node[ok_tag,anchor=north] (tag2) at ([yshift=-3pt]n3_9.south){OK};
-	\node[ok_tag,anchor=north] (tag2) at ([yshift=-2pt]n3_10.south){OK};
-	\node[bad_tag,anchor=north] (gap_1)at ([xshift=-2em,yshift=-2em]n3_2.south){BAD};
+	\node[ok_tag,anchor=north] (gap_1)at ([xshift=-2.6em,yshift=-2em]n3_2.south){OK};
-	\node[ok_tag,anchor=north] (gap_2)at ([xshift=1.6em,yshift=-2em]n3_2.south){OK};
+	\node[bad_tag,anchor=north] (gap_2)at ([xshift=2.55em,yshift=-2em]n3_2.south){BAD};
-	\node[bad_tag,anchor=north] (gap_3)at ([xshift=4.4em,yshift=-2em]n3_3.south){BAD};
+	\node[ok_tag,anchor=north] (gap_3)at ([xshift=1.85em,yshift=-2em]n3_3.south){OK};
-	\node[ok_tag,anchor=north] (gap_4)at ([xshift=2.5em,yshift=-2em]n3_4.south){OK};
+	\node[ok_tag,anchor=north] (gap_4)at ([xshift=2.3em,yshift=-2em]n3_4.south){OK};
-	\node[ok_tag,anchor=north] (gap_5)at ([xshift=1.3em,yshift=-2em]n3_5.south){OK};
+	\node[ok_tag,anchor=north] (gap_5)at ([xshift=1.5em,yshift=-2em]n3_5.south){OK};
-	\node[ok_tag,anchor=north] (gap_6)at ([xshift=2em,yshift=-2em]n3_6.south){OK};
+	\node[ok_tag,anchor=north] (gap_6)at ([xshift=1.8em,yshift=-2em]n3_6.south){OK};
-	\node[ok_tag,anchor=north] (gap_7)at ([xshift=1.7em,yshift=-2em]n3_7.south){OK};
+	\node[ok_tag,anchor=north] (gap_7)at ([xshift=2.0em,yshift=-2em]n3_7.south){OK};
-	\node[ok_tag,anchor=north] (gap_8)at ([xshift=2.4em,yshift=-2em]n3_8.south){OK};
+	\node[ok_tag,anchor=north] (gap_8)at ([xshift=1.60em,yshift=-2em]n3_8.south){OK};
-	\node[ok_tag,anchor=north] (gap_9)at ([xshift=2.5em,yshift=-2em]n3_9.south){OK};
+	\node[ok_tag,anchor=north] (tag3) at ([xshift=1.7em,yshift=-2em]n3_9.south){OK};
-	\node[ok_tag,anchor=north](tag3) at ([xshift=1.3em,yshift=-2em]n3_10.south){OK};
 	\draw[dash pattern=on 2pt off 1pt,gray,line width=1pt](gap_1.north) -- ([yshift=2em]gap_1.north);
 	\draw[dash pattern=on 2pt off 1pt,gray,line width=1pt](gap_2.north) -- ([yshift=2em]gap_2.north);
@@ -83,29 +74,26 @@
 	\draw[dash pattern=on 2pt off 1pt,gray,line width=1pt](gap_6.north) -- ([yshift=2em]gap_6.north);
 	\draw[dash pattern=on 2pt off 1pt,gray,line width=1pt](gap_7.north) -- ([yshift=2em]gap_7.north);
 	\draw[dash pattern=on 2pt off 1pt,gray,line width=1pt](gap_8.north) -- ([yshift=2em]gap_8.north);
-	\draw[dash pattern=on 2pt off 1pt,gray,line width=1pt](gap_9.north) -- ([yshift=2em]gap_9.north);
 	\draw[dash pattern=on 2pt off 1pt,gray,line width=1pt](tag3.north) -- ([yshift=2em]tag3.north);
-	\draw [line width=1pt](n1_2.south) -- (n2_3.north);
+	\draw [line width=1pt,blue!30](n1_2.south) -- (n2_2.north);
-	\draw [line width=1pt](n1_3.south) -- (n2_4.north);
+	\draw [line width=1pt,blue!30](n1_3.south) -- (n2_3.north);
-	\draw [line width=1pt](n1_4.south) -- (n2_4.north);
+	\draw [line width=1pt,blue!30](n1_4.south) -- (n2_4.north);
-	\draw [line width=1pt](n1_5.south) -- (n2_12.north);
+	\draw [line width=1pt,blue!30](n1_4.south) -- (n2_9.north);
-	\draw [line width=1pt](n1_6.south) -- (n2_8.north);
+	\draw [line width=1pt,blue!30](n1_5.south) -- (n2_5.north);
-	\draw [line width=1pt](n1_7.south) -- (n2_9.north);
+	\draw [line width=1pt,blue!30](n1_6.south) -- (n2_6.north);
-	\draw [line width=1pt](n1_8.south) -- (n2_10.north);
+	\draw [line width=1pt,blue!30](n1_7.south) -- (n2_13.north);
-	\draw [line width=1pt](n1_9.south) -- (n2_11.north);
-	\draw [line width=1pt](n1_10.south) -- (n2_13.north);
+	\draw[line width=1pt,ugreen!60] (n2_2.south) -- (n3_2.north);
+	\draw[line width=1pt,ugreen!60] (n2_3.south) -- (n3_3.north);
-	\draw[line width=1pt,red!60] (n2_3.south) -- (n3_2.north);
+	\draw[line width=1pt,ugreen!60] (n2_4.south) -- (n3_5.north);
-	\draw[line width=1pt,red!60] (n2_4.south) -- (n3_3.north);
+	\draw[line width=1pt,ugreen!60] (n2_5.south) -- (n3_6.north);
-	\draw[line width=1pt,ugreen!60] (n2_8.south) -- (n3_4.north);
+	\draw[line width=1pt,red!60] (n2_6.south) -- (n3_7.north);
-	\draw[line width=1pt,ugreen!60] (n2_9.south) -- (n3_5.north);
+	\draw[line width=1pt,ugreen!60] (n2_9.south) -- (n3_4.north);
-	\draw[line width=1pt,ugreen!60] (n2_10.south) -- (n3_6.north);
+	\draw[line width=1pt,ugreen!60] (n2_12.south) -- (n3_8.north);
-	\draw[line width=1pt,ugreen!60] (n2_11.south) -- (n3_7.north);
+	\draw[line width=1pt,ugreen!60] (n2_13.south) -- (n3_9.north);
-	\draw[line width=1pt,red!60] (n2_12.south) -- (n3_8.north);
-	\draw[line width=1pt,ugreen!60] (n2_13.south) -- (n3_10.north);
-	\node[anchor=west,inner sep=0pt,align=center,font=\scriptsize]  at ([xshift=4.6em]tag1.east){\textbf{Source tags}};
+	\node[anchor=west,inner sep=0pt,align=center,font=\scriptsize](st)  at ([xshift=8em]tag1.east){\textbf{Source tags}};
-	\node[anchor=west,inner sep=0pt,align=center,font=\scriptsize]  at ([xshift=2.6em]tag2.east){\textbf{MT tags}};
+	\node[anchor=west,inner sep=0pt,align=center,font=\scriptsize]  at ([xshift=3.6em]tag2.east){\textbf{MT tags}};
-	\node[anchor=west,inner sep=0pt,align=center,font=\scriptsize]  at ([xshift=1.1em]tag3.east){\textbf{Gap tags}};
+	\node[anchor=west,inner sep=0pt,align=center,font=\scriptsize] (gt) at ([xshift=1.6em]tag3.east){\textbf{Gap tags}};
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter4/Figures/synonym-matching-word-alignment.tex
+++ b/Chapter4/Figures/synonym-matching-word-alignment.tex
 \definecolor{ugreen}{rgb}{0,0.5,0}
 \begin{tikzpicture}[scale=0.5]
-	\tikzstyle{cand} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=green!30]
+	\tikzstyle{cand} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=green!15]
-	\tikzstyle{ref} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=red!30]
+	\tikzstyle{ref} = [draw,line width=1pt,align=center,minimum width=2.6em,minimum height=1.6em,drop shadow={shadow xshift=0.15em},fill=red!15]
-		\node[align=center,minimum width=2.4em,minimum height=1.6em,minimum width=6em] (n11) at (0,0){\small\bfnew{Candidate :}};
+		\node[align=center,minimum width=2.4em,minimum height=1.6em,minimum width=6em] (n11) at (0,0){\small{机器译文：}};
 		\node[cand,anchor=west] (n12) at ([xshift=0.0em]n11.east){Can};
 		\node[cand,anchor=west] (n13) at ([xshift=1em]n12.east){I};
 		\node[cand,anchor=west] (n14) at ([xshift=1em]n13.east){have};
-		\node[cand,anchor=west] (n15) at ([xshift=1em]n14.east){this};
+		\node[cand,anchor=west] (n15) at ([xshift=1em]n14.east){it};
 		\node[cand,anchor=west] (n16) at ([xshift=1em]n15.east){like};
 		\node[cand,anchor=west] (n17) at ([xshift=1em]n16.east){he};
-		\node[cand,anchor=west] (n18) at ([xshift=1em]n17.east){do};
+		\node[cand,anchor=west] (n18) at ([xshift=1em]n17.east){?};
-		\node[cand,anchor=west] (n19) at ([xshift=1em]n18.east){it};
-		\node[cand,anchor=west] (n20) at ([xshift=1em]n19.east){?};
+		\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n21) at ([yshift=-6em]n11.south){\small{参考答案：}};
-		\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n21) at ([yshift=-6em]n11.south){\small\bfnew{Reference :}};
 		\node[ref,anchor=west] (n22) at ([xshift=0.0em]n21.east){Can};
 		\node[ref,anchor=west] (n23) at ([xshift=1em]n22.east){I};
 		\node[ref,anchor=west] (n24) at ([xshift=1em]n23.east){eat};
 		\node[ref,anchor=west] (n25) at ([xshift=1em]n24.east){this};
 		\node[ref,anchor=west] (n26) at ([xshift=1em]n25.east){Can};
 		\node[ref,anchor=west] (n27) at ([xshift=1em]n26.east){like};
-		\node[ref,anchor=west] (n28) at ([xshift=1em]n27.east){he};
+		\node[ref,anchor=west] (n28) at ([xshift=1em]n27.east){him};
-		\node[ref,anchor=west] (n29) at ([xshift=1em]n28.east){did};
+		\node[ref,anchor=west] (n29) at ([xshift=1em]n28.east){?};
-		\node[ref,anchor=west] (n30) at ([xshift=1em]n29.east){?};
-		\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n31) at ([yshift=-5em]n21.south){\small\bfnew{Candidate :}};
+		\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n31) at ([yshift=-5em]n21.south){\small{机器译文：}};
 		\node[cand,anchor=west] (n32) at ([xshift=0.0em]n31.east){Can};
 		\node[cand,anchor=west] (n33) at ([xshift=1em]n32.east){I};
 		\node[cand,anchor=west] (n34) at ([xshift=1em]n33.east){have};
-		\node[cand,anchor=west] (n35) at ([xshift=1em]n34.east){this};
+		\node[cand,anchor=west] (n35) at ([xshift=1em]n34.east){it};
 		\node[cand,anchor=west] (n36) at ([xshift=1em]n35.east){like};
 		\node[cand,anchor=west] (n37) at ([xshift=1em]n36.east){he};
-		\node[cand,anchor=west] (n38) at ([xshift=1em]n37.east){do};
+		\node[cand,anchor=west] (n38) at ([xshift=1em]n37.east){?};
-		\node[cand,anchor=west] (n39) at ([xshift=1em]n38.east){it};
-		\node[cand,anchor=west] (n40) at ([xshift=1em]n39.east){?};
+		\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n41) at ([yshift=-6em]n31.south){\small{参考答案：}};
-		\node[align=center,minimum width=2.4em,minimum height=1.6em,anchor=north,minimum width=6em] (n41) at ([yshift=-6em]n31.south){\small\bfnew{Candidate :}};
 		\node[ref,anchor=west] (n42) at ([xshift=0.0em]n41.east){Can};
 		\node[ref,anchor=west] (n43) at ([xshift=1em]n42.east){I};
 		\node[ref,anchor=west] (n44) at ([xshift=1em]n43.east){eat};
 		\node[ref,anchor=west] (n45) at ([xshift=1em]n44.east){this};
 		\node[ref,anchor=west] (n46) at ([xshift=1em]n45.east){Can};
 		\node[ref,anchor=west] (n47) at ([xshift=1em]n46.east){like};
-		\node[ref,anchor=west] (n48) at ([xshift=1em]n47.east){he};
+		\node[ref,anchor=west] (n48) at ([xshift=1em]n47.east){him};
-		\node[ref,anchor=west] (n49) at ([xshift=1em]n48.east){did};
+		\node[ref,anchor=west] (n49) at ([xshift=1em]n48.east){?};
-		\node[ref,anchor=west] (n50) at ([xshift=1em]n49.east){?};
 		\draw[line width=1.6pt,blue!40] (n12.south) -- (n22.north);
 		\draw[line width=1.6pt,blue!40] (n13.south) -- (n23.north);
-		\draw[line width=1.6pt,blue!40] (n15.south) -- (n25.north);
 		\draw[line width=1.6pt,blue!40] (n16.south) -- (n27.north);
-		\draw[line width=1.6pt,blue!40] (n17.south) -- (n28.north);
+		\draw[line width=1.6pt,blue!40] (n18.south) -- (n29.north);
-		\draw[line width=1.6pt,blue!40] (n20.south) -- (n30.north);
+		\draw[line width=1.6pt,orange!40] (n17.south) -- (n28.north);
-		\draw[line width=1.6pt,orange!40] (n18.south) -- (n29.north);
 		\draw[line width=2pt,ugreen!40](n14.south) -- (n24.north);
 		\draw[line width=1.6pt,blue!40] (n32.south) -- (n46.north);
 		\draw[line width=1.6pt,blue!40] (n33.south) -- (n43.north);
-		\draw[line width=1.6pt,blue!40] (n35.south) -- (n45.north);
 		\draw[line width=1.6pt,blue!40] (n36.south) -- (n47.north);
-		\draw[line width=1.6pt,blue!40] (n37.south) -- (n48.north);
+		\draw[line width=1.6pt,blue!40] (n38.south) -- (n49.north);
-		\draw[line width=1.6pt,blue!40] (n40.south) -- (n50.north);
+		\draw[line width=1.6pt,orange!40] (n37.south) -- (n48.north);
-		\draw[line width=1.6pt,orange!40] (n38.south) -- (n49.north);
 		\draw[line width=2pt,ugreen!40](n34.south) -- (n44.north);
 \end{tikzpicture}
--- a/Chapter4/chapter4.aux
+++ b/Chapter4/chapter4.aux
+\relax 
+\providecommand\zref@newlabel[2]{}
+\providecommand\hyper@newdestlabel[2]{}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {chapter}{\numberline {1}翻译质量评价}{11}{chapter.1}\protected@file@percent }
+\@writefile{lof}{\defcounter {refsection}{0}\relax }\@writefile{lof}{\addvspace {10\p@ }}
+\@writefile{lot}{\defcounter {refsection}{0}\relax }\@writefile{lot}{\addvspace {10\p@ }}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {section}{\numberline {1.1}译文质量评价所面临的挑战}{11}{section.1.1}\protected@file@percent }
+\@writefile{lot}{\defcounter {refsection}{0}\relax }\@writefile{lot}{\contentsline {table}{\numberline {1.1}{\ignorespaces 汉译英译文质量评价实例\relax }}{12}{table.caption.3}\protected@file@percent }
+\providecommand*\caption@xref[2]{\@setref\relax\@undefined{#1}}
+\newlabel{tab:4-1}{{1.1}{12}{汉译英译文质量评价实例\relax }{table.caption.3}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {section}{\numberline {1.2}人工评价}{13}{section.1.2}\protected@file@percent }
+\newlabel{Manual evaluation}{{1.2}{13}{人工评价}{section.1.2}{}}
+\@writefile{lof}{\defcounter {refsection}{0}\relax }\@writefile{lof}{\contentsline {figure}{\numberline {1.1}{\ignorespaces 译文质量评价方法逻辑图\relax }}{14}{figure.caption.4}\protected@file@percent }
+\newlabel{fig:4-2}{{1.1}{14}{译文质量评价方法逻辑图\relax }{figure.caption.4}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsection}{\numberline {1.2.1}评价策略}{14}{subsection.1.2.1}\protected@file@percent }
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsection}{\numberline {1.2.2}打分标准}{15}{subsection.1.2.2}\protected@file@percent }
+\newlabel{sec:human-eval-scoring}{{1.2.2}{15}{打分标准}{subsection.1.2.2}{}}
+\newlabel{eq:4-1}{{1.1}{16}{打分标准}{equation.1.2.1}{}}
+\newlabel{eq:4-2}{{1.2}{16}{打分标准}{equation.1.2.2}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {section}{\numberline {1.3}有参考答案的自动评价}{17}{section.1.3}\protected@file@percent }
+\newlabel{Automatic evaluation with reference answers}{{1.3}{17}{有参考答案的自动评价}{section.1.3}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsection}{\numberline {1.3.1}基于词串比对的方法}{17}{subsection.1.3.1}\protected@file@percent }
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsubsection}{1.基于距离的方法}{17}{section*.5}\protected@file@percent }
+\newlabel{eq:4-3}{{1.3}{17}{1.基于距离的方法}{equation.1.3.3}{}}
+\newlabel{eg:4-1}{{1.1}{18}{}{exampleT.1.1}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsubsection}{2.基于$\bm  {n}$-gram的方法}{18}{section*.6}\protected@file@percent }
+\newlabel{sec:ngram-eval}{{1.3.1}{18}{2.基于$\bm {n}$-gram的方法}{section*.6}{}}
+\newlabel{eq:4-4}{{1.4}{18}{2.基于$\bm {n}$-gram的方法}{equation.1.3.4}{}}
+\newlabel{eg:4-bleu-example}{{1.2}{18}{}{exampleT.1.2}{}}
+\newlabel{eq:4-5}{{1.5}{18}{2.基于$\bm {n}$-gram的方法}{equation.1.3.5}{}}
+\newlabel{eq:4-6}{{1.6}{19}{2.基于$\bm {n}$-gram的方法}{equation.1.3.6}{}}
+\newlabel{eq:4-7}{{1.7}{19}{2.基于$\bm {n}$-gram的方法}{equation.1.3.7}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsection}{\numberline {1.3.2}基于词对齐的方法}{20}{subsection.1.3.2}\protected@file@percent }
+\newlabel{eg:4-2}{{1.3}{20}{}{exampleT.1.3}{}}
+\@writefile{lof}{\defcounter {refsection}{0}\relax }\@writefile{lof}{\contentsline {figure}{\numberline {1.2}{\ignorespaces “绝对”匹配模型\relax }}{20}{figure.caption.7}\protected@file@percent }
+\newlabel{fig:4-3}{{1.2}{20}{“绝对”匹配模型\relax }{figure.caption.7}{}}
+\@writefile{lof}{\defcounter {refsection}{0}\relax }\@writefile{lof}{\contentsline {subfigure}{\numberline{(a)}{\ignorespaces {“绝对”匹配词对齐-1}}}{20}{figure.caption.7}\protected@file@percent }
+\@writefile{lof}{\defcounter {refsection}{0}\relax }\@writefile{lof}{\contentsline {subfigure}{\numberline{(b)}{\ignorespaces {“绝对”匹配词对齐-2}}}{20}{figure.caption.7}\protected@file@percent }
+\@writefile{lof}{\defcounter {refsection}{0}\relax }\@writefile{lof}{\contentsline {figure}{\numberline {1.3}{\ignorespaces “波特词干”匹配词对齐\relax }}{21}{figure.caption.8}\protected@file@percent }
+\newlabel{fig:4-4}{{1.3}{21}{“波特词干”匹配词对齐\relax }{figure.caption.8}{}}
+\@writefile{lof}{\defcounter {refsection}{0}\relax }\@writefile{lof}{\contentsline {figure}{\numberline {1.4}{\ignorespaces “同义词”匹配词对齐\relax }}{21}{figure.caption.9}\protected@file@percent }
+\newlabel{fig:4-5}{{1.4}{21}{“同义词”匹配词对齐\relax }{figure.caption.9}{}}
+\@writefile{lof}{\defcounter {refsection}{0}\relax }\@writefile{lof}{\contentsline {figure}{\numberline {1.5}{\ignorespaces 确定最终词对齐\relax }}{22}{figure.caption.10}\protected@file@percent }
+\newlabel{fig:4-6}{{1.5}{22}{确定最终词对齐\relax }{figure.caption.10}{}}
+\newlabel{eq:4-8}{{1.8}{22}{基于词对齐的方法}{equation.1.3.8}{}}
+\newlabel{eq:4-9}{{1.9}{22}{基于词对齐的方法}{equation.1.3.9}{}}
+\newlabel{eq:4-10}{{1.10}{22}{基于词对齐的方法}{equation.1.3.10}{}}
+\newlabel{eq:4-11}{{1.11}{22}{基于词对齐的方法}{equation.1.3.11}{}}
+\newlabel{eq:4-12}{{1.12}{22}{基于词对齐的方法}{equation.1.3.12}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsection}{\numberline {1.3.3}基于检测点的方法}{23}{subsection.1.3.3}\protected@file@percent }
+\newlabel{eg:4-3}{{1.4}{23}{}{exampleT.1.4}{}}
+\newlabel{eg:4-4}{{1.5}{23}{}{exampleT.1.5}{}}
+\newlabel{eg:4-5}{{1.6}{24}{}{exampleT.1.6}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsection}{\numberline {1.3.4}多策略融合的评价方法}{24}{subsection.1.3.4}\protected@file@percent }
+\newlabel{Evaluation method of Multi Strategy fusion}{{1.3.4}{24}{多策略融合的评价方法}{subsection.1.3.4}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsection}{\numberline {1.3.5}译文多样性}{25}{subsection.1.3.5}\protected@file@percent }
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsubsection}{1.增大参考答案集}{25}{section*.11}\protected@file@percent }
+\@writefile{lof}{\defcounter {refsection}{0}\relax }\@writefile{lof}{\contentsline {figure}{\numberline {1.6}{\ignorespaces HyTER中参考答案集的表示方式\relax }}{26}{figure.caption.12}\protected@file@percent }
+\newlabel{fig:4-7}{{1.6}{26}{HyTER中参考答案集的表示方式\relax }{figure.caption.12}{}}
+\newlabel{eg:4-6}{{1.7}{26}{}{exampleT.1.7}{}}
+\@writefile{lof}{\defcounter {refsection}{0}\relax }\@writefile{lof}{\contentsline {figure}{\numberline {1.7}{\ignorespaces 使用HyTER构造的参考答案集\relax }}{26}{figure.caption.13}\protected@file@percent }
+\newlabel{fig:4-8}{{1.7}{26}{使用HyTER构造的参考答案集\relax }{figure.caption.13}{}}
+\@writefile{lof}{\defcounter {refsection}{0}\relax }\@writefile{lof}{\contentsline {subfigure}{\numberline{(a)}{\ignorespaces {英语参考答案集表示}}}{26}{figure.caption.13}\protected@file@percent }
+\@writefile{lof}{\defcounter {refsection}{0}\relax }\@writefile{lof}{\contentsline {subfigure}{\numberline{(b)}{\ignorespaces {捷克语参考答案集表示}}}{26}{figure.caption.13}\protected@file@percent }
+\newlabel{eq:4-14}{{1.13}{27}{1.增大参考答案集}{equation.1.3.13}{}}
+\newlabel{eq:4-15}{{1.14}{27}{1.增大参考答案集}{equation.1.3.13}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsubsection}{2.利用分布式表示进行质量评价}{27}{section*.14}\protected@file@percent }
+\@writefile{lot}{\defcounter {refsection}{0}\relax }\@writefile{lot}{\contentsline {table}{\numberline {1.2}{\ignorespaces 常见的单词及句子分布表示\relax }}{28}{table.caption.15}\protected@file@percent }
+\newlabel{tab:4-2}{{1.2}{28}{常见的单词及句子分布表示\relax }{table.caption.15}{}}
+\newlabel{eq:4-16}{{1.15}{28}{2.利用分布式表示进行质量评价}{equation.1.3.15}{}}
+\newlabel{eq:4-17}{{1.16}{28}{2.利用分布式表示进行质量评价}{equation.1.3.16}{}}
+\newlabel{eq:4-18}{{1.17}{28}{2.利用分布式表示进行质量评价}{equation.1.3.17}{}}
+\newlabel{eq:4-19}{{1.18}{28}{2.利用分布式表示进行质量评价}{equation.1.3.18}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsection}{\numberline {1.3.6}相关性与显著性}{29}{subsection.1.3.6}\protected@file@percent }
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsubsection}{1. 自动评价与人工评价的相关性}{29}{section*.16}\protected@file@percent }
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsubsection}{2. 自动评价方法的统计显著性}{30}{section*.17}\protected@file@percent }
+\@writefile{lof}{\defcounter {refsection}{0}\relax }\@writefile{lof}{\contentsline {figure}{\numberline {1.8}{\ignorespaces 统计假设检验的流程\relax }}{31}{figure.caption.18}\protected@file@percent }
+\newlabel{fig:4-13}{{1.8}{31}{统计假设检验的流程\relax }{figure.caption.18}{}}
+\newlabel{eq:4-21}{{1.19}{31}{2. 自动评价方法的统计显著性}{equation.1.3.19}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {section}{\numberline {1.4}无参考答案的自动评价}{32}{section.1.4}\protected@file@percent }
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsection}{\numberline {1.4.1}质量评估任务}{32}{subsection.1.4.1}\protected@file@percent }
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsubsection}{1.单词级质量评估}{32}{section*.19}\protected@file@percent }
+\newlabel{eg:4-7}{{1.8}{32}{}{exampleT.1.8}{}}
+\@writefile{lof}{\defcounter {refsection}{0}\relax }\@writefile{lof}{\contentsline {figure}{\numberline {1.9}{\ignorespaces 单词级质量评估任务示意图\relax }}{33}{figure.caption.20}\protected@file@percent }
+\newlabel{fig:4-11}{{1.9}{33}{单词级质量评估任务示意图\relax }{figure.caption.20}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsubsection}{2.短语级质量评估}{33}{section*.21}\protected@file@percent }
+\@writefile{lof}{\defcounter {refsection}{0}\relax }\@writefile{lof}{\contentsline {figure}{\numberline {1.10}{\ignorespaces 短语级质量评估任务示意图\relax }}{34}{figure.caption.22}\protected@file@percent }
+\newlabel{fig:4-12}{{1.10}{34}{短语级质量评估任务示意图\relax }{figure.caption.22}{}}
+\newlabel{eg:4-8}{{1.9}{34}{}{exampleT.1.9}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsubsection}{3.句子级质量评估}{35}{section*.23}\protected@file@percent }
+\newlabel{eq:4-20}{{1.20}{36}{3.句子级质量评估}{equation.1.4.20}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsubsection}{4.文档级质量评估}{36}{section*.24}\protected@file@percent }
+\newlabel{eg:4-9}{{1.10}{36}{}{exampleT.1.10}{}}
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsection}{\numberline {1.4.2}怎样构建质量评估模型}{37}{subsection.1.4.2}\protected@file@percent }
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsection}{\numberline {1.4.3}质量评估的应用场景}{38}{subsection.1.4.3}\protected@file@percent }
+\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {section}{\numberline {1.5}小结及深入阅读}{39}{section.1.5}\protected@file@percent }
+\@setckpt{Chapter4/chapter4}{
+\setcounter{page}{42}
+\setcounter{equation}{20}
+\setcounter{enumi}{0}
+\setcounter{enumii}{0}
+\setcounter{enumiii}{0}
+\setcounter{enumiv}{0}
+\setcounter{footnote}{2}
+\setcounter{mpfootnote}{0}
+\setcounter{part}{0}
+\setcounter{chapter}{1}
+\setcounter{section}{5}
+\setcounter{subsection}{0}
+\setcounter{subsubsection}{0}
+\setcounter{paragraph}{0}
+\setcounter{subparagraph}{0}
+\setcounter{figure}{10}
+\setcounter{table}{2}
+\setcounter{tabx@nest}{0}
+\setcounter{listtotal}{0}
+\setcounter{listcount}{0}
+\setcounter{liststart}{0}
+\setcounter{liststop}{0}
+\setcounter{citecount}{0}
+\setcounter{citetotal}{0}
+\setcounter{multicitecount}{0}
+\setcounter{multicitetotal}{0}
+\setcounter{instcount}{142}
+\setcounter{maxnames}{3}
+\setcounter{minnames}{1}
+\setcounter{maxitems}{3}
+\setcounter{minitems}{1}
+\setcounter{citecounter}{0}
+\setcounter{maxcitecounter}{0}
+\setcounter{savedcitecounter}{0}
+\setcounter{uniquelist}{0}
+\setcounter{uniquename}{0}
+\setcounter{refsection}{0}
+\setcounter{refsegment}{0}
+\setcounter{maxextratitle}{0}
+\setcounter{maxextratitleyear}{0}
+\setcounter{maxextraname}{3}
+\setcounter{maxextradate}{0}
+\setcounter{maxextraalpha}{0}
+\setcounter{abbrvpenalty}{50}
+\setcounter{highnamepenalty}{50}
+\setcounter{lownamepenalty}{25}
+\setcounter{maxparens}{3}
+\setcounter{parenlevel}{0}
+\setcounter{mincomprange}{10}
+\setcounter{maxcomprange}{100000}
+\setcounter{mincompwidth}{1}
+\setcounter{afterword}{0}
+\setcounter{savedafterword}{0}
+\setcounter{annotator}{0}
+\setcounter{savedannotator}{0}
+\setcounter{author}{0}
+\setcounter{savedauthor}{0}
+\setcounter{bookauthor}{0}
+\setcounter{savedbookauthor}{0}
+\setcounter{commentator}{0}
+\setcounter{savedcommentator}{0}
+\setcounter{editor}{0}
+\setcounter{savededitor}{0}
+\setcounter{editora}{0}
+\setcounter{savededitora}{0}
+\setcounter{editorb}{0}
+\setcounter{savededitorb}{0}
+\setcounter{editorc}{0}
+\setcounter{savededitorc}{0}
+\setcounter{foreword}{0}
+\setcounter{savedforeword}{0}
+\setcounter{holder}{0}
+\setcounter{savedholder}{0}
+\setcounter{introduction}{0}
+\setcounter{savedintroduction}{0}
+\setcounter{namea}{0}
+\setcounter{savednamea}{0}
+\setcounter{nameb}{0}
+\setcounter{savednameb}{0}
+\setcounter{namec}{0}
+\setcounter{savednamec}{0}
+\setcounter{translator}{0}
+\setcounter{savedtranslator}{0}
+\setcounter{shortauthor}{0}
+\setcounter{savedshortauthor}{0}
+\setcounter{shorteditor}{0}
+\setcounter{savedshorteditor}{0}
+\setcounter{labelname}{0}
+\setcounter{savedlabelname}{0}
+\setcounter{institution}{0}
+\setcounter{savedinstitution}{0}
+\setcounter{lista}{0}
+\setcounter{savedlista}{0}
+\setcounter{listb}{0}
+\setcounter{savedlistb}{0}
+\setcounter{listc}{0}
+\setcounter{savedlistc}{0}
+\setcounter{listd}{0}
+\setcounter{savedlistd}{0}
+\setcounter{liste}{0}
+\setcounter{savedliste}{0}
+\setcounter{listf}{0}
+\setcounter{savedlistf}{0}
+\setcounter{location}{0}
+\setcounter{savedlocation}{0}
+\setcounter{organization}{0}
+\setcounter{savedorganization}{0}
+\setcounter{origlocation}{0}
+\setcounter{savedoriglocation}{0}
+\setcounter{origpublisher}{0}
+\setcounter{savedorigpublisher}{0}
+\setcounter{publisher}{0}
+\setcounter{savedpublisher}{0}
+\setcounter{language}{0}
+\setcounter{savedlanguage}{0}
+\setcounter{origlanguage}{0}
+\setcounter{savedoriglanguage}{0}
+\setcounter{pageref}{0}
+\setcounter{savedpageref}{0}
+\setcounter{textcitecount}{0}
+\setcounter{textcitetotal}{0}
+\setcounter{textcitemaxnames}{0}
+\setcounter{biburlbigbreakpenalty}{100}
+\setcounter{biburlbreakpenalty}{200}
+\setcounter{biburlnumpenalty}{0}
+\setcounter{biburlucpenalty}{0}
+\setcounter{biburllcpenalty}{0}
+\setcounter{smartand}{1}
+\setcounter{bbx:relatedcount}{0}
+\setcounter{bbx:relatedtotal}{0}
+\setcounter{parentequation}{0}
+\setcounter{notation}{0}
+\setcounter{dummy}{0}
+\setcounter{problem}{0}
+\setcounter{exerciseT}{0}
+\setcounter{exampleT}{10}
+\setcounter{vocabulary}{0}
+\setcounter{definitionT}{0}
+\setcounter{mdf@globalstyle@cnt}{0}
+\setcounter{mdfcountframes}{0}
+\setcounter{mdf@env@i}{0}
+\setcounter{mdf@env@ii}{0}
+\setcounter{mdf@zref@counter}{0}
+\setcounter{Item}{0}
+\setcounter{Hfootnote}{2}
+\setcounter{Hy@AnnotLevel}{0}
+\setcounter{bookmark@seq@number}{0}
+\setcounter{caption@flags}{0}
+\setcounter{continuedfloat}{0}
+\setcounter{cp@cnt}{0}
+\setcounter{cp@tempcnt}{0}
+\setcounter{subfigure}{0}
+\setcounter{lofdepth}{1}
+\setcounter{subtable}{0}
+\setcounter{lotdepth}{1}
+\setcounter{@pps}{0}
+\setcounter{@ppsavesec}{0}
+\setcounter{@ppsaveapp}{0}
+\setcounter{tcbbreakpart}{0}
+\setcounter{tcblayer}{0}
+\setcounter{tcolorbox@number}{0}
+\setcounter{section@level}{1}
+}
--- a/Chapter4/chapter4.tex
+++ b/Chapter4/chapter4.tex
--- a/Chapter5/Figures/figure-alignment-of-all-words-in-zh-en-sentence.tex
+++ b/Chapter5/Figures/figure-alignment-of-all-words-in-zh-en-sentence.tex
@@ -13,7 +13,7 @@
    \draw [-] (s1.south) -- (t0.north);
    \draw [-] (s2.south) -- (t0.north);
   {
-    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P(}}};
+    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P\;(}}};
    \node [anchor=south west,inner sep=0pt] (p2) at ([yshift=0.2em]t2.north east) {\small{{\color{ugreen} )}}};
    \node [anchor=west] (eq) at ([xshift=0.7em]p2.east) {\small{+}};
    }
@@ -29,7 +29,7 @@
    \draw [-] (s1.south) -- (t0.north);
    \draw [-] (s2.south) -- (t1.north);
    {
-    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P(}}};
+    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P\;(}}};
    \node [anchor=south west,inner sep=0pt] (p2) at ([yshift=0.2em]t2.north east) {\small{{\color{ugreen} )}}};
    \node [anchor=west] (eq) at ([xshift=0.7em]p2.east) {\small{+}};
    }
@@ -45,7 +45,7 @@
    \draw [-] (s1.south) -- (t0.north);
    \draw [-] (s2.south) -- (t2.north);
    {
-    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P(}}};
+    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P\;(}}};
    \node [anchor=south west,inner sep=0pt] (p2) at ([yshift=0.2em]t2.north east) {\small{{\color{ugreen} )}}};
    \node [anchor=west] (eq) at ([xshift=0.7em]p2.east) {\small{+}};
    }
@@ -61,7 +61,7 @@
    \draw [-] (s1.south) -- ([yshift=-0.2em]t1.north);
    \draw [-] (s2.south) -- (t0.north);
    {
-    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P(}}};
+    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P\;(}}};
    \node [anchor=south west,inner sep=0pt] (p2) at ([yshift=0.2em]t2.north east) {\small{{\color{ugreen} )}}};
    \node [anchor=west] (eq) at ([xshift=0.7em]p2.east) {\small{+}};
    }
@@ -77,7 +77,7 @@
    \draw [-] (s1.south) -- ([yshift=-0.2em]t1.north);
    \draw [-] (s2.south) -- ([yshift=-0.2em]t1.north);
    {
-    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P(}}};
+    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P\;(}}};
    \node [anchor=south west,inner sep=0pt] (p2) at ([yshift=0.2em]t2.north east) {\small{{\color{ugreen} )}}};
    \node [anchor=west] (eq) at ([xshift=0.7em]p2.east) {\small{+}};
    }
@@ -93,7 +93,7 @@
    \draw [-] (s1.south) -- ([yshift=-0.2em]t1.north);
    \draw [-] (s2.south) -- (t2.north);
    {
-    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P(}}};
+    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P\;(}}};
    \node [anchor=south west,inner sep=0pt] (p2) at ([yshift=0.2em]t2.north east) {\small{{\color{ugreen} )}}};
    \node [anchor=west] (eq) at ([xshift=0.7em]p2.east) {\small{+}};
    }
@@ -109,7 +109,7 @@
    \draw [-] (s1.south) -- (t2.north);
    \draw [-] (s2.south) -- (t0.north);
    {
-    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P(}}};
+    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P\;(}}};
    \node [anchor=south west,inner sep=0pt] (p2) at ([yshift=0.2em]t2.north east) {\small{{\color{ugreen} )}}};
    \node [anchor=west] (eq) at ([xshift=0.7em]p2.east) {\small{+}};
    }
@@ -125,7 +125,7 @@
    \draw [-] (s1.south) -- (t2.north);
    \draw [-] (s2.south) -- (t1.north);
   {
-    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P(}}};
+    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P\;(}}};
    \node [anchor=south west,inner sep=0pt] (p2) at ([yshift=0.2em]t2.north east) {\small{{\color{ugreen} )}}};
    \node [anchor=west] (eq) at ([xshift=0.7em]p2.east) {\small{+}};
    }
@@ -141,9 +141,9 @@
    \draw [-] (s1.south) -- (t2.north);
    \draw [-] (s2.south) -- (t2.north);
    {
-    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P(}}};
+    \node [anchor=south east,inner sep=0pt] (p) at (t0.north west) {\small{{\color{ugreen} P\;(}}};
    \node [anchor=south west,inner sep=0pt] (p2) at ([yshift=0.2em]t2.north east) {\small{{\color{ugreen} )}}};
-    \node [anchor=west] (eq) at ([xshift=0.7em]p2.east) {\normalsize{= \ P($\mathbf{s}|\mathbf{t}$)}};
+    \node [anchor=west] (eq) at ([xshift=0.7em]p2.east) {\normalsize{= \ P\,($\mathbf{s}|\mathbf{t}$)}};
    }
    }
    \end{scope}

--- a/Chapter5/Figures/figure-greedy-mt-decoding-pseudo-code.tex
+++ b/Chapter5/Figures/figure-greedy-mt-decoding-pseudo-code.tex
@@ -8,10 +8,10 @@
 \node [anchor=north west,inner sep=2pt,align=left] (line4) at ([yshift=-1pt]line3.south west) {\textrm{3: \textbf{for} $i$ in $[1,m]$ \textbf{do}}};
 \node [anchor=north west,inner sep=2pt,align=left] (line5) at ([yshift=-1pt]line4.south west) {\textrm{4: \hspace{1em} $h = \phi$}};
 \node [anchor=north west,inner sep=2pt,align=left] (line6) at ([yshift=-1pt]line5.south west) {\textrm{5: \hspace{1em} \textbf{foreach} $j$ in $[1,m]$ \textbf{do}}};
-\node [anchor=north west,inner sep=2pt,align=left] (line7) at ([yshift=-1pt]line6.south west) {\textrm{6: \hspace{2em} \textbf{if} $used[j]=$ \textbf{true} \textbf{then}}};
+\node [anchor=north west,inner sep=2pt,align=left] (line7) at ([yshift=-1pt]line6.south west) {\textrm{6: \hspace{2em} \textbf{if} $used[j]=$ \textbf{false} \textbf{then}}};
 \node [anchor=north west,inner sep=2pt,align=left] (line8) at ([yshift=-1pt]line7.south west) {\textrm{7: \hspace{3em} $h = h \cup \textrm{\textsc{Join}}(best,\pi[j])$}};
 \node [anchor=north west,inner sep=2pt,align=left] (line9) at ([yshift=-1pt]line8.south west) {\textrm{8: \hspace{1em} $best = \textrm{\textsc{PruneForTop1}}(h)$}};
-\node [anchor=north west,inner sep=2pt,align=left] (line10) at ([yshift=-1pt]line9.south west) {\textrm{9: \hspace{1em} $used[best.j] = \textrm{\textsc{\textbf{true}}}$}};
+\node [anchor=north west,inner sep=2pt,align=left] (line10) at ([yshift=-1pt]line9.south west) {\textrm{9: \hspace{1em} $used[best.j] = \textrm{\textbf{true}}$}};
 \node [anchor=north west,inner sep=2pt,align=left] (line11) at ([yshift=-1pt]line10.south west) {\textrm{10: \textbf{return} $best.translation$}};
 \node [anchor=south west,inner sep=2pt,align=left] (head1) at ([yshift=1pt]line1.north west) {输出: 找到的最佳译文};
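
The greedy decoding pseudo-code in this figure (with the corrected `used[j] = false` test) can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the book's implementation: the `Hypothesis` class, the `join` helper, the bigram scorer `lm`, and the layout of `pi` (one non-empty list of `(word, probability)` options per source word) are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    translation: tuple = ()  # target words generated so far
    score: float = 1.0       # model score of the partial translation
    j: int = -1              # source position consumed by the last step


def join(best, j, options, lm):
    """Extend `best` with every candidate translation of source word j."""
    extended = []
    for word, p_trans in options:
        prev = best.translation[-1] if best.translation else "<s>"
        score = best.score * p_trans * lm(prev, word)  # assumed scoring rule
        extended.append(Hypothesis(best.translation + (word,), score, j))
    return extended


def greedy_decode(pi, lm):
    """pi[j] holds the translation options of source word j (assumed non-empty)."""
    m = len(pi)
    best = Hypothesis()
    used = [False] * m
    for _ in range(m):
        h = []
        for j in range(m):
            if not used[j]:  # only source words that are not yet translated
                h.extend(join(best, j, pi[j], lm))
        best = max(h, key=lambda hyp: hyp.score)  # PruneForTop1
        used[best.j] = True
    return best.translation


# Toy usage: a uniform "language model" that scores every bigram as 1.0.
toy_pi = [[("I", 0.9)], [("like", 0.8), ("love", 0.2)]]
toy_result = greedy_decode(toy_pi, lambda prev, word: 1.0)
```

Because only the single best hypothesis survives each step, the loop performs `m` extensions in total; a beam search would keep the top-k entries of `h` instead of one.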

--- a/Chapter5/Figures/figure-translation-pipeline.tex
+++ b/Chapter5/Figures/figure-translation-pipeline.tex
@@ -47,8 +47,8 @@
 }
 {
 \node [anchor=north west] (label1) at ([xshift=0.6em,yshift=0.0em]sent-1.south east) {{分析}};
-\node [anchor=north west] (label2) at ([yshift=-1.8em]label1.south west) {{转换}};
+\node [anchor=north west] (label2) at ([yshift=-1.5em]label1.south west) {{转换}};
-\node [anchor=north west] (label3) at ([yshift=-1.3em]label2.south west) {{生成}};
+\node [anchor=north west] (label3) at ([yshift=-1.1em]label2.south west) {{生成}};
 }
 {\scriptsize
 		\begin{scope}

--- a/Chapter5/chapter5.tex
+++ b/Chapter5/chapter5.tex
--- a/Chapter6/chapter6.tex
+++ b/Chapter6/chapter6.tex
@@ -74,7 +74,7 @@
 \parinterval 对于建模来说，IBM模型1很好地化简了翻译问题，但是由于使用了很强的假设，导致模型和实际情况有较大差异。其中一个比较严重的问题是假设词对齐的生成概率服从均匀分布。IBM模型2抛弃了这个假设\upcite{DBLP:journals/coling/BrownPPM94}。它认为词对齐是有倾向性的，它与源语言单词的位置和目标语言单词的位置有关。具体来说，对齐位置$a_j$的生成概率与位置$j$、源语言句子长度$m$和目标语言句子长度$l$有关，形式化表述为：
 \begin{eqnarray}
-\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\vectorn{t}) \equiv a(a_j|j,m,l)
+\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\vectorn{\emph{t}}) \equiv a(a_j|j,m,l)
 \label{eq:6-1}
 \end{eqnarray}
@@ -92,23 +92,23 @@
 \parinterval IBM模型2的其他假设均与模型1相同，即源语言长度预测概率及源语言单词生成概率被定义为：
 \begin{eqnarray}
-\funp{P}(m|\vectorn{t}) & \equiv & \varepsilon \label{eq:s-len-gen-prob} \\
+\funp{P}(m|\vectorn{\emph{t}}) & \equiv & \varepsilon \label{eq:s-len-gen-prob} \\
-\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\vectorn{t}) & \equiv & f(s_j|t_{a_j})
+\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\vectorn{\emph{t}}) & \equiv & f(s_j|t_{a_j})
 \label{eq:s-word-gen-prob}
 \end{eqnarray}
-把公式\ref{eq:s-len-gen-prob}、\ref{eq:s-word-gen-prob}和\ref{eq:6-1} 重新带入公式$\funp{P}(\vectorn{s},\vectorn{a}|\vectorn{t})=\funp{P}(m|\vectorn{t})\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\vectorn{t})}$\\${\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\vectorn{t})}$ 和$\funp{P}(\vectorn{s}|\vectorn{t})= \sum_{\vectorn{a}}\funp{P}(\vectorn{s},\vectorn{a}|\vectorn{t})$，可以得到IBM模型2的数学描述：
+把公式\eqref{eq:6-1}、\eqref{eq:s-len-gen-prob}和\eqref{eq:s-word-gen-prob}重新代入公式$\funp{P}(\vectorn{\emph{s}},\vectorn{\emph{a}}|\vectorn{\emph{t}})=\funp{P}(m|\vectorn{\emph{t}})\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\vectorn{\emph{t}})}$\\${\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\vectorn{\emph{t}})}$ 和$\funp{P}(\vectorn{\emph{s}}|\vectorn{\emph{t}})= \sum_{\vectorn{\emph{a}}}\funp{P}(\vectorn{\emph{s}},\vectorn{\emph{a}}|\vectorn{\emph{t}})$，可以得到IBM模型2的数学描述：
 \begin{eqnarray}
-\funp{P}(\vectorn{s}| \vectorn{t}) & = &  \sum_{\vectorn{a}}{\funp{P}(\vectorn{s},\vectorn{a}| \vectorn{t})} \nonumber \\
+\funp{P}(\vectorn{\emph{s}}| \vectorn{\emph{t}}) & = &  \sum_{\vectorn{\emph{a}}}{\funp{P}(\vectorn{\emph{s}},\vectorn{\emph{a}}| \vectorn{\emph{t}})} \nonumber \\
                       & = & \sum_{a_1=0}^{l}{\cdots}\sum _{a_m=0}^{l}{\varepsilon}\prod_{j=1}^{m}{a(a_j|j,m,l)f(s_j|t_{a_j})}
 \label{eq:6-4}
 \end{eqnarray}
-\parinterval 类似于模型1，模型2的表达式\ref{eq:6-4}也能被拆分为两部分进行理解。第一部分：遍历所有的$\vectorn{a}$；第二部分：对于每个$\vectorn{a}$累加对齐概率$\funp{P}(\vectorn{s},\vectorn{a}| \vectorn{t})$，即计算对齐概率$a(a_j|j,m,l)$和词汇翻译概率$f(s_j|t_{a_j})$对于所有源语言位置的乘积。
+\parinterval 类似于模型1，模型2的表达式\eqref{eq:6-4}也能被拆分为两部分进行理解。第一部分：遍历所有的$\vectorn{\emph{a}}$；第二部分：对于每个$\vectorn{\emph{a}}$累加对齐概率$\funp{P}(\vectorn{\emph{s}},\vectorn{\emph{a}}| \vectorn{\emph{t}})$，即计算对齐概率$a(a_j|j,m,l)$和词汇翻译概率$f(s_j|t_{a_j})$对于所有源语言位置的乘积。
 \parinterval 同样的，模型2的解码及训练优化和模型1的十分相似，在此不再赘述，详细推导过程可以参看{\chapterfive}解码及计算优化部分。这里直接给出IBM模型2的最终表达式：
 \begin{eqnarray}
-\funp{P}(\vectorn{s}| \vectorn{t}) & = & \varepsilon \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} a(i|j,m,l) f(s_j|t_i)
+\funp{P}(\vectorn{\emph{s}}| \vectorn{\emph{t}}) & = & \varepsilon \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} a(i|j,m,l) f(s_j|t_i)
 \label{eq:6-5}
 \end{eqnarray}
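
The step from Eq. (6-4) to Eq. (6-5) swaps a sum over all $(l+1)^m$ alignments for a product of $m$ small sums. The sketch below checks this numerically on a toy sentence pair. The function names, the toy lexical table `f_table`, and the uniform alignment table are assumptions made purely for illustration; `t[0]` plays the role of the NULL token $t_0$.

```python
import itertools
import math


def model2_brute_force(s, t, a, f, eps=1.0):
    """Eq. (6-4): enumerate every alignment a_1..a_m with a_j in [0, l]."""
    l, m = len(t) - 1, len(s)
    total = 0.0
    for align in itertools.product(range(l + 1), repeat=m):
        prod = eps
        for j, i in enumerate(align, start=1):
            prod *= a(i, j, m, l) * f(s[j - 1], t[i])
        total += prod
    return total


def model2_factored(s, t, a, f, eps=1.0):
    """Eq. (6-5): product over source positions of a sum over target positions."""
    l, m = len(t) - 1, len(s)
    result = eps
    for j in range(1, m + 1):
        result *= sum(a(i, j, m, l) * f(s[j - 1], t[i]) for i in range(l + 1))
    return result


# Toy parameters (hypothetical values, not estimated from data).
f_table = {
    "a": {"NULL": 0.1, "x": 0.7, "y": 0.2},
    "b": {"NULL": 0.2, "x": 0.1, "y": 0.7},
}
toy_f = lambda s_word, t_word: f_table[s_word][t_word]
toy_a = lambda i, j, m, l: 1.0 / (l + 1)  # uniform alignment, as in Model 1

toy_s, toy_t = ["a", "b"], ["NULL", "x", "y"]
brute = model2_brute_force(toy_s, toy_t, toy_a, toy_f)
fast = model2_factored(toy_s, toy_t, toy_a, toy_f)
print(math.isclose(brute, fast))
```

The brute-force version costs $O((l+1)^m \cdot m)$ operations, while the factored version costs $O((l+1) \cdot m)$, which is why the factored form is the one used in practice.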
@@ -132,15 +132,15 @@
 \parinterval 针对此问题，基于HMM的词对齐模型抛弃了IBM模型1-2的绝对位置假设，将一阶隐马尔可夫模型用于词对齐问题\upcite{vogel1996hmm}。HMM词对齐模型认为，单词与单词之间并不是毫无联系的，对齐概率应该取决于对齐位置的差异而不是本身单词所在的位置。具体来说，位置$j$的对齐位置$a_j$的生成概率与前一个位置$j-1$的对齐位置$a_{j-1}$和译文长度$l$有关，形式化的表述为：
 \begin{eqnarray}
-\funp{P}(a_{j}|a_{1}^{j-1},s_{1}^{j-1},m,\vectorn{t})\equiv\funp{P}(a_{j}|a_{j-1},l)
+\funp{P}(a_{j}|a_{1}^{j-1},s_{1}^{j-1},m,\vectorn{\emph{t}})\equiv\funp{P}(a_{j}|a_{j-1},l)
 \label{eq:6-6}
 \end{eqnarray}
 \parinterval 这里用图\ref{fig:6-4}的例子对公式进行说明。在IBM模型1-2中，单词的对齐都是与单词所在的绝对位置有关。但在HMM词对齐模型中，``你''对齐到``you''被形式化为$\funp{P}(a_{j}|a_{j-1},l)= P(5|4,5)$，意思是对于源语言位置$3(j=3)$上的单词，如果它的译文是第5个目标语言单词，上一个对齐位置是$4(a_{2}=4)$，对齐到目标语言位置$5(a_{j}=5)$的概率是多少？理想的情况下，通过$\funp{P}(a_{j}|a_{j-1},l)$，``你''对齐到``you''应该得到更高的概率，并且由于源语言单词``对''和``你''距离很近，因此其对应的对齐位置``with''和``you''的距离也应该很近。
-\parinterval 把公式$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\vectorn{t}) \equiv f(s_j|t_{a_j})$和\ref{eq:6-6}重新带入公式$\funp{P}(\vectorn{s},\vectorn{a}|\vectorn{t})=\funp{P}(m|\vectorn{t})$\\$\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\vectorn{t})\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\vectorn{t})}$和$\funp{P}(\vectorn{s}|\vectorn{t})= \sum_{\vectorn{a}}\funp{P}(\vectorn{s},\vectorn{a}|\vectorn{t})$,可得HMM词对齐模型的数学描述：
+\parinterval 把公式$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\vectorn{\emph{t}}) \equiv f(s_j|t_{a_j})$和公式\eqref{eq:6-6}重新代入公式$\funp{P}(\vectorn{\emph{s}},\vectorn{\emph{a}}|\vectorn{\emph{t}})=\funp{P}(m|\vectorn{\emph{t}})$\\$\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\vectorn{\emph{t}})\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\vectorn{\emph{t}})}$和$\funp{P}(\vectorn{\emph{s}}|\vectorn{\emph{t}})= \sum_{\vectorn{\emph{a}}}\funp{P}(\vectorn{\emph{s}},\vectorn{\emph{a}}|\vectorn{\emph{t}})$，可得HMM词对齐模型的数学描述：
 \begin{eqnarray}
-\funp{P}(\vectorn{s}| \vectorn{t})=\sum_{\vectorn{a}}{\funp{P}(m|\vectorn{t})}\prod_{j=1}^{m}{\funp{P}(a_{j}|a_{j-1},l)f(s_{j}|t_{a_j})}
+\funp{P}(\vectorn{\emph{s}}| \vectorn{\emph{t}})=\sum_{\vectorn{\emph{a}}}{\funp{P}(m|\vectorn{\emph{t}})}\prod_{j=1}^{m}{\funp{P}(a_{j}|a_{j-1},l)f(s_{j}|t_{a_j})}
 \label{eq:6-7}
 \end{eqnarray}
@@ -152,7 +152,7 @@
 \noindent 其中，$\mu( \cdot )$是隐马尔可夫模型的参数，可以通过训练得到。
-\parinterval 需要注意的是，公式\ref{eq:6-7}之所以被看作是一种隐马尔可夫模型，是由于其形式与标准的一阶隐马尔可夫模型无异。$\funp{P}(a_{j}|a_{j-1},l)$可以被看作是一种状态转移概率，$f(s_{j}|t_{a_j})$可以被看作是一种发射概率。关于隐马尔可夫模型具体的数学描述也可参考{\chapterthree}中的相关内容。
+\parinterval 需要注意的是，公式\eqref{eq:6-7}之所以被看作是一种隐马尔可夫模型，是由于其形式与标准的一阶隐马尔可夫模型无异。$\funp{P}(a_{j}|a_{j-1},l)$可以被看作是一种状态转移概率，$f(s_{j}|t_{a_j})$可以被看作是一种发射概率。关于隐马尔可夫模型具体的数学描述也可参考{\chapterthree}中的相关内容。
@@ -175,7 +175,7 @@
 \parinterval 这里将会给出另一个翻译模型，能在一定程度上解决上面提到的问题\upcite{DBLP:journals/coling/BrownPPM94,och2003systematic}。该模型把目标语言生成源语言的过程分解为如下几个步骤：首先，确定每个目标语言单词生成源语言单词的个数，这里把它称为{\small\sffamily\bfseries{繁衍率}}\index{繁衍率}或{\small\sffamily\bfseries{产出率}}\index{产出率}（Fertility）\index{Fertility}；其次，决定目标语言句子中每个单词生成的源语言单词都是什么，即决定生成的第一个源语言单词是什么，生成的第二个源语言单词是什么，以此类推。这样每个目标语言单词就对应了一个源语言单词列表；最后把各组源语言单词列表中的每个单词都放置到合适的位置上，完成目标语言译文到源语言句子的生成。
-\parinterval 对于句对$(\vectorn{s},\vectorn{t})$，令$\varphi$表示产出率，同时令${\tau}$表示每个目标语言单词对应的源语言单词列表。图{\ref{fig:6-5}}描述了一个英语句子生成汉语句子的过程。
+\parinterval 对于句对$(\vectorn{\emph{s}},\vectorn{\emph{t}})$，令$\varphi$表示产出率，同时令${\tau}$表示每个目标语言单词对应的源语言单词列表。图{\ref{fig:6-5}}描述了一个英语句子生成汉语句子的过程。
 \begin{itemize}
 \vspace{0.3em}
@@ -183,7 +183,7 @@
 \vspace{0.3em}
 \item 其次，确定英语句子中每个单词生成的汉语单词列表。比如``Scientists''生成``科学家''和``们''两个汉语单词，可表示为${\tau}_1=\{{\tau}_{11}=\textrm{``科学家''},{\tau}_{12}=\textrm{``们''}\}$。 这里用特殊的空标记NULL表示翻译对空的情况；
 \vspace{0.3em}
-\item 最后，把生成的所有汉语单词放在合适的位置。比如``科学家''和``们''分别放在$\vectorn{s}$的位置1和位置2。可以用符号$\pi$记录生成的单词在源语言句子$\vectorn{s}$中的位置。比如``Scientists'' 生成的汉语单词在$\vectorn{s}$ 中的位置表示为${\pi}_{1}=\{{\pi}_{11}=1,{\pi}_{12}=2\}$。
+\item 最后，把生成的所有汉语单词放在合适的位置。比如``科学家''和``们''分别放在$\vectorn{\emph{s}}$的位置1和位置2。可以用符号$\pi$记录生成的单词在源语言句子$\vectorn{\emph{s}}$中的位置。比如``Scientists'' 生成的汉语单词在$\vectorn{\emph{s}}$ 中的位置表示为${\pi}_{1}=\{{\pi}_{11}=1,{\pi}_{12}=2\}$。
 \vspace{0.3em}
 \end{itemize}
@@ -196,13 +196,13 @@
 \end{figure}
 %----------------------------------------------
-\parinterval 为了表述清晰，这里重新说明每个符号的含义。$\vectorn{s}$、$\vectorn{t}$、$m$和$l$分别表示源语言句子、目标语言译文、源语言单词数量以及译文单词数量。$\vectorn{\varphi}$、$\vectorn{\tau}$ 和$\vectorn{\pi}$分别表示产出率、生成的源语言单词以及它们在源语言句子中的位置。${\varphi}_{i}$表示第$i$个目标语言单词$t_i$的产出率。${\tau}_{i}$和${\pi}_i$ 分别表示$t_i$生成的源语言单词列表及其在源语言句子$\vectorn{s}$中的位置列表。
+\parinterval 为了表述清晰，这里重新说明每个符号的含义。$\vectorn{\emph{s}}$、$\vectorn{\emph{t}}$、$m$和$l$分别表示源语言句子、目标语言译文、源语言单词数量以及译文单词数量。$\mathbf{\varphi}$、$\mathbf{\tau}$ 和$\mathbf{\pi}$分别表示产出率、生成的源语言单词以及它们在源语言句子中的位置。${\varphi}_{i}$表示第$i$个目标语言单词$t_i$的产出率。${\tau}_{i}$和${\pi}_i$ 分别表示$t_i$生成的源语言单词列表及其在源语言句子$\vectorn{\emph{s}}$中的位置列表。
-\parinterval 可以看出，一组$\tau$和$\pi$（记为$<\tau,\pi>$）可以决定一个对齐$\vectorn{a}$和一个源语句子$\vectorn{s}$。
+\parinterval 可以看出，一组$\tau$和$\pi$（记为$<\tau,\pi>$）可以决定一个对齐$\vectorn{\emph{a}}$和一个源语句子$\vectorn{\emph{s}}$。
-\noindent 相反的，一个对齐$\vectorn{a}$和一个源语句子$\vectorn{s}$可以对应多组$<\tau,\pi>$。如图\ref{fig:6-6}所示，不同的$<\tau,\pi>$对应同一个源语言句子和词对齐。它们的区别在于目标语单词``Scientists''生成的源语言单词``科学家''和`` 们''的顺序不同。这里把不同的$<\tau,\pi>$对应到的相同的源语句子$\vectorn{s}$和对齐$\vectorn{a}$记为$<\vectorn{s},\vectorn{a}>$。因此计算$\funp{P}(\vectorn{s},\vectorn{a}| \vectorn{t})$时需要把每个可能结果的概率加起来，如下：
+\noindent 相反的，一个对齐$\vectorn{\emph{a}}$和一个源语句子$\vectorn{\emph{s}}$可以对应多组$<\tau,\pi>$。如图\ref{fig:6-6}所示，不同的$<\tau,\pi>$对应同一个源语言句子和词对齐。它们的区别在于目标语单词``Scientists''生成的源语言单词``科学家''和``们''的顺序不同。这里把不同的$<\tau,\pi>$对应到的相同的源语句子$\vectorn{\emph{s}}$和对齐$\vectorn{\emph{a}}$记为$<\vectorn{\emph{s}},\vectorn{\emph{a}}>$。因此计算$\funp{P}(\vectorn{\emph{s}},\vectorn{\emph{a}}| \vectorn{\emph{t}})$时需要把每个可能结果的概率加起来，如下：
 \begin{equation}
-\funp{P}(\vectorn{s},\vectorn{a}| \vectorn{t})=\sum_{{<\tau,\pi>}\in{<\vectorn{s},\vectorn{a}>}}{\funp{P}(\tau,\pi|\vectorn{t}) }
+\funp{P}(\vectorn{\emph{s}},\vectorn{\emph{a}}| \vectorn{\emph{t}})=\sum_{{<\tau,\pi>}\in{<\vectorn{\emph{s}},\vectorn{\emph{a}}>}}{\funp{P}(\tau,\pi|\vectorn{\emph{t}}) }
 \label{eq:6-9}
 \end{equation}
@@ -216,10 +216,10 @@
 %----------------------------------------------
-\parinterval 不过$<\vectorn{s},\vectorn{a}>$中有多少组$<\tau,\pi>$呢？通过图\ref{fig:6-5}中的例子，可以推出$<\vectorn{s},\vectorn{a}>$应该包含$\prod_{i=0}^{l}{\varphi_i !}$个不同的二元组$<\tau,\pi>$。 这是因为在给定源语言句子和词对齐时，对于每一个$\tau_i$都有$\varphi_{i}!$种排列。
+\parinterval 不过$<\vectorn{\emph{s}},\vectorn{\emph{a}}>$中有多少组$<\tau,\pi>$呢？通过图\ref{fig:6-5}中的例子，可以推出$<\vectorn{\emph{s}},\vectorn{\emph{a}}>$应该包含$\prod_{i=0}^{l}{\varphi_i !}$个不同的二元组$<\tau,\pi>$。 这是因为在给定源语言句子和词对齐时，对于每一个$\tau_i$都有$\varphi_{i}!$种排列。
-\parinterval 进一步，$\funp{P}(\tau,\pi|\vectorn{t})$可以被表示如图\ref{fig:6-7}的形式。其中$\tau_{i1}^{k-1}$表示$\tau_{i1}\tau_{i2}\cdots \tau_{i(k-1)}$，$\pi_{i1}^{ k-1}$表示$\pi_{i1}\pi_{i2}\cdots \pi_{i(k-1)}$。可以把图\ref{fig:6-7}中的公式分为5个部分，并用不同的序号和颜色进行标注。每部分的具体含义是：
+\parinterval 进一步，$\funp{P}(\tau,\pi|\vectorn{\emph{t}})$可以被表示如图\ref{fig:6-7}的形式。其中$\tau_{i1}^{k-1}$表示$\tau_{i1}\tau_{i2}\cdots \tau_{i(k-1)}$，$\pi_{i1}^{ k-1}$表示$\pi_{i1}\pi_{i2}\cdots \pi_{i(k-1)}$。可以把图\ref{fig:6-7}中的公式分为5个部分，并用不同的序号和颜色进行标注。每部分的具体含义是：
 %----------------------------------------------
 \begin{figure}[htp]
@@ -233,11 +233,11 @@
 \begin{itemize}
 \vspace{0.5em}
-\item 第一部分：每个$i\in[1,l]$的目标语单词的产出率建模（{\color{red!70} 红色}），即$\varphi_i$的生成概率。它依赖于$\vectorn{t}$和区间$[1,i-1]$的目标语单词的产出率$\varphi_1^{i-1}$。\footnote{这里约定，当$i=1$ 时，$\varphi_1^0$ 表示空。}
+\item 第一部分：每个$i\in[1,l]$的目标语单词的产出率建模（{\color{red!70} 红色}），即$\varphi_i$的生成概率。它依赖于$\vectorn{\emph{t}}$和区间$[1,i-1]$的目标语单词的产出率$\varphi_1^{i-1}$。\footnote{这里约定，当$i=1$ 时，$\varphi_1^0$ 表示空。}
 \vspace{0.5em}
-\item 第二部分：$i=0$时的产出率建模（{\color{blue!70} 蓝色}），即空标记$t_0$的产出率生成概率。它依赖于$\vectorn{t}$和区间$[1,i-1]$的目标语单词的产出率$\varphi_1^l$。
+\item 第二部分：$i=0$时的产出率建模（{\color{blue!70} 蓝色}），即空标记$t_0$的产出率生成概率。它依赖于$\vectorn{\emph{t}}$和区间$[1,l]$的目标语单词的产出率$\varphi_1^l$。
 \vspace{0.5em}
-\item 第三部分：词汇翻译建模（{\color{green!70} 绿色}），目标语言单词$t_i$生成第$k$个源语言单词$\tau_{ik}$时的概率，依赖于$\vectorn{t}$、所有目标语言单词的产出率$\varphi_0^l$、区间$i\in[1,l]$的目标语言单词生成的源语言单词$\tau_1^{i-1}$和目标语单词$t_i$生成的前$k$个源语言单词$\tau_{i1}^{k-1}$。
+\item 第三部分：词汇翻译建模（{\color{green!70} 绿色}），目标语言单词$t_i$生成第$k$个源语言单词$\tau_{ik}$时的概率，依赖于$\vectorn{\emph{t}}$、所有目标语言单词的产出率$\varphi_0^l$、区间$[1,i-1]$的目标语言单词生成的源语言单词$\tau_1^{i-1}$和目标语单词$t_i$生成的前$k-1$个源语言单词$\tau_{i1}^{k-1}$。
 \vspace{0.5em}
 \item 第四部分：对于每个$i\in[1,l]$的目标语言单词生成的源语言单词的扭曲度建模（{\color{yellow!70!black} 黄色}），即第$i$个目标语言单词生成的第$k$个源语言单词在源文中的位置$\pi_{ik}$ 的概率。其中$\pi_1^{i-1}$ 表示区间$[1,i-1]$的目标语言单词生成的源语言单词的扭曲度，$\pi_{i1}^{k-1}$表示第$i$目标语言单词生成的前$k-1$个源语言单词的扭曲度。
 \vspace{0.5em}
@@ -250,51 +250,51 @@
 \subsection{IBM 模型3}
-\parinterval IBM模型3通过一些假设对图\ref{fig:6-7}所表示的基本模型进行了化简。具体来说，对于每个$i\in[1,l]$，假设$\funp{P}(\varphi_i |\varphi_1^{i-1},\vectorn{t})$仅依赖于$\varphi_i$和$t_i$，$\funp{P}(\pi_{ik}|\pi_{i1}^{k-1},\pi_1^{i-1},\tau_0^l,\varphi_0^l,\vectorn{t})$仅依赖于$\pi_{ik}$、$i$、$m$和$l$。而对于所有的$i\in[0,l]$，假设$\funp{P}(\tau_{ik}|\tau_{i1}^{k-1},\tau_1^{i-1},\varphi_0^l,\vectorn{t})$仅依赖于$\tau_{ik}$和$t_i$。这些假设的形式化描述为：
+\parinterval IBM模型3通过一些假设对图\ref{fig:6-7}所表示的基本模型进行了化简。具体来说，对于每个$i\in[1,l]$，假设$\funp{P}(\varphi_i |\varphi_1^{i-1},\vectorn{\emph{t}})$仅依赖于$\varphi_i$和$t_i$，$\funp{P}(\pi_{ik}|\pi_{i1}^{k-1},\pi_1^{i-1},\tau_0^l,\varphi_0^l,\vectorn{\emph{t}})$仅依赖于$\pi_{ik}$、$i$、$m$和$l$。而对于所有的$i\in[0,l]$，假设$\funp{P}(\tau_{ik}|\tau_{i1}^{k-1},\tau_1^{i-1},\varphi_0^l,\vectorn{\emph{t}})$仅依赖于$\tau_{ik}$和$t_i$。这些假设的形式化描述为：
 \begin{eqnarray}
-\funp{P}(\varphi_i|\varphi_1^{i-1},\vectorn{t})                                                              & = &{\funp{P}(\varphi_i|t_i)} \label{eq:6-10} \\
+\funp{P}(\varphi_i|\varphi_1^{i-1},\vectorn{\emph{t}})                                                              & = &{\funp{P}(\varphi_i|t_i)} \label{eq:6-10} \\
-\funp{P}(\tau_{ik} = s_j |\tau_{i1}^{k-1},\tau_{1}^{i-1},\varphi_0^t,\vectorn{t})             & = & t(s_j|t_i) \label{eq:6-11} \\
+\funp{P}(\tau_{ik} = s_j |\tau_{i1}^{k-1},\tau_{1}^{i-1},\varphi_0^l,\vectorn{\emph{t}})             & = & t(s_j|t_i) \label{eq:6-11} \\
-\funp{P}(\pi_{ik} = j |\pi_{i1}^{k-1},\pi_{1}^{i-1},\tau_{0}^{l},\varphi_{0}^{l},\vectorn{t}) & = & d(j|i,m,l) \label{eq:6-12}
+\funp{P}(\pi_{ik} = j |\pi_{i1}^{k-1},\pi_{1}^{i-1},\tau_{0}^{l},\varphi_{0}^{l},\vectorn{\emph{t}}) & = & d(j|i,m,l) \label{eq:6-12}
 \end{eqnarray}
-\parinterval 通常把$d(j|i,m,l)$称为扭曲度函数。这里$\funp{P}(\varphi_i|\varphi_1^{i-1},\vectorn{t})={\funp{P}(\varphi_i|t_i)}$和${\funp{P}(\pi_{ik}=j|\pi_{i1}^{k-1},}$ $\pi_{1}^{i-1},\tau_0^l,\varphi_0^l,\vectorn{t})=d(j|i,m,l)$仅对$1 \le i \le l$成立。这样就完成了图\ref{fig:6-7}中第1、 3和4部分的建模。
+\parinterval 通常把$d(j|i,m,l)$称为扭曲度函数。这里$\funp{P}(\varphi_i|\varphi_1^{i-1},\vectorn{\emph{t}})={\funp{P}(\varphi_i|t_i)}$和${\funp{P}(\pi_{ik}=j|\pi_{i1}^{k-1},}$ $\pi_{1}^{i-1},\tau_0^l,\varphi_0^l,\vectorn{\emph{t}})=d(j|i,m,l)$仅对$1 \le i \le l$成立。这样就完成了图\ref{fig:6-7}中第1、3和4部分的建模。
-\parinterval 对于$i=0$的情况需要单独进行考虑。实际上，$t_0$只是一个虚拟的单词。它要对应$\vectorn{s}$中原本为空对齐的单词。这里假设：要等其他非空对应单词都被生成（放置）后，才考虑这些空对齐单词的生成（放置）。即非空对单词都被生成后，在那些还有空的位置上放置这些空对的源语言单词。此外，在任何的空位置上放置空对的源语言单词都是等概率的，即放置空对齐源语言单词服从均匀分布。这样在已经放置了$k$个空对齐源语言单词的时候，应该还有$\varphi_0-k$个空位置。如果第$j$个源语言位置为空，那么
+\parinterval 对于$i=0$的情况需要单独进行考虑。实际上，$t_0$只是一个虚拟的单词。它要对应$\vectorn{\emph{s}}$中原本为空对齐的单词。这里假设：要等其他非空对应单词都被生成（放置）后，才考虑这些空对齐单词的生成（放置）。即非空对单词都被生成后，在那些还有空的位置上放置这些空对的源语言单词。此外，在任何的空位置上放置空对的源语言单词都是等概率的，即放置空对齐源语言单词服从均匀分布。这样在已经放置了$k$个空对齐源语言单词的时候，应该还有$\varphi_0-k$个空位置。如果第$j$个源语言位置为空，那么
 \begin{equation}
-\funp{P}(\pi_{0k}=j|\pi_{01}^{k-1},\pi_1^l,\tau_0^l,\varphi_0^l,\vectorn{t})=\frac{1}{\varphi_0-k}
+\funp{P}(\pi_{0k}=j|\pi_{01}^{k-1},\pi_1^l,\tau_0^l,\varphi_0^l,\vectorn{\emph{t}})=\frac{1}{\varphi_0-k}
 \label{eq:6-13}
 \end{equation}
 否则
 \begin{equation}
-\funp{P}(\pi_{0k}=j|\pi_{01}^{k-1},\pi_1^l,\tau_0^l,\varphi_0^l,\vectorn{t})=0
+\funp{P}(\pi_{0k}=j|\pi_{01}^{k-1},\pi_1^l,\tau_0^l,\varphi_0^l,\vectorn{\emph{t}})=0
 \label{eq:6-14}
 \end{equation}
 这样对于$t_0$所对应的$\tau_0$，就有
 {
 \begin{eqnarray}
-\prod_{k=1}^{\varphi_0}{\funp{P}(\pi_{0k}|\pi_{01}^{k-1},\pi_{1}^{l},\tau_{0}^{l},\varphi_{0}^{l},\vectorn{t})         }=\frac{1}{\varphi_{0}!}
+\prod_{k=1}^{\varphi_0}{\funp{P}(\pi_{0k}|\pi_{01}^{k-1},\pi_{1}^{l},\tau_{0}^{l},\varphi_{0}^{l},\vectorn{\emph{t}})         }=\frac{1}{\varphi_{0}!}
 \label{eq:6-15}
 \end{eqnarray}
 }
 \parinterval 而上面提到的$t_0$所对应的这些空位置是如何生成的呢？即如何确定哪些位置是要放置空对齐的源语言单词。在IBM模型3中，假设在所有的非空对齐源语言单词都被生成出来后（共$\varphi_1+\varphi_2+\cdots+{\varphi}_l$个非空对源语单词），这些单词后面都以$p_1$概率随机地产生一个``槽''用来放置空对齐单词。这样，${\varphi}_0$就服从了一个二项分布。于是得到
 {
 \begin{eqnarray}
-\funp{P}(\varphi_0|\vectorn{t})=\big(\begin{array}{c}
+\funp{P}(\varphi_0|\vectorn{\emph{t}})=\big(\begin{array}{c}
 \varphi_1+\varphi_2+\cdots+\varphi_l\\
 \varphi_0\\
 \end{array}\big)p_0^{\varphi_1+\varphi_2+\cdots+\varphi_l-\varphi_0}p_1^{\varphi_0}
 \label{eq:6-16}
 \end{eqnarray}
 }
-\noindent 其中，$p_0+p_1=1$。到此为止，已经完成了图\ref{fig:6-7}中第2和5部分的建模。最终根据这些假设可以得到$\funp{P}(\vectorn{s}| \vectorn{t})$的形式为：
+\noindent 其中，$p_0+p_1=1$。到此为止，已经完成了图\ref{fig:6-7}中第2和5部分的建模。最终根据这些假设可以得到$\funp{P}(\vectorn{\emph{s}}| \vectorn{\emph{t}})$的形式为：
 {
 \begin{eqnarray}
-{\funp{P}(\vectorn{s}| \vectorn{t})}&= &{\sum_{a_1=0}^{l}{\cdots}\sum_{a_m=0}^{l}{\Big[\big(\begin{array}{c}
+{\funp{P}(\vectorn{\emph{s}}| \vectorn{\emph{t}})}&= &{\sum_{a_1=0}^{l}{\cdots}\sum_{a_m=0}^{l}{\Big[\big(\begin{array}{c}
 m-\varphi_0\\
 \varphi_0\\
 \end{array}\big)}p_0^{m-2\varphi_0}p_1^{\varphi_0}\prod_{i=1}^{l}{{\varphi_i}!n(\varphi_i|t_i)    }} \nonumber \\
@@ -331,20 +331,20 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \end{figure}
 %----------------------------------------------
-\parinterval 在IBM模型的词对齐框架下，目标语的cept.只能是那些非空对齐的目标语单词，而且每个cept.只能由一个目标语言单词组成（通常把这类由一个单词组成的cept.称为独立单词cept.）。这里用$[i]$表示第$i$ 个独立单词cept.在目标语言句子中的位置。换句话说，$[i]$表示第$i$个非空对的目标语单词的位置。比如在本例中``mind''在$\vectorn{t}$中的位置表示为$[3]$。
+\parinterval 在IBM模型的词对齐框架下，目标语的cept.只能是那些非空对齐的目标语单词，而且每个cept.只能由一个目标语言单词组成（通常把这类由一个单词组成的cept.称为独立单词cept.）。这里用$[i]$表示第$i$ 个独立单词cept.在目标语言句子中的位置。换句话说，$[i]$表示第$i$个非空对的目标语单词的位置。比如在本例中``mind''在$\vectorn{\emph{t}}$中的位置表示为$[3]$。
 \parinterval 另外，可以用$\odot_{i}$表示位置为$[i]$的目标语言单词对应的那些源语言单词位置的平均值，如果这个平均值不是整数则对它向上取整。比如在本例中，目标语句中第4个cept. （``.''）对应在源语言句子中的第5个单词。可表示为${\odot}_{4}=5$。
 \parinterval 利用这些新引进的概念，模型4对模型3的扭曲度进行了修改。主要是把扭曲度分解为两类参数。对于$[i]$对应的源语言单词列表($\tau_{[i]}$)中的第一个单词($\tau_{[i]1}$），它的扭曲度用如下公式计算：
 \begin{equation}
-\funp{P}(\pi_{[i]1}=j|{\pi}_1^{[i]-1},{\tau}_0^l,{\varphi}_0^l,\vectorn{t})=d_{1}(j-{\odot}_{i-1}|A(t_{[i-1]}),B(s_j))
+\funp{P}(\pi_{[i]1}=j|{\pi}_1^{[i]-1},{\tau}_0^l,{\varphi}_0^l,\vectorn{\emph{t}})=d_{1}(j-{\odot}_{i-1}|A(t_{[i-1]}),B(s_j))
 \label{eq:6-22}
 \end{equation}
 \noindent 其中，第$i$个目标语言单词生成的第$k$个源语言单词的位置用变量$\pi_{ik}$表示。而对于列表($\tau_{[i]}$)中的其他的单词($\tau_{[i]k},1 < k \le \varphi_{[i]}$)的扭曲度，用如下公式计算：
 \begin{equation}
-\funp{P}(\pi_{[i]k}=j|{\pi}_{[i]1}^{k-1},\pi_1^{[i]-1},\tau_0^l,\varphi_0^l,\vectorn{t})=d_{>1}(j-\pi_{[i]k-1}|B(s_j))
+\funp{P}(\pi_{[i]k}=j|{\pi}_{[i]1}^{k-1},\pi_1^{[i]-1},\tau_0^l,\varphi_0^l,\vectorn{\emph{t}})=d_{>1}(j-\pi_{[i]k-1}|B(s_j))
 \label{eq:6-23}
 \end{equation}
@@ -373,19 +373,19 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \parinterval 为了解决这个问题，模型5在模型中增加了额外的约束。基本想法是，在放置一个源语言单词的时候检查这个位置是否已经放置了单词，如果可以则把这个放置过程赋予一定的概率，否则把它作为不可能事件。基于这个想法，就需要在逐个放置源语言单词的时候判断源语言句子的哪些位置为空。这里引入一个变量$v(j, {\tau_1}^{[i]-1}, \tau_{[i]1}^{k-1})$，它表示在放置$\tau_{[i]k}$之前（$\tau_1^{[i]-1}$ 和$\tau_{[i]1}^{k-1}$已经被放置完了），从源语言句子的第一个位置到位置$j$（包含$j$）为止还有多少个空位置。这里，把这个变量简写为$v_j$。于是，对于$[i]$所对应的源语言单词列表（$\tau_{[i]}$）中的第一个单词（$\tau_{[i]1}$），有：
 \begin{eqnarray}
-\funp{P}(\pi_{[i]1} = j | \pi_1^{[i]-1}, \tau_0^l, \varphi_0^l, \vectorn{t}) & = & d_1(v_j|B(s_j), v_{\odot_{i-1}}, v_m-(\varphi_{[i]}-1)) \cdot \nonumber \\
+\funp{P}(\pi_{[i]1} = j | \pi_1^{[i]-1}, \tau_0^l, \varphi_0^l, \vectorn{\emph{t}}) & = & d_1(v_j|B(s_j), v_{\odot_{i-1}}, v_m-(\varphi_{[i]}-1)) \cdot \nonumber \\
                                                                                                   &     & (1-\delta(v_j,v_{j-1}))
 \label{eq:6-24}
 \end{eqnarray}
 \parinterval 对于其他单词（$\tau_{[i]k}$, $1 < k\le\varphi_{[i]}$），有：
 \begin{eqnarray}
-&   & \funp{P}(\pi_{[i]k}=j|\pi_{[i]1}^{k-1}, \pi_1^{[i]-1}, \tau_0^l, \varphi_0^l,\vectorn{t}) \nonumber \\
+&   & \funp{P}(\pi_{[i]k}=j|\pi_{[i]1}^{k-1}, \pi_1^{[i]-1}, \tau_0^l, \varphi_0^l,\vectorn{\emph{t}}) \nonumber \\
 &= & d_{>1}(v_j-v_{\pi_{[i]k-1}}|B(s_j), v_m-v_{\pi_{[i]k-1}}-\varphi_{[i]}+k) \cdot (1-\delta(v_j,v_{j-1}))
 \label{eq:6-25}
 \end{eqnarray}
-\noindent 这里，因子$1-\delta(v_j, v_{j-1})$是用来判断第$j$个位置是不是为空。如果第$j$个位置为空则$v_j = v_{j-1}$，这样$\funp{P}(\pi_{[i]1}=j|\pi_1^{[i]-1}, \tau_0^l, \varphi_0^l, \vectorn{t}) = 0$。这样就从模型上避免了模型3和模型4中生成不存在的字符串的问题。这里还要注意的是，对于放置第一个单词的情况，影响放置的因素有$v_j$，$B(s_i)$和$v_{j-1}$。此外还要考虑位置$j$放置了第一个源语言单词以后它的右边是不是还有足够的位置留给剩下的$k-1$个源语言单词。参数$v_m-(\varphi_{[i]}-1)$正是为了考虑这个因素，这里$v_m$表示整个源语言句子中还有多少空位置，$\varphi_{[i]}-1$ 表示源语言位置$j$右边至少还要留出的空格数。对于放置非第一个单词的情况，主要是要考虑它和前一个放置位置的相对位置。这主要体现在参数$v_j-v_{\varphi_{[i]}k-1}$上。式\ref{eq:6-25} 的其他部分都可以用上面的理论解释，这里不再赘述。
+\noindent 这里，因子$1-\delta(v_j, v_{j-1})$是用来判断第$j$个位置是不是为空。如果第$j$个位置已被占用（不为空），则$v_j = v_{j-1}$，这样$\funp{P}(\pi_{[i]1}=j|\pi_1^{[i]-1}, \tau_0^l, \varphi_0^l, \vectorn{\emph{t}}) = 0$。这样就从模型上避免了模型3和模型4中生成不存在的字符串的问题。这里还要注意的是，对于放置第一个单词的情况，影响放置的因素有$v_j$，$B(s_j)$和$v_{\odot_{i-1}}$。此外还要考虑位置$j$放置了第一个源语言单词以后它的右边是不是还有足够的位置留给剩下的$\varphi_{[i]}-1$个源语言单词。参数$v_m-(\varphi_{[i]}-1)$正是为了考虑这个因素，这里$v_m$表示整个源语言句子中还有多少空位置，$\varphi_{[i]}-1$ 表示源语言位置$j$右边至少还要留出的空格数。对于放置非第一个单词的情况，主要是要考虑它和前一个放置位置的相对位置。这主要体现在参数$v_j-v_{\pi_{[i]k-1}}$上。式\eqref{eq:6-25} 的其他部分都可以用上面的理论解释，这里不再赘述。
 \parinterval 实际上，模型5和模型4的思想基本一致，即，先确定$\tau_{[i]1}$的绝对位置，然后再确定$\tau_{[i]}$中剩余单词的相对位置。模型5消除了产生不存在的句子的可能性，不过模型5的复杂性也大大增加了。
 %----------------------------------------------------------------------------------------
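
The vacancy variable $v_j$ of Model 5 is defined above as the number of still-empty source positions from position 1 up to and including position $j$. The sketch below computes it for a toy placement state and evaluates the factor $1-\delta(v_j, v_{j-1})$. The function names and the `occupied` set are hypothetical and introduced only for illustration; the distortion tables $d_1$ and $d_{>1}$ are not modeled here.

```python
def vacancies(occupied, m):
    """Return v where v[j] = number of empty source positions among 1..j (v[0] = 0)."""
    v = [0] * (m + 1)
    for j in range(1, m + 1):
        v[j] = v[j - 1] + (0 if j in occupied else 1)
    return v


def placement_allowed(v, j):
    """The factor 1 - delta(v_j, v_{j-1}): 1 if position j is still empty, else 0."""
    return 0 if v[j] == v[j - 1] else 1


# Toy state: a source sentence of length 5 where positions 2 and 4 are already filled.
toy_v = vacancies({2, 4}, 5)
toy_mask = [placement_allowed(toy_v, j) for j in range(1, 6)]
```

An occupied position leaves the running count unchanged, so $v_j = v_{j-1}$ and the factor is 0; this is what rules out placing two source words in the same slot.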
@@ -426,23 +426,23 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \subsection{``缺陷''问题}
-\parinterval IBM模型的缺陷是指翻译模型会把一部分概率分配给一些根本不存在的源语言字符串。如果用$\funp{P}(\textrm{well}|\vectorn{t})$表示$\funp{P}(\vectorn{s}| \vectorn{t})$在所有的正确的（可以理解为语法上正确的）$\vectorn{s}$上的和，即
+\parinterval IBM模型的缺陷是指翻译模型会把一部分概率分配给一些根本不存在的源语言字符串。如果用$\funp{P}(\textrm{well}|\vectorn{\emph{t}})$表示$\funp{P}(\vectorn{\emph{s}}| \vectorn{\emph{t}})$在所有的正确的（可以理解为语法上正确的）$\vectorn{\emph{s}}$上的和，即
 \begin{eqnarray}
-\funp{P}(\textrm{well}|\vectorn{t})=\sum_{\vectorn{s}\textrm{\;is\;well\;formed}}{\funp{P}(\vectorn{s}| \vectorn{t})}
+\funp{P}(\textrm{well}|\vectorn{\emph{t}})=\sum_{\vectorn{\emph{s}}\textrm{\;is\;well\;formed}}{\funp{P}(\vectorn{\emph{s}}| \vectorn{\emph{t}})}
 \label{eq:6-26}
 \end{eqnarray}
-\parinterval 类似地，用$\funp{P}(\textrm{ill}|\vectorn{t})$表示$\funp{P}(\vectorn{s}| \vectorn{t})$在所有的错误的（可以理解为语法上错误的）$\vectorn{s}$上的和。如果$\funp{P}(\textrm{well}|\vectorn{t})+ \funp{P}(\textrm{ill}|\vectorn{t})<1$，就把剩余的部分定义为$\funp{P}(\textrm{failure}|\vectorn{t})$。它的形式化定义为，
+\parinterval 类似地，用$\funp{P}(\textrm{ill}|\vectorn{\emph{t}})$表示$\funp{P}(\vectorn{\emph{s}}| \vectorn{\emph{t}})$在所有的错误的（可以理解为语法上错误的）$\vectorn{\emph{s}}$上的和。如果$\funp{P}(\textrm{well}|\vectorn{\emph{t}})+ \funp{P}(\textrm{ill}|\vectorn{\emph{t}})<1$，就把剩余的部分定义为$\funp{P}(\textrm{failure}|\vectorn{\emph{t}})$。它的形式化定义为，
 \begin{eqnarray}
-\funp{P}({\textrm{failure}|\vectorn{t}})  = 1 - \funp{P}({\textrm{well}|\vectorn{t}}) - \funp{P}({\textrm{ill}|\vectorn{t}})
+\funp{P}({\textrm{failure}|\vectorn{\emph{t}}})  = 1 - \funp{P}({\textrm{well}|\vectorn{\emph{t}}}) - \funp{P}({\textrm{ill}|\vectorn{\emph{t}}})
 \label{eq:6-27}
 \end{eqnarray}
-\parinterval 本质上，模型3和模型4就是对应$\funp{P}({\textrm{failure}|\vectorn{t}})>0$的情况。这部分概率是模型损失掉的。有时候也把这类缺陷称为{\small\bfnew{物理缺陷}}\index{物理缺陷}（Physical Deficiency\index{Physical Deficiency}）或{\small\bfnew{技术缺陷}}\index{技术缺陷}（Technical Deficiency\index{Technical Deficiency}）。还有一种缺陷被称作{\small\bfnew{精神缺陷}}（Spiritual Deficiency\index{Spiritual Deficiency}）或{\small\bfnew{逻辑缺陷}}\index{逻辑缺陷}（Logical Deficiency\index{Logical Deficiency}），它是指$\funp{P}({\textrm{well}|\vectorn{t}}) + \funp{P}({\textrm{ill}|\vectorn{t}}) = 1$ 且$\funp{P}({\textrm{ill}|\vectorn{t}}) > 0$的情况。模型1 和模型2 就有逻辑缺陷。可以注意到，技术缺陷只存在于模型3 和模型4 中，模型1和模型2并没有技术缺陷问题。根本原因在于模型1和模型2的词对齐是从源语言出发对应到目标语言，$\vectorn{t}$到$\vectorn{s}$ 的翻译过程实际上是从单词$s_1$开始到单词$s_m$ 结束，依次把每个源语言单词$s_j$对应到唯一一个目标语言位置。显然，这个过程能够保证每个源语言单词仅对应一个目标语言单词。但是，模型3 和模型4中对齐是从目标语言出发对应到源语言，$\vectorn{t}$到$\vectorn{s}$的翻译过程从$t_1$开始$t_l$ 结束，依次把目标语言单词$t_i$生成的单词对应到某个源语言位置上。但是这个过程不能保证$t_i$中生成的单词所对应的位置没有被其他单词占用，因此也就产生了缺陷。
+\parinterval 本质上，模型3和模型4就是对应$\funp{P}({\textrm{failure}|\vectorn{\emph{t}}})>0$的情况。这部分概率是模型损失掉的。有时候也把这类缺陷称为{\small\bfnew{物理缺陷}}\index{物理缺陷}（Physical Deficiency\index{Physical Deficiency}）或{\small\bfnew{技术缺陷}}\index{技术缺陷}（Technical Deficiency\index{Technical Deficiency}）。还有一种缺陷被称作{\small\bfnew{精神缺陷}}（Spiritual Deficiency\index{Spiritual Deficiency}）或{\small\bfnew{逻辑缺陷}}\index{逻辑缺陷}（Logical Deficiency\index{Logical Deficiency}），它是指$\funp{P}({\textrm{well}|\vectorn{\emph{t}}}) + \funp{P}({\textrm{ill}|\vectorn{\emph{t}}}) = 1$ 且$\funp{P}({\textrm{ill}|\vectorn{\emph{t}}}) > 0$的情况。模型1 和模型2 就有逻辑缺陷。可以注意到，技术缺陷只存在于模型3 和模型4 中，模型1和模型2并没有技术缺陷问题。根本原因在于模型1和模型2的词对齐是从源语言出发对应到目标语言，$\vectorn{\emph{t}}$到$\vectorn{\emph{s}}$ 的翻译过程实际上是从单词$s_1$开始到单词$s_m$ 结束，依次把每个源语言单词$s_j$对应到唯一一个目标语言位置。显然，这个过程能够保证每个源语言单词仅对应一个目标语言单词。但是，模型3 和模型4中对齐是从目标语言出发对应到源语言，$\vectorn{\emph{t}}$到$\vectorn{\emph{s}}$的翻译过程从$t_1$开始到$t_l$结束，依次把目标语言单词$t_i$生成的单词对应到某个源语言位置上。但是这个过程不能保证$t_i$生成的单词所对应的位置没有被其他单词占用，因此也就产生了缺陷。
-\parinterval 这里还要强调的是，技术缺陷是模型3和模型4是模型本身的缺陷造成的，如果有一个``更好''的模型就可以完全避免这个问题。而逻辑缺陷几乎是不能从模型上根本解决的，因为对于任意一种语言都不能枚举所有的句子（$\funp{P}({\textrm{ill}|\vectorn{t}})$实际上是得不到的）。
+\parinterval 这里还要强调的是，技术缺陷是由模型3和模型4本身的缺陷造成的，如果有一个``更好''的模型就可以完全避免这个问题。而逻辑缺陷几乎是不能从模型上根本解决的，因为对于任意一种语言都不能枚举所有的句子（$\funp{P}({\textrm{ill}|\vectorn{\emph{t}}})$实际上是得不到的）。
-\parinterval IBM的模型5已经解决了技术缺陷问题。但逻辑缺陷的解决很困难，因为即使对于人来说也很难判断一个句子是不是``良好''的句子。当然可以考虑用语言模型来缓解这个问题，不过由于在翻译的时候源语言句子都是定义``良好''的句子，$\funp{P}({\textrm{ill}|\vectorn{t}})$对$\funp{P}(\vectorn{s}| \vectorn{t})$的影响并不大。但用输入的源语言句子$\vectorn{s}$的``良好性''并不能解决技术缺陷，因为技术缺陷是模型的问题或者模型参数估计方法的问题。无论输入什么样的$\vectorn{s}$，模型3和模型4的技术缺陷问题都存在。
+\parinterval IBM的模型5已经解决了技术缺陷问题。但逻辑缺陷的解决很困难，因为即使对于人来说也很难判断一个句子是不是``良好''的句子。当然可以考虑用语言模型来缓解这个问题，不过由于在翻译的时候源语言句子都是定义``良好''的句子，$\funp{P}({\textrm{ill}|\vectorn{\emph{t}}})$对$\funp{P}(\vectorn{\emph{s}}| \vectorn{\emph{t}})$的影响并不大。但用输入的源语言句子$\vectorn{\emph{s}}$的``良好性''并不能解决技术缺陷，因为技术缺陷是模型的问题或者模型参数估计方法的问题。无论输入什么样的$\vectorn{\emph{s}}$，模型3和模型4的技术缺陷问题都存在。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -450,7 +450,7 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \subsection{句子长度}
-\parinterval 在IBM模型中，$\funp{P}(\vectorn{t})\funp{P}(\vectorn{s}| \vectorn{t})$会随着目标语言句子长度的增加而减少，因为这种模型有多个概率化的因素组成，乘积项越多结果的值越小。这也就是说，IBM模型会更倾向选择长度短一些的目标语言句子。显然这种对短句子的偏向性并不是机器翻译所期望的。
+\parinterval 在IBM模型中，$\funp{P}(\vectorn{\emph{t}})\funp{P}(\vectorn{\emph{s}}| \vectorn{\emph{t}})$会随着目标语言句子长度的增加而减少，因为这种模型由多个概率化的因素组成，乘积项越多结果的值越小。这也就是说，IBM模型会更倾向选择长度短一些的目标语言句子。显然这种对短句子的偏向性并不是机器翻译所期望的。
 \parinterval 这个问题在很多机器翻译系统中都存在。它实际上也反映了一种{\small\bfnew{系统偏置}}\index{系统偏置}（System Bias）\index{System Bias}。为了消除这种偏置，可以通过在模型中增加一个短句子惩罚因子来抵消掉模型对短句子的倾向性。比如，可以定义一个惩罚因子，它的值随着长度的减少而增加。不过，简单引入这样的惩罚因子会导致模型并不符合一个严格的噪声信道模型。它对应一个基于判别式框架的翻译模型，这部分内容会在{\chapterseven}进行介绍。
@@ -460,7 +460,7 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \subsection{其他问题}
-\parinterval 模型5的意义是什么？模型5的提出是为了消除模型3和模型4的缺陷。缺陷的本质是，$\funp{P}(\vectorn{s},\vectorn{a}| \vectorn{t})$在所有合理的对齐上概率和不为1。 但是，在这里更关心是哪个对齐$\vectorn{a}$使$\funp{P}(\vectorn{s},\vectorn{a}| \vectorn{t})$达到最大，即使$\funp{P}(\vectorn{s},\vectorn{a}|\vectorn{t})$不符合概率分布的定义，也并不影响我们寻找理想的对齐$\vectorn{a}$。从工程的角度说，$\funp{P}(\vectorn{s},\vectorn{a}| \vectorn{t})$不归一并不是一个十分严重的问题。遗憾的是，实际上到现在为止有太多对IBM模型3和模型4中的缺陷进行过系统的实验和分析，但对于这个问题到底有多严重并没有定论。当然用模型5是可以解决这个问题。但是如果用一个非常复杂的模型去解决了一个并不产生严重后果的问题，那这个模型也就没有太大意义了（从实践的角度）。
+\parinterval 模型5的意义是什么？模型5的提出是为了消除模型3和模型4的缺陷。缺陷的本质是，$\funp{P}(\vectorn{\emph{s}},\vectorn{\emph{a}}| \vectorn{\emph{t}})$在所有合理的对齐上概率和不为1。 但是，在这里更关心的是哪个对齐$\vectorn{\emph{a}}$使$\funp{P}(\vectorn{\emph{s}},\vectorn{\emph{a}}| \vectorn{\emph{t}})$达到最大，即使$\funp{P}(\vectorn{\emph{s}},\vectorn{\emph{a}}|\vectorn{\emph{t}})$不符合概率分布的定义，也并不影响我们寻找理想的对齐$\vectorn{\emph{a}}$。从工程的角度说，$\funp{P}(\vectorn{\emph{s}},\vectorn{\emph{a}}| \vectorn{\emph{t}})$不归一并不是一个十分严重的问题。遗憾的是，实际上到现在为止已有许多工作对IBM模型3和模型4中的缺陷进行过系统的实验和分析，但对于这个问题到底有多严重并没有定论。当然用模型5是可以解决这个问题的。但是如果用一个非常复杂的模型去解决了一个并不产生严重后果的问题，那这个模型也就没有太大意义了（从实践的角度）。
 \parinterval 概念（cept.）的意义是什么？经过前面的分析可知，IBM模型的词对齐模型使用了cept.这个概念。但是，在IBM模型中使用的cept.最多只能对应一个目标语言单词（模型并没有用到源语言cept. 的概念）。因此可以直接用单词代替cept.。这样，即使不引入cept.的概念，也并不影响IBM模型的建模。实际上，cept.的引入确实可以帮助我们从语法和语义的角度解释词对齐过程。不过，这个方法在IBM 模型中的效果究竟如何还没有定论。
@@ -478,7 +478,7 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \item 扭曲度是机器翻译中的一个经典概念。广义上来说，事物位置的变换都可以用扭曲度进行描述，比如，在物理成像系统中，扭曲度模型可以帮助进行镜头校正\upcite{1966Decentering,ClausF05}。在机器翻译中，扭曲度本质上在描述源语言和目标语言单词顺序的偏差。这种偏差可以用于对调序的建模。因此扭曲度的使用也可以被看做是一种对调序问题的描述，这也是机器翻译区别于语音识别等任务的主要因素之一。在早期的统计机器翻译系统中，如Pharaoh\upcite{DBLP:conf/amta/Koehn04}，大量使用了扭曲度这个概念。虽然，随着机器翻译的发展，更复杂的调序模型被提出\upcite{Gros2008MSD,xiong2006maximum,och2004alignment,DBLP:conf/naacl/KumarB05,li-etal-2014-neural,vaswani2017attention}，但是扭曲度所引发的对调序问题的思考是非常深刻的，这也是IBM模型最大的贡献之一。
 \vspace{0.5em}
-\item IBM模型的另一个贡献是在机器翻译中引入了繁衍率的概念。本质上，繁衍率是一种对翻译长度的建模。在IBM模型中，通过计算单词的繁衍率就可以得到整个句子的长度。需要注意的是，在机器翻译中译文长度对翻译性能有着至关重要的影响。虽然，在很多机器翻译模型中并没有直接使用繁衍率这个概念，但是几乎所有的现代机器翻译系统中都有译文长度的控制模块。比如，在统计机器翻译和神经机器翻译中，都把译文单词数量作为一个特征用于生成合理长度的译文\upcite{Koehn2007Moses,ChiangLMMRS05,bahdanau2014neural}。此外，在神经机器翻译中，非自回归的解码中也使用繁衍率模型对译文长度进行预测\ref{2018Non}。
+\item IBM模型的另一个贡献是在机器翻译中引入了繁衍率的概念。本质上，繁衍率是一种对翻译长度的建模。在IBM模型中，通过计算单词的繁衍率就可以得到整个句子的长度。需要注意的是，在机器翻译中译文长度对翻译性能有着至关重要的影响。虽然，在很多机器翻译模型中并没有直接使用繁衍率这个概念，但是几乎所有的现代机器翻译系统中都有译文长度的控制模块。比如，在统计机器翻译和神经机器翻译中，都把译文单词数量作为一个特征用于生成合理长度的译文\upcite{Koehn2007Moses,ChiangLMMRS05,bahdanau2014neural}。此外，在神经机器翻译中，非自回归的解码中也使用繁衍率模型对译文长度进行预测\upcite{Gu2017NonAutoregressiveNM}。
 \vspace{0.5em}
 \end{itemize}

--- a/Chapter7/chapter7.tex
+++ b/Chapter7/chapter7.tex
@@ -266,7 +266,7 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c
 \end{eqnarray}
-\parinterval 公式\ref{eq:7-3}中，$\funp{P}(d,\seq{t}|\seq{s})$表示翻译推导的概率。公式\ref{eq:7-3}把翻译问题转化为翻译推导的生成问题。但是，由于翻译推导的数量十分巨大\footnote[3]{如果把推导看作是一种树结构，推导的数量与词串的长度成指数关系。}，公式\ref{eq:7-3}的右端需要对所有可能的推导进行枚举并求和，这几乎是无法计算的。
+\parinterval 公式\eqref{eq:7-3}中，$\funp{P}(d,\seq{t}|\seq{s})$表示翻译推导的概率。公式\eqref{eq:7-3}把翻译问题转化为翻译推导的生成问题。但是，由于翻译推导的数量十分巨大\footnote[3]{如果把推导看作是一种树结构，推导的数量与词串的长度成指数关系。}，公式\eqref{eq:7-3}的右端需要对所有可能的推导进行枚举并求和，这几乎是无法计算的。
 \parinterval 对于这个问题，常用的解决办法是利用一个化简的模型来近似完整的模型。如果把翻译推导的全体看作一个空间$D$，可以从$D$中选取一部分样本参与计算，而不是对整个$D$进行计算。比如，可以用最好的$n$个翻译推导来代表整个空间$D$。令$D_{n\textrm{-best}}$表示最好的$n$个翻译推导所构成的空间，于是可以定义：
 \begin{eqnarray}
@@ -274,7 +274,7 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c
 \label{eq:7-4}
 \end{eqnarray}
-\parinterval 进一步，把公式\ref{eq:7-4}带入公式\ref{eq:7-2}，可以得到翻译的目标为：
+\parinterval 进一步，把公式\eqref{eq:7-4}代入公式\eqref{eq:7-2}，可以得到翻译的目标为：
 \begin{eqnarray}
 \hat{\seq{t}} = \arg\max_{\seq{t}} \sum_{d \in D_{n\textrm{-best}}} \funp{P}(d,\seq{t}|\seq{s})
 \label{eq:7-5}
@@ -292,7 +292,7 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c
 \label{eq:7-7}
 \end{eqnarray}
-\parinterval 值得注意的是，翻译推导中蕴含着译文的信息，因此每个翻译推导都与一个译文对应。因此可以把公式\ref{eq:7-7}所描述的问题重新定义为：
+\parinterval 值得注意的是，翻译推导中蕴含着译文的信息，每个翻译推导都与一个译文对应。因此可以把公式\eqref{eq:7-7}所描述的问题重新定义为：
 \begin{eqnarray}
 \hat{d} = \arg\max_{d} \funp{P}(d,\seq{t}|\seq{s})
 \label{eq:7-8}
@@ -304,7 +304,7 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c
 \label{eq:7-9}
 \end{eqnarray}
-\parinterval 注意，公式\ref{eq:7-8}-\ref{eq:7-9}和公式\ref{eq:7-7}本质上是一样的。它们也构成了统计机器翻译中最常用的方法\ \dash \ Viterbi方法\upcite{DBLP:journals/tit/Viterbi67}。在后面机器翻译的解码中还会看到它们的应用。而公式\ref{eq:7-5}也被称作$n$-best方法，常常作为Viterbi方法的一种改进。
+\parinterval 注意，公式\eqref{eq:7-8}-\eqref{eq:7-9}和公式\eqref{eq:7-7}本质上是一样的。它们也构成了统计机器翻译中最常用的方法\ \dash \ Viterbi方法\upcite{DBLP:journals/tit/Viterbi67}。在后面机器翻译的解码中还会看到它们的应用。而公式\eqref{eq:7-5}也被称作$n$-best方法，常常作为Viterbi方法的一种改进。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -325,14 +325,14 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c
 \label{eq:7-11}
 \end{eqnarray}
-\parinterval 公式\ref{eq:7-11}是一种典型的{\small\bfnew{对数线性模型}}\index{对数线性模型}（Log-linear Model）\index{Log-linear Model}。所谓“对数线性”体现在对多个量求和后进行指数运算（$\textrm{exp}(\cdot)$），这相当于对多个因素进行乘法。公式\ref{eqa4.10}的右端是一种归一化操作。分子部分可以被看作是一种对翻译推导$d$的对数线性建模。具体来说，对于每个$d$，用$M$个特征对其进行描述。每个特征用函数$h_i (d,\seq{t},\seq{s})$表示，它对应一个权重$\lambda_i$，表示特征$i$的重要性。$\sum_{i=1}^{M} \lambda_i \cdot h_i (d,\seq{t},\seq{s})$表示了对这些特征的线性加权和，值越大表示模型得分越高，相应的$d$和$\seq{t}$的质量越高。公式\ref{eqa4.10}的分母部分实际上不需要计算，因为其值与求解最佳推导的过程无关。把公式\ref{eqa4.10}带入公式\ref{eq:7-8}得到：
+\parinterval 公式\eqref{eq:7-11}是一种典型的{\small\bfnew{对数线性模型}}\index{对数线性模型}（Log-linear Model）\index{Log-linear Model}。所谓“对数线性”体现在对多个量求和后进行指数运算（$\textrm{exp}(\cdot)$），这相当于对多个因素进行乘法。公式\eqref{eqa4.10}的右端是一种归一化操作。分子部分可以被看作是一种对翻译推导$d$的对数线性建模。具体来说，对于每个$d$，用$M$个特征对其进行描述，每个特征用函数$h_i (d,\seq{t},\seq{s})$表示，它对应一个权重$\lambda_i$，表示特征$i$的重要性。$\sum_{i=1}^{M} \lambda_i \cdot h_i (d,\seq{t},\seq{s})$表示了对这些特征的线性加权和，值越大表示模型得分越高，相应的$d$和$\seq{t}$的质量越高。公式\eqref{eqa4.10}的分母部分实际上不需要计算，因为其值与求解最佳推导的过程无关。把公式\eqref{eqa4.10}代入公式\eqref{eq:7-8}得到：
 \begin{eqnarray}
 \hat{d} &=& \arg\max_{d} \frac{\textrm{exp}(\textrm{score}(d,\seq{t},\seq{s}))}{\sum_{d',\seq{t}'} \textrm{exp}(\textrm{score}(d',\seq{t}',\seq{s}))} \nonumber \\
 &=& \arg\max_{d}\ \textrm{exp}(\textrm{score}(d,\seq{t},\seq{s}))
 \label{eq:7-12}
 \end{eqnarray}
-\parinterval 公式\ref{eq:7-12}中，$\ \textrm{exp}(\textrm{score}(d,\seq{t},\seq{s}))$表示指数化的模型得分，记为$\textrm{mscore}(d,\seq{t},\seq{s}) = \textrm{exp}(\textrm{score}(d,\seq{t},\seq{s}))$。于是，翻译问题就可以被描述为：找到使函数$\textrm{mscore}(d,\seq{t},\seq{s})$达到最大的$d$。由于，$\textrm{exp}(\textrm{score}(d,\seq{t},\seq{s}))$和$\textrm{score}(d,\seq{t},\seq{s})$是单调一致的，因此有时也直接把$\textrm{score}(d,\seq{t},\seq{s})$当做模型得分。
+\parinterval 公式\eqref{eq:7-12}中，$\ \textrm{exp}(\textrm{score}(d,\seq{t},\seq{s}))$表示指数化的模型得分，记为$\textrm{mscore}(d,\seq{t},\seq{s}) = \textrm{exp}(\textrm{score}(d,\seq{t},\seq{s}))$。于是，翻译问题就可以被描述为：找到使函数$\textrm{mscore}(d,\seq{t},\seq{s})$达到最大的$d$。由于$\textrm{exp}(\textrm{score}(d,\seq{t},\seq{s}))$和$\textrm{score}(d,\seq{t},\seq{s})$是单调一致的，因此有时也直接把$\textrm{score}(d,\seq{t},\seq{s})$当做模型得分。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
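上文指出，对数线性模型的归一化分母与求解最佳推导无关，而且$\textrm{exp}(\cdot)$是单调的，因此可以直接用线性得分进行比较。下面用一个小例子来验证这一点。这只是一个在若干假设下的草图：候选推导的名字、特征值和权重都是虚构的，候选集合也被假设为一个很小的有限集合，真实系统中的候选空间要大得多。

```python
import math


def score(features, weights):
    """Linear model score: sum_i lambda_i * h_i."""
    return sum(w * h for w, h in zip(weights, features))


def posterior(candidates, weights):
    """Normalised log-linear probabilities over a finite candidate set."""
    exp_scores = {name: math.exp(score(feats, weights))
                  for name, feats in candidates.items()}
    normaliser = sum(exp_scores.values())
    return {name: value / normaliser for name, value in exp_scores.items()}


def best_by_score(candidates, weights):
    """Pick the candidate with the highest unnormalised linear score."""
    return max(candidates, key=lambda name: score(candidates[name], weights))


# Hypothetical derivations, each described by three made-up feature values.
candidates = {
    "d1": [-1.2, -0.5, 3.0],
    "d2": [-0.8, -0.9, 4.0],
    "d3": [-2.0, -0.1, 2.0],
}
weights = [1.0, 0.5, 0.1]

probabilities = posterior(candidates, weights)
best_normalised = max(probabilities, key=probabilities.get)
best_linear = best_by_score(candidates, weights)
```

在这个例子中，按归一化概率选出的推导与按线性得分选出的推导是同一个，这也是解码时可以省略分母计算的原因。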
@@ -464,7 +464,7 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c
 \end{figure}
 %-------------------------------------------
-\parinterval 除此之外，一些外部工具也可以用来获取词对齐，如Fastalign\upcite{dyer2013a}、Berkeley Word Aligner\upcite{taskar2005a}等。词对齐的质量通常使用词对齐错误率（AER）来评价\upcite{DBLP:conf/coling/OchN00}，但是词对齐并不是一个独立的系统，它一般会服务于其他任务。因此，也可以使用下游任务来评价词对齐的好坏。比如，改进词对齐后观察机器翻译系统性能的变化。
+\parinterval 除此之外，一些外部工具也可以用来获取词对齐，如Fastalign\upcite{DBLP:conf/naacl/DyerCS13}、Berkeley Word Aligner\upcite{taskar2005a}等。词对齐的质量通常使用词对齐错误率（AER）来评价\upcite{DBLP:conf/coling/OchN00}，但是词对齐并不是一个独立的系统，它一般会服务于其他任务。因此，也可以使用下游任务来评价词对齐的好坏。比如，改进词对齐后观察机器翻译系统性能的变化。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -583,7 +583,7 @@ dr = start_i-end_{i-1}-1
 \label{eq:7-16}
 \end{eqnarray}
-\noindent 其中，$o_i$表示（目标语言）第$i$个短语的调序方向，$\mathbf{o}=\{o_i\}$表示短语序列的调序方向，$K$表示短语的数量。短语之间的调序概率是由双语短语以及短语对齐决定的，$o$表示调序的种类，可以取M、S、D 中的任意一种。而整个句子调序的好坏就是把相邻的短语之间的调序概率相乘（对应取log后的加法）。这样，公式\ref{eq:7-16}把调序的好坏定义为新的特征，对于M、S、D总共就有三个特征。除了当前短语和前一个短语的调序特征，还可以定义当前短语和后一个短语的调序特征，即将上述公式中的$a_{i-1}$换成$a_{i+1}$。 于是，又可以得到三个特征。因此在MSD调序中总共可以有6个特征。
+\noindent 其中，$o_i$表示（目标语言）第$i$个短语的调序方向，$\mathbf{o}=\{o_i\}$表示短语序列的调序方向，$K$表示短语的数量。短语之间的调序概率是由双语短语以及短语对齐决定的，$o$表示调序的种类，可以取M、S、D 中的任意一种。而整个句子调序的好坏就是把相邻的短语之间的调序概率相乘（对应取log后的加法）。这样，公式\eqref{eq:7-16}把调序的好坏定义为新的特征，对于M、S、D总共就有三个特征。除了当前短语和前一个短语的调序特征，还可以定义当前短语和后一个短语的调序特征，即将上述公式中的$a_{i-1}$换成$a_{i+1}$。 于是，又可以得到三个特征。因此在MSD调序中总共可以有6个特征。
 \parinterval 具体实现时，通常使用词对齐对两个短语间的调序关系进行判断。图\ref{fig:7-22}展示了这个过程。先判断短语的左上角和右上角是否存在词对齐，再根据其位置对调序类型进行划分。每个短语对应的调序概率都可以用相对频次估计进行计算。而MSD调序模型也相当于在短语表中的每个双语短语后添加6个特征。不过，调序模型一般并不会和短语表一起存储，因此在系统中通常会看到两个独立的模型文件，分别保存短语表和调序模型。
@@ -651,7 +651,7 @@ dr = start_i-end_{i-1}-1
 \parinterval 想要得到最优的特征权重，最简单的方法是枚举所有的特征权重可能的取值，然后评价每组权重所对应的翻译性能，最后选择最优的特征权重作为调优的结果。但是特征权重是一个实数值，因此可以考虑把实数权重进行量化，即把权重看作是在固定间隔上的取值，比如，每隔0.01取值。即使是这样，同时枚举多个特征的权重也是非常耗时的工作，当特征数量增多时这种方法的效率仍然很低。
-\parinterval 这里介绍一种更加高效的特征权重调优方法$\ \dash \ ${\small\bfnew{最小错误率训练}}\index{最小错误率训练}（Minimum Error Rate Training\index{Minimum Error Rate Training}，MERT）。最小错误率训练是统计机器翻译发展中代表性工作，也是机器翻译领域原创的重要技术方法之一\upcite{och2003minimum}。最小错误率训练假设：翻译结果相对于标准答案的错误是可度量的，进而可以通过降低错误数量的方式来找到最优的特征权重。假设有样本集合$S = \{(s_1,\seq{r}_1),...,(s_N,\seq{r}_N)\}$，$s_i$为样本中第$i$个源语言句子，$\seq{r}_i$为相应的参考译文。注意，$\seq{r}_i$ 可以包含多个参考译文。$S$通常被称为{\small\bfnew{调优集合}}\index{调优集合}（Tuning Set）\index{Tuning Set}。对于$S$中的每个源语句子$s_i$，机器翻译模型会解码出$n$-best推导$\hat{\seq{d}}_{i} = \{\hat{d}_{ij}\}$，其中$\hat{d}_{ij}$表示对于源语言句子$s_i$得到的第$j$个最好的推导。$\{\hat{d}_{ij}\}$可以被定义如下：
+\parinterval 这里介绍一种更加高效的特征权重调优方法$\ \dash \ ${\small\bfnew{最小错误率训练}}\index{最小错误率训练}（Minimum Error Rate Training\index{Minimum Error Rate Training}，MERT）。最小错误率训练是统计机器翻译发展中代表性工作，也是机器翻译领域原创的重要技术方法之一\upcite{DBLP:conf/acl/Och03}。最小错误率训练假设：翻译结果相对于标准答案的错误是可度量的，进而可以通过降低错误数量的方式来找到最优的特征权重。假设有样本集合$S = \{(s_1,\seq{r}_1),...,(s_N,\seq{r}_N)\}$，$s_i$为样本中第$i$个源语言句子，$\seq{r}_i$为相应的参考译文。注意，$\seq{r}_i$ 可以包含多个参考译文。$S$通常被称为{\small\bfnew{调优集合}}\index{调优集合}（Tuning Set）\index{Tuning Set}。对于$S$中的每个源语句子$s_i$，机器翻译模型会解码出$n$-best推导$\hat{\seq{d}}_{i} = \{\hat{d}_{ij}\}$，其中$\hat{d}_{ij}$表示对于源语言句子$s_i$得到的第$j$个最好的推导。$\{\hat{d}_{ij}\}$可以被定义如下：
 \begin{eqnarray}
 \{\hat{d}_{ij}\} = \arg\max_{\{d_{ij}\}} \sum_{i=1}^{M} \lambda_i \cdot h_i (d,\seq{t},\seq{s})
@@ -665,7 +665,7 @@ dr = start_i-end_{i-1}-1
 \end{eqnarray}
 %公式--------------------------------------------------------------------
-\noindent 其中，\textrm{Error}$(\cdot)$是错误率函数。\textrm{Error}$(\cdot)$的定义方式有很多，一般来说\textrm{Error}$(\cdot)$会与机器翻译的评价指标相关，例如，词错误率(WER)、位置错误率(PER)、BLEU 值、NIST值等都可以用于\textrm{Error}$(\cdot)$的定义。这里使用$1-$BLEU作为错误率函数，即$\textrm{Error}(\hat{\seq{D}},\seq{R}) = 1 - \textrm{BLEU}(\hat{\seq{D}},\seq{R})$。则公式\ref{eq:7-18}可改写为：
+\noindent 其中，\textrm{Error}$(\cdot)$是错误率函数。\textrm{Error}$(\cdot)$的定义方式有很多，一般来说\textrm{Error}$(\cdot)$会与机器翻译的评价指标相关，例如，词错误率(WER)、位置错误率(PER)、BLEU 值、NIST值等都可以用于\textrm{Error}$(\cdot)$的定义。这里使用$1-$BLEU作为错误率函数，即$\textrm{Error}(\hat{\seq{D}},\seq{R}) = 1 - \textrm{BLEU}(\hat{\seq{D}},\seq{R})$。则公式\eqref{eq:7-18}可改写为：
 %公式--------------------------------------------------------------------
 \begin{eqnarray}
 \hat{\lambda} &=& \arg\min_{\lambda}\ (1 - \textrm{BLEU}(\hat{\seq{D}},\seq{R}))   \nonumber \\
@@ -674,7 +674,7 @@ dr = start_i-end_{i-1}-1
 \end{eqnarray}
 %公式--------------------------------------------------------------------
-\parinterval 需要注意的是， BLEU本身是一个不可微分函数。因此，无法使用梯度下降等方法对式\ref{eq:7-19}进行求解。那么如何能快速得到最优解？这里会使用一种特殊的优化方法，称作{\small\bfnew{线搜索}}\index{线搜索}（Line Search）\index{Line Search}，它是Powell搜索的一种形式\upcite{powell1964an}。这种方法也构成了最小错误率训练的核心。
+\parinterval 需要注意的是， BLEU本身是一个不可微分函数。因此，无法使用梯度下降等方法对式\eqref{eq:7-19}进行求解。那么如何能快速得到最优解？这里会使用一种特殊的优化方法，称作{\small\bfnew{线搜索}}\index{线搜索}（Line Search）\index{Line Search}，它是Powell搜索的一种形式\upcite{powell1964an}。这种方法也构成了最小错误率训练的核心。
 \parinterval 首先，重新看一下特征权重的搜索空间。按照前面的介绍，如果要进行暴力搜索，需要把特征权重的取值按小的间隔进行划分。这样，所有特征权重的取值可以用图\ref{fig:7-23}的网格来表示。
@@ -687,11 +687,11 @@ dr = start_i-end_{i-1}-1
 \end{figure}
 %-------------------------------------------
-\parinterval 其中横坐标为所有的$M$个特征函数，纵坐标为权重可能的取值。假设每个特征都有$V$种取值，那么遍历所有特征权重取值的组合有$M^V$种。每组$\lambda = \{\lambda_i\}$的取值实际上就是一个贯穿所有特征权重的折线，如图\ref{fig:7-23}中间红线所展示的路径。当然，可以通过枚举得到很多这样的折线（图\ref{fig:7-23}右）。假设计算BLEU的时间开销为$B$，那么遍历所有的路径的时间复杂度为$O(M^V \cdot B)$，由于$V$可能很大，而且$B$往往也无法忽略，因此这种计算方式的时间成本是极高的。如果考虑对每一组特征权重都需要重新解码得到$n$-best译文，那么基于这种简单枚举的方法是无法使用的。
+\parinterval 其中横坐标为所有的$M$个特征函数，纵坐标为权重可能的取值。假设每个特征都有$V$种取值，那么遍历所有特征权重取值的组合有$V^M$种。每组$\lambda = \{\lambda_i\}$的取值实际上就是一个贯穿所有特征权重的折线，如图\ref{fig:7-23}中间蓝线所展示的路径。当然，可以通过枚举得到很多这样的折线（图\ref{fig:7-23}右）。假设计算BLEU的时间开销为$B$，那么遍历所有的路径的时间复杂度为$O(V^M \cdot B)$，由于$V$可能很大，而且$B$往往也无法忽略，因此这种计算方式的时间成本是极高的。如果考虑对每一组特征权重都需要重新解码得到$n$-best译文，那么基于这种简单枚举的方法是无法使用的。
 \parinterval 对全搜索的一种改进是使用局部搜索。循环处理每个特征，每一次只调整一个特征权重的值，找到使BLEU达到最大的权重。反复执行该过程，直到模型达到稳定状态（例如BLEU不再提升）。
-\parinterval 图\ref{fig:7-24}左侧展示了这种方法。其中红色部分为固定住的权重，相应的虚线部分为当前权重所有可能的取值，这样搜索一个特征权重的时间复杂度为$O(V \cdot B)$。而整个算法的时间复杂度为$O(L \cdot V \cdot B)$，其中$L$为循环访问特征的总次数。这种方法也被称作{\small\bfnew{格搜索}}\index{格搜索}（Grid Search）\index{Grid Search}。
+\parinterval 图\ref{fig:7-24}左侧展示了这种方法。其中蓝色部分为固定住的权重，相应的虚线部分为当前权重所有可能的取值，这样搜索一个特征权重的时间复杂度为$O(V \cdot B)$。而整个算法的时间复杂度为$O(L \cdot V \cdot B)$，其中$L$为循环访问特征的总次数。这种方法也被称作{\small\bfnew{格搜索}}\index{格搜索}（Grid Search）\index{Grid Search}。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -912,7 +912,7 @@ dr = start_i-end_{i-1}-1
 \vspace{0.5em}
 \item 统计机器翻译中使用的栈解码方法源自Tillmann等人的工作\upcite{tillmann1997a}。这种方法在Pharaoh\upcite{DBLP:conf/amta/Koehn04}、Moses\upcite{Koehn2007Moses}等开源系统中被成功地应用，在机器翻译领域产生了很大的影响力。特别是，这种解码方法效率很高，因此在许多工业系统里也大量使用。对于栈解码也有很多改进工作，比如，早期的工作考虑剪枝或者限制调序范围以加快解码速度\upcite{DBLP:conf/acl/WangW97,DBLP:conf/coling/TillmannN00,DBLP:conf/iwslt/ShenDA06a,robert2007faster}。随后，也有研究工作从解码算法和语言模型集成方式的角度对这类方法进行改进\upcite{DBLP:conf/acl/HeafieldKM14,DBLP:conf/acl/WuebkerNZ12,DBLP:conf/iwslt/ZensN08}。
 \vspace{0.5em}
-\item 统计机器翻译的成功很大程度上来自判别式模型引入任意特征的能力。因此，在统计机器翻译时代，很多工作都集中在新特征的设计上。比如，可以基于不同的统计特征和先验知识设计翻译特征\upcite{och2004smorgasbord,Chiang200911,gildea2003loosely}，也可以模仿分类任务设计大规模的稀疏特征\upcite{chiang2008online}。另一方面，模型训练和特征权重调优也是统计机器翻译中的重要问题，除了最小错误率训练，还有很多方法，比如，最大似然估计\upcite{koehn2003statistical,DBLP:journals/coling/BrownPPM94}、判别式方法\upcite{Blunsom2008A}、贝叶斯方法\upcite{Blunsom2009A,Cohn2009A}、最小风险训练\upcite{smith2006minimum,li2009first}、基于Margin的方法\upcite{watanabe2007online,Chiang200911}以及基于排序模型的方法（PRO）\upcite{Hopkins2011Tuning,dreyer2015apro}。实际上，统计机器翻译的训练和解码也存在不一致的问题，比如，特征值由双语数据上的极大似然估计得到（没有剪枝），而解码时却使用束剪枝，而且模型的目标是最大化机器翻译评价指标。对于这个问题也可以通过调整训练的目标函数进行缓解\upcite{XiaoA,marcu2006practical}。
+\item 统计机器翻译的成功很大程度上来自判别式模型引入任意特征的能力。因此，在统计机器翻译时代，很多工作都集中在新特征的设计上。比如，可以基于不同的统计特征和先验知识设计翻译特征\upcite{och2004smorgasbord,Chiang200911,gildea2003loosely}，也可以模仿分类任务设计大规模的稀疏特征\upcite{DBLP:conf/emnlp/ChiangMR08}。另一方面，模型训练和特征权重调优也是统计机器翻译中的重要问题，除了最小错误率训练，还有很多方法，比如，最大似然估计\upcite{koehn2003statistical,DBLP:journals/coling/BrownPPM94}、判别式方法\upcite{Blunsom2008A}、贝叶斯方法\upcite{Blunsom2009A,Cohn2009A}、最小风险训练\upcite{smith2006minimum,li2009first}、基于Margin的方法\upcite{watanabe2007online,Chiang200911}以及基于排序模型的方法（PRO）\upcite{Hopkins2011Tuning,dreyer2015apro}。实际上，统计机器翻译的训练和解码也存在不一致的问题，比如，特征值由双语数据上的极大似然估计得到（没有剪枝），而解码时却使用束剪枝，而且模型的目标是最大化机器翻译评价指标。对于这个问题也可以通过调整训练的目标函数进行缓解\upcite{XiaoA,marcu2006practical}。
 \vspace{0.5em}
 \item 短语表是基于短语的系统中的重要模块。但是，简单地利用基于频次的方法估计得到的翻译概率无法很好地处理低频短语。这时就需要对短语表进行平滑\upcite{DBLP:conf/iwslt/ZensN08,DBLP:conf/emnlp/SchwenkCF07,boxing2011unpacking,DBLP:conf/coling/DuanSZ10}。另一方面，随着数据量的增长和抽取短语长度的增大，短语表的体积会急剧膨胀，这也大大增加了系统的存储消耗，同时过大的短语表也会带来短语查询效率的下降。针对这个问题，很多工作尝试对短语表进行压缩。一种思路是限制短语的长度\upcite{DBLP:conf/naacl/QuirkM06,DBLP:journals/coling/MarinoBCGLFC06}；另一种广泛使用的思路是使用一些指标或者分类器来对短语进行剪枝，其核心思想是判断每个短语的质量\upcite{DBLP:conf/emnlp/ZensSX12}，并过滤掉低质量的短语。代表性的方法有：基于假设检验的剪枝\upcite{DBLP:conf/emnlp/JohnsonMFK07}、基于熵的剪枝\upcite{DBLP:conf/emnlp/LingGTB12}、两阶段短语抽取方法\upcite{DBLP:conf/naacl/ZettlemoyerM07}、基于解码中短语使用频率的方法\upcite{DBLP:conf/naacl/EckVW07}等。此外，短语表的存储方式也是在实际使用中需要考虑的问题。因此，也有研究者尝试使用更加紧凑、高效的结构保存短语表。其中最具代表性的结构是后缀数组（Suffix Arrays），这种结构可以充分利用短语之间有重叠的性质，大幅减少了重复存储\upcite{DBLP:conf/acl/Callison-BurchBS05,DBLP:conf/naacl/ZensN07,2014Dynamic}。
 \vspace{0.5em}

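第7章上文在介绍特征权重调优时提到，可以每次只调整一个特征权重、其余权重固定，并反复循环直到目标不再提升，其时间复杂度为$O(L \cdot V \cdot B)$。下面给出这种逐坐标搜索的一个最小草图。需要强调的是，这并不是最小错误率训练中线搜索的实现：这里的\texttt{objective}只是一个代替BLEU的玩具函数，函数名、取值网格和轮数上限都是为了说明而假设的。

```python
def coordinate_grid_search(objective, init_weights, grid, max_rounds=10):
    """Tune one weight at a time over a fixed grid, keeping the others fixed.

    objective: callable mapping a weight list to a value to be maximised
               (a stand-in for BLEU on a tuning set).
    grid:      candidate values tried for every weight.
    """
    weights = list(init_weights)
    best = objective(weights)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(weights)):
            for value in grid:
                trial = weights[:i] + [value] + weights[i + 1:]
                trial_score = objective(trial)
                if trial_score > best:
                    best, weights, improved = trial_score, trial, True
        if not improved:
            break  # stable state: no single-weight change helps any more
    return weights, best


def toy_objective(weights):
    """Made-up smooth objective with its maximum at (0.3, -0.2)."""
    return -((weights[0] - 0.3) ** 2 + (weights[1] + 0.2) ** 2)


grid = [i / 10 for i in range(-5, 6)]  # -0.5, -0.4, ..., 0.5
tuned_weights, tuned_score = coordinate_grid_search(toy_objective, [0.0, 0.0], grid)
```

每一轮的代价与“特征个数乘以网格大小”成正比，而不是随特征个数指数增长。代价是它只能保证收敛到逐坐标意义下的稳定点；对于BLEU这类非凸、不可微的目标，结果可能依赖于初始权重。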
--- a/Chapter8/Figures/figure-classification-of-models-based-on-syntax.tex
+++ b/Chapter8/Figures/figure-classification-of-models-based-on-syntax.tex
@@ -7,18 +7,20 @@
 \tikzstyle{cnode} = [minimum width=7.0em,minimum height=2.5em,rounded corners=0.2em];
 \tikzstyle{xnode} = [minimum width=4.5em,minimum height=2.5em,rounded corners=0.2em];
-\node[xnode,anchor=west,fill=red!25,align=left] (itg) at (0,0) {\footnotesize{反向转录}\\\footnotesize{文法}};
+\node[cnode,anchor=south,minimum width=10.0em,fill=green!25,align=center] (cat0) at (0,0) {\footnotesize{（广义上）}\\\footnotesize{基于句法的模型}};
-\node[xnode,anchor=west,fill=red!25,align=left] (hiero) at ([xshift=0.5em]itg.east) {\footnotesize{层次短语}\\\footnotesize{模型}};
+\node[cnode,anchor=north,fill=red!25,align=left] (cat1) at ([xshift=-6.5em,yshift=-2em]cat0.south) {\footnotesize{基于形式文法}\\\footnotesize{的模型}};
-\node[xnode,anchor=west,fill=blue!25,align=left] (s2t) at ([xshift=0.5em]hiero.east) {\footnotesize{串到树}\\\footnotesize{模型}};
+\node[cnode,anchor=north,fill=blue!25,align=left] (cat2) at ([xshift=6.5em,yshift=-2em]cat0.south) {\footnotesize{基于语言学}\\\footnotesize{句法的模型}};
-\node[xnode,anchor=west,fill=blue!25,align=left] (t2s) at ([xshift=0.5em]s2t.east) {\footnotesize{树到串}\\\footnotesize{模型}};
-\node[xnode,anchor=west,fill=blue!25,align=left] (t2t) at ([xshift=0.5em]t2s.east) {\footnotesize{树到树}\\\footnotesize{模型}};
+\node[xnode,anchor=north,fill=red!25,align=left] (itg) at ([xshift=-2.5em,yshift=-2.0em]cat1.south) {\footnotesize{反向转录}\\\footnotesize{文法}};
-\node[cnode,anchor=south,fill=red!25,align=left] (cat1) at ([xshift=-0.2em,yshift=2em]hiero.north west) {\footnotesize{基于形式文法}\\\footnotesize{的模型}};
+\node[xnode,anchor=north,fill=red!25,align=left] (hiero) at ([xshift=2.5em,yshift=-2.0em]cat1.south) {\footnotesize{层次短语}\\\footnotesize{模型}};
-\node[cnode,anchor=south,fill=blue!25,align=left] (cat2) at ([xshift=-0.0em,yshift=2em]t2s.north) {\footnotesize{基于语言学}\\\footnotesize{句法的模型}};
+\node[xnode,anchor=north,fill=blue!25,align=left] (s2t) at ([xshift=-5.0em,yshift=-2.0em]cat2.south) {\footnotesize{串到树}\\\footnotesize{模型}};
-\node[cnode,anchor=south,minimum width=10.0em,fill=green!25,align=center] (cat0) at ([xshift=-3em,yshift=2em]cat2.north west) {\footnotesize{(广义上)}\\\footnotesize{基于句法的模型}};
+\node[xnode,anchor=north,fill=blue!25,align=left] (t2s) at ([xshift=0.0em,yshift=-2.0em]cat2.south) {\footnotesize{树到串}\\\footnotesize{模型}};
+\node[xnode,anchor=north,fill=blue!25,align=left] (t2t) at ([xshift=5.0em,yshift=-2.0em]cat2.south) {\footnotesize{树到树}\\\footnotesize{模型}};
-\draw [-,thick] ([yshift=0.1em,xshift=1em]cat1.north) -- ([xshift=-1.5em,yshift=-0.1em]cat0.south);
-\draw [-,thick] ([yshift=0.1em,xshift=-1em]cat2.north) -- ([xshift=1.5em,yshift=-0.1em]cat0.south);
+\draw [-,thick] ([yshift=0.1em,xshift=1em]cat0.south) -- ([xshift=-1.5em,yshift=-0.1em]cat2.north);
+\draw [-,thick] ([yshift=0.1em,xshift=-1em]cat0.south) -- ([xshift=1.5em,yshift=-0.1em]cat1.north);
 \draw [-,thick] ([yshift=0.1em]itg.north) -- ([xshift=-0.5em,yshift=-0.1em]cat1.south);
 \draw [-,thick] ([yshift=0.1em]hiero.north) -- ([xshift=0.5em,yshift=-0.1em]cat1.south);
 \draw [-,thick] ([yshift=0.1em]s2t.north) -- ([xshift=-0.8em,yshift=-0.1em]cat2.south);

--- a/Chapter8/Figures/figure-one-best-node-alignment-and-alignment-matrix.tex
+++ b/Chapter8/Figures/figure-one-best-node-alignment-and-alignment-matrix.tex
@@ -105,7 +105,7 @@
 \end{flushright}
 \begin{center}
 \vspace{-1em}
-\footnotesize{(a)节点对齐矩阵（1-best vs. Matrix）}
+\footnotesize{(a)节点对齐矩阵（1-best vs Matrix）}
 \end{center}
 \begin{center}
@@ -120,7 +120,7 @@
 \footnotesize{$r_6$} & \footnotesize{AS(了) $\rightarrow$ VBZ(have)} \\
 \footnotesize{$r_8$} & \footnotesize{VP(AD$_1$ VP(VV$_2$ AS$_3$)) $\rightarrow$} \\
                     & \footnotesize{VP(VBZ$_3$ ADVP(RB$_1$ VBN$_2$))} \\
-\rule{0pt}{11pt} \\
+\rule{0pt}{9.5pt} \\
 \\
 \\
 \end{tabular}

--- a/Chapter8/chapter8.tex
+++ b/Chapter8/chapter8.tex
@@ -190,7 +190,7 @@
 \subsubsection{1. 文法定义}
-\parinterval 层次短语模型中一个重要的概念是{\small\bfnew{同步上下文无关文法}}\index{同步上下文无关文法}（Synchronous Context-free Grammar\index{Synchronous Context-free Grammar}，简称SCFG）。SCFG可以被看作是对源语言和目标语言上下文无关文法的融合，它要求源语言和目标语言的产生式及产生式中的变量具有对应关系。具体定义如下：
+\parinterval 层次短语模型中一个重要的概念是{\small\bfnew{同步上下文无关文法}}\index{同步上下文无关文法}（Synchronous Context-free Grammar\index{Synchronous Context-free Grammar}，SCFG）。SCFG可以被看作是对源语言和目标语言上下文无关文法的融合，它要求源语言和目标语言的产生式及产生式中的变量具有对应关系。具体定义如下：
 %-------------------------------------------
 \vspace{0.5em}
@@ -201,7 +201,7 @@
 \begin{enumerate}
 \item $N$是非终结符集合。
 \item $T_s$和$T_t$分别是源语言和目标语言的终结符集合。
-\item $I \subseteq N$起始非终结符集合。
+\item $I \subseteq N$是起始非终结符集合。
 \item $R$是规则集合，每条规则$r \in R$有如下形式：
 \end{enumerate}
 \vspace{0.3em}
@@ -319,16 +319,22 @@ d = {r_1} \circ {r_2} \circ {r_3} \circ {r_4}
 \subsection{层次短语规则抽取}
-\parinterval 层次短语系统所使用的文法包括两部分：1）不含变量的层次短语规则（短语翻译）；2）含有变量的层次短语规则。短语翻译的抽取直接复用基于短语的系统即可。此处重点讨论如何抽取含有变量的层次短语规则。
+\parinterval 层次短语系统所使用的文法包括两部分：
+\begin{itemize}
-\parinterval 在{\chapterseven}短语抽取一节已经介绍了短语与词对齐相兼容的概念。这里，所有层次短语规则也是与词对齐相兼容（一致）的。
+\vspace{0.5em}
+\item 不含变量的层次短语规则（短语翻译）；
+\vspace{0.5em}
+\item 含有变量的层次短语规则。
+\vspace{0.5em}
+\end{itemize}
+\parinterval 短语翻译的抽取直接复用基于短语的系统即可。此处重点讨论如何抽取含有变量的层次短语规则。在{\chapterseven}短语抽取一节已经介绍了短语与词对齐相兼容的概念。这里，所有层次短语规则也是与词对齐相兼容（一致）的。
 %-------------------------------------------
 \vspace{0.5em}
 \begin{definition} 与词对齐相兼容的层次短语规则
 {\small
-对于句对$(\vectorn{s},\vectorn{t})$和它们之间的词对齐$\vectorn{a}$，令$\Phi$表示在句对$(\vectorn{s},\vectorn{t})$上与$\vectorn{a}$相兼容的双语短语集合。则：
+对于句对$(\vectorn{\emph{s}},\vectorn{\emph{t}})$和它们之间的词对齐$\vectorn{\emph{a}}$，令$\Phi$表示在句对$(\vectorn{\emph{s}},\vectorn{\emph{t}})$上与$\vectorn{\emph{a}}$相兼容的双语短语集合。则：
 \begin{enumerate}
 \item 	如果$(x,y)\in \Phi$，则$\textrm{X} \to \langle x,y,\phi \rangle$是与词对齐相兼容的层次短语规则。
 \item 	对于$(x,y)\in \Phi$，存在$m$个双语短语$(x_i,y_j)\in \Phi$，同时存在(1,$...$,$m$)上面的一个排序$\sim = \{\pi_1 , ... ,\pi_m\}$，且：
@@ -376,7 +382,7 @@ y&=&\beta_0 y_{\pi_1} \beta_1 y_{\pi_2} ... \beta_{m-1} y_{\pi_m} \beta_m
 \subsection{翻译特征}
-\parinterval 在层次短语模型中，每个翻译推导都有一个模型得分$\textrm{score}(d,\vectorn{s},\vectorn{t})$。$\textrm{score}(d,\vectorn{s},\vectorn{t})$是若干特征的线性加权之和：$\textrm{score}(d,\vectorn{t},\vectorn{s})=\sum_{i=1}^M\lambda_i\cdot h_i (d,\vectorn{t},\vectorn{s})$，其中$\lambda_i$是特征权重，$h_i (d,\vectorn{t},\vectorn{s})$是特征函数。层次短语模型的特征包括与规则相关的特征和语言模型特征，如下：
+\parinterval 在层次短语模型中，每个翻译推导都有一个模型得分$\textrm{score}(d,\vectorn{\emph{t}},\vectorn{\emph{s}})$。$\textrm{score}(d,\vectorn{\emph{t}},\vectorn{\emph{s}})$是若干特征的线性加权之和：$\textrm{score}(d,\vectorn{\emph{t}},\vectorn{\emph{s}})=\sum_{i=1}^M\lambda_i\cdot h_i (d,\vectorn{\emph{t}},\vectorn{\emph{s}})$，其中$\lambda_i$是特征权重，$h_i (d,\vectorn{\emph{t}},\vectorn{\emph{s}})$是特征函数。层次短语模型的特征包括与规则相关的特征和语言模型特征，如下：
 \parinterval 对于每一条翻译规则LHS$\to \langle \alpha, \beta ,\sim \rangle$，有：
@@ -396,19 +402,19 @@ y&=&\beta_0 y_{\pi_1} \beta_1 y_{\pi_2} ... \beta_{m-1} y_{\pi_m} \beta_m
 \parinterval 这些特征可以被具体描述为：
 \begin{eqnarray}
-h_i (d,\vectorn{t},\vectorn{s})=\sum_{r \in d}h_i (r)
+h_i (d,\vectorn{\emph{t}},\vectorn{\emph{s}})=\sum_{r \in d}h_i (r)
 \label{eq:8-4}
 \end{eqnarray}
-\parinterval 公式\ref{eq:8-4}中，$r$表示推导$d$中的一条规则，$h_i (r)$表示规则$r$上的第$i$个特征。可以看出，推导$d$的特征值就是所有包含在$d$中规则的特征值的和。进一步，可以定义
+\parinterval 公式\eqref{eq:8-4}中，$r$表示推导$d$中的一条规则，$h_i (r)$表示规则$r$上的第$i$个特征。可以看出，推导$d$的特征值就是所有包含在$d$中规则的特征值的和。进一步，可以定义
 \begin{eqnarray}
-\textrm{rscore}(d,\vectorn{t},\vectorn{s})=\sum_{i=1}^7 \lambda_i \cdot h_i (d,\vectorn{t},\vectorn{s})
+\textrm{rscore}(d,\vectorn{\emph{t}},\vectorn{\emph{s}})=\sum_{i=1}^7 \lambda_i \cdot h_i (d,\vectorn{\emph{t}},\vectorn{\emph{s}})
 \label{eq:8-5}
 \end{eqnarray}
 \parinterval 最终，模型得分被定义为：
 \begin{eqnarray}
-\textrm{score}(d,\vectorn{t},\vectorn{s})=\textrm{rscore}(d,\vectorn{t},\vectorn{s})+ \lambda_8 \textrm{log}⁡(\textrm{P}_{\textrm{lm}}(\vectorn{t}))+\lambda_9 \mid \vectorn{t} \mid
+\textrm{score}(d,\vectorn{\emph{t}},\vectorn{\emph{s}})=\textrm{rscore}(d,\vectorn{\emph{t}},\vectorn{\emph{s}})+ \lambda_8 \textrm{log}(\textrm{P}_{\textrm{lm}}(\vectorn{\emph{t}}))+\lambda_9 \mid \vectorn{\emph{t}} \mid
 \label{eq:8-6}
 \end{eqnarray}
@@ -432,18 +438,18 @@ h_i (d,\vectorn{t},\vectorn{s})=\sum_{r \in d}h_i (r)
 \parinterval 层次短语模型解码的目标是找到模型得分最高的推导，即：
 \begin{eqnarray}
-\hat{d} = \argmax_{d}\ \textrm{score}(d,\vectorn{s},\vectorn{t})
+\hat{d} = \argmax_{d}\ \textrm{score}(d,\vectorn{\emph{s}},\vectorn{\emph{t}})
 \label{eq:8-7}
 \end{eqnarray}
-\noindent 这里，$\hat{d}$的目标语部分即最佳译文$\hat{\vectorn{t}}$。令函数$t(\cdot)$返回翻译推导的目标语词串，于是有：
+\noindent 这里，$\hat{d}$的目标语部分即最佳译文$\hat{\vectorn{\emph{t}}}$。令函数$t(\cdot)$返回翻译推导的目标语词串，于是有：
 \begin{eqnarray}
-\hat{\vectorn{t}}=t(\hat{d})
+\hat{\vectorn{\emph{t}}}=t(\hat{d})
 \label{eq:8-8}
 \end{eqnarray}
-\parinterval 由于层次短语规则本质上就是CFG规则，因此公式\ref{eq:8-7}代表了一个典型的句法分析过程。需要做的是，用模型源语言端的CFG对输入句子进行分析，同时用模型目标语言端的CFG生成译文。基于CFG的句法分析是自然语言处理中的经典问题。一种广泛使用的方法是：首先把CFG转化为$\varepsilon$-free的{\small\bfnew{乔姆斯基范式}}\index{乔姆斯基范式}（Chomsky Normal Form）\index{Chomsky Normal Form}\footnote[5]{能够证明任意的CFG都可以被转换为乔姆斯基范式，即文法只包含形如A$\to$BC或A$\to$a的规则。这里，假设文法中不包含空串产生式A$\to\varepsilon$，其中$\varepsilon$表示空字符串。}，之后采用CKY方法进行分析。
+\parinterval 由于层次短语规则本质上就是CFG规则，因此公式\eqref{eq:8-7}代表了一个典型的句法分析过程。需要做的是，用模型源语言端的CFG对输入句子进行分析，同时用模型目标语言端的CFG生成译文。基于CFG的句法分析是自然语言处理中的经典问题。一种广泛使用的方法是：首先把CFG转化为$\varepsilon$-free的{\small\bfnew{乔姆斯基范式}}\index{乔姆斯基范式}（Chomsky Normal Form）\index{Chomsky Normal Form}\footnote[5]{能够证明任意的CFG都可以被转换为乔姆斯基范式，即文法只包含形如A$\to$BC或A$\to$a的规则。这里，假设文法中不包含空串产生式A$\to\varepsilon$，其中$\varepsilon$表示空字符串。}，之后采用CKY方法进行分析。
 \parinterval CKY是形式语言中一种常用的句法分析方法\upcite{cocke1969programming,younger1967recognition,kasami1966efficient}。它主要用于分析符合乔姆斯基范式的句子。由于乔姆斯基范式中每个规则最多包含两叉（或者说两个变量），因此CKY方法也可以被看作是基于二叉规则的一种分析方法。对于一个待分析的字符串，CKY方法从小的“范围”开始，不断扩大分析的“范围”，最终完成对整个字符串的分析。在CKY方法中，一个重要的概念是{\small\bfnew{跨度}}\index{跨度}（Span）\index{Span}，所谓跨度表示了一个符号串的范围。这里可以把跨度简单地理解为从一个起始位置到一个结束位置中间的部分。
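上面描述的“从小跨度开始、逐步扩大跨度”的过程可以用下面的CKY识别器草图来说明。这只是一个假设性的最小示例：它只判断句子能否被文法推导出来，不涉及翻译规则、模型得分或剪枝；其中的玩具文法（标签NP、VV、VP、S及其规则）是为了演示而虚构的，并假设文法已经是乔姆斯基范式。

```python
from collections import defaultdict


def cky_recognize(words, lexical_rules, binary_rules, start="S"):
    """Return True if `words` can be derived from `start` under a CNF grammar.

    lexical_rules: dict mapping a word to the set of nonterminals A with A -> word
    binary_rules:  dict mapping (B, C) to the set of nonterminals A with A -> B C
    """
    n = len(words)
    chart = defaultdict(set)  # chart[(i, j)] holds nonterminals covering words[i:j]

    # Spans of length 1 are filled from the lexical rules.
    for i, word in enumerate(words):
        chart[(i, i + 1)] |= lexical_rules.get(word, set())

    # Larger spans are built from two adjacent smaller spans.
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # split point between the two sub-spans
                for left in chart[(i, k)]:
                    for right in chart[(k, j)]:
                        chart[(i, j)] |= binary_rules.get((left, right), set())

    return start in chart[(0, n)]


# Hypothetical toy grammar in Chomsky normal form.
lexical_rules = {"猫": {"NP"}, "鱼": {"NP"}, "喜欢": {"VV"}, "吃": {"VV"}}
binary_rules = {
    ("VV", "NP"): {"VP"},
    ("VV", "VP"): {"VP"},
    ("NP", "VP"): {"S"},
}

accepted = cky_recognize(["猫", "喜欢", "吃", "鱼"], lexical_rules, binary_rules)
rejected = cky_recognize(["鱼", "吃"], lexical_rules, binary_rules)
```

三层循环分别枚举跨度长度、起始位置和切分点，因此在句长为$n$时共访问$O(n^3)$个（跨度，切分点）组合。要得到实际的解码器，还需要在每个跨度上保存带得分的推导，而不只是非终结符集合。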
@@ -726,11 +732,11 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q
 \cline{3-5}
 \rule{0pt}{15pt} & & \multicolumn{1}{c|}{树到串} & \multicolumn{1}{c}{串到树} & \multicolumn{1}{|c}{树到树} \\
 \hline
-源语句法 & 否 & 是 & 否 & 是 \\
+\rule{0pt}{15pt}源语句法 & 否 & 是 & 否 & 是 \\
-目标语句法 & 否 & 否 & 是 & 是 \\
+\rule{0pt}{15pt}目标语句法 & 否 & 否 & 是 & 是 \\
-基于串的解码 & 是 & 否 & 是 & 是 \\
+\rule{0pt}{15pt}基于串的解码 & 是 & 否 & 是 & 是 \\
-基于树的解码 & 否 & 是 & 否 & 是 \\
+\rule{0pt}{15pt}基于树的解码 & 否 & 是 & 否 & 是 \\
-健壮性 & 高 & 中 & 中 & 低 \\
+\rule{0pt}{15pt}健壮性 & 高 & 中 & 中 & 低 \\
 \end{tabular}
 }
 \end{center}
@@ -815,7 +821,6 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q
 \parinterval 树片段的叶子节点既可以是终结符（单词）也可以是非终结符。当叶子节点为非终结符时，表示这个非终结符会被进一步替换，因此它可以被看作是变量。而源语言树结构和目标语言树结构中的变量是一一对应的，对应关系用虚线表示。
 \parinterval 这个双语映射关系可以被表示为一个基于树结构的文法规则，套用规则的定义$\langle\  \alpha_h, \beta_h\ \rangle \to \langle\ \alpha_r, \beta_r, \sim\ \rangle$形式，可以知道：
 \begin{eqnarray}
 \alpha_h &=& \textrm{VP} \nonumber \\
 \beta_h &=& \textrm{VP} \nonumber \\
@@ -823,13 +828,11 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q
 \beta_r &=& \textrm{VP}(\textrm{VBZ(was)}\ \ \textrm{VP(VBN:}x\ \ \textrm{PP:}x)) \nonumber \\
 \sim &=& \{1-2,2-1\} \nonumber
 \end{eqnarray}
 \noindent 这里，$\alpha_h$和$\beta_h$表示规则的左部，对应树片段的根节点；$\alpha_r$和$\beta_r$是两种语言的树结构（序列化表示），其中标记为$x$的非终结符是变量。$\sim = \{1-2,2-1\}$表示源语言的第一个变量对应目标语言的第二个变量，而源语言的第二个变量对应目标语言的第一个变量，这也反映出两种语言句法结构中的调序现象。类似于层次短语规则，可以把规则中变量的对应关系用下标进行表示。例如，上面的规则也可以被写为如下形式：
 \begin{eqnarray}
 \langle\ \textrm{VP}, \textrm{VP}\ \rangle\ \to\ \langle\ \textrm{VP(}\textrm{PP}_{1} \ \textrm{VP(VV(表示)}\ \textrm{NN}_{2})),\ \ \textrm{VP}(\textrm{VBZ(was)}\ \textrm{VP(VBN}_{2} \ \textrm{PP}_{1})) \ \rangle \nonumber
 \end{eqnarray}
 \noindent 其中，两种语言中变量的对应关系为$\textrm{PP}_1 \leftrightarrow \textrm{PP}_1$，$\textrm{NN}_2 \leftrightarrow \textrm{VBN}_2$。
 %----------------------------------------------------------------------------------------
@@ -1305,7 +1308,7 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \subsection{句法翻译模型的特征}
-\parinterval 基于语言学句法的翻译模型使用判别式模型对翻译推导进行建模（{\chapterseven}数学建模小节）。给定双语句对($\vectorn{s}$,$\vectorn{t}$)，由$M$个特征经过线性加权，得到每个翻译推导$d$的得分，记为$\textrm{score(}d,\vectorn{t},\vectorn{s})=\sum_{i=1}^{M} \lambda_i \cdot h_{i}(d,\vectorn{t},\vectorn{s})$，其中$\lambda_i$表示特征权重，$h_{i}(d,\vectorn{t},\vectorn{s})$表示特征函数。翻译的目标就是要找到使$\textrm{score(}d,\vectorn{t},\vectorn{s})$达到最高的推导$d$。
+\parinterval 基于语言学句法的翻译模型使用判别式模型对翻译推导进行建模（{\chapterseven}数学建模小节）。给定双语句对($\vectorn{\emph{s}}$,$\vectorn{\emph{t}}$)，由$M$个特征经过线性加权，得到每个翻译推导$d$的得分，记为$\textrm{score(}d,\vectorn{\emph{t}},\vectorn{\emph{s}})=\sum_{i=1}^{M} \lambda_i \cdot h_{i}(d,\vectorn{\emph{t}},\vectorn{\emph{s}})$，其中$\lambda_i$表示特征权重，$h_{i}(d,\vectorn{\emph{t}},\vectorn{\emph{s}})$表示特征函数。翻译的目标就是要找到使$\textrm{score(}d,\vectorn{\emph{t}},\vectorn{\emph{s}})$达到最高的推导$d$。
 \parinterval 这里，可以使用最小错误率训练对特征权重进行调优（{\chapterseven}最小错误率训练小节）。而特征函数可参考如下定义：
@@ -1346,9 +1349,9 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \begin{itemize}
 \vspace{0.5em}
-\item (h8)语言模型得分（取对数），即$\log(\textrm{P}_{\textrm{lm}}(\vectorn{t}))$，用于度量译文的流畅度；
+\item (h8)语言模型得分（取对数），即$\log(\textrm{P}_{\textrm{lm}}(\vectorn{\emph{t}}))$，用于度量译文的流畅度；
 \vspace{0.5em}
-\item (h9)译文长度，即$|\vectorn{t}|$，用于避免模型过于倾向生成短译文（因为短译文语言模型分数高）；
+\item (h9)译文长度，即$|\vectorn{\emph{t}}|$，用于避免模型过于倾向生成短译文（因为短译文语言模型分数高）；
 \vspace{0.5em}
 \item (h10)翻译规则数量，学习对使用规则数量的偏好。比如，如果这个特征的权重较高，则表明系统更喜欢使用数量多的规则；
 \vspace{0.5em}
@@ -1455,7 +1458,7 @@ d_1 = {d'} \circ {r_5}
 \parinterval 解码的目标是找到得分score($d$)最高的推导$d$。这个过程通常被描述为：
 \begin{eqnarray}
-\hat{d} = \argmax_d\ \textrm{score} (d,\vectorn{s},\vectorn{t})
+\hat{d} = \argmax_d\ \textrm{score} (d,\vectorn{\emph{s}},\vectorn{\emph{t}})
 \label{eq:8-13}
 \end{eqnarray}
@@ -1571,11 +1574,11 @@ d_1 = {d'} \circ {r_5}
 \textrm{VP}_1\ \ \textrm{NP}_2 &\rightarrow& \textrm{V103(}\ \ \textrm{VP}_1\ \ \textrm{NP}_2 ) \nonumber
 \end{eqnarray}
-\noindent 可以看到，这两条新的规则源语言端只有两个部分，代表两个分叉。V103是一个新的标签，它没有任何句法含义。不过，为了保证二叉化后规则目标语部分的连续性，需要考虑源语言和目标语二叉化的同步性\upcite{zhang2006synchronous,Tong2009Better}。这样的规则与CKY方法一起使用完成解码，具体内容可以参考\ref{section-8.2.4}节的内容。
+\noindent 可以看到，这两条新的规则源语言端只有两个部分，代表两个分叉。V103是一个新的标签，它没有任何句法含义。不过，为了保证二叉化后规则目标语部分的连续性，需要考虑源语言和目标语二叉化的同步性\upcite{DBLP:conf/naacl/ZhangHGK06,Tong2009Better}。这样的规则与CKY方法一起使用完成解码，具体内容可以参考\ref{section-8.2.4}节的内容。
 \vspace{0.5em}
 \end{itemize}
-\parinterval 总的来说，基于句法的解码器较为复杂。无论是算法的设计还是工程技巧的运用，对开发者的能力都有一定要求。因此开发一个优秀的基于句法的机器翻译系统是一项有挑战的工作。
+\parinterval 总的来说，基于句法的解码器较为复杂，无论是算法的设计还是工程技巧的运用，对开发者的能力都有一定要求。因此开发一个优秀的基于句法的机器翻译系统是一项有挑战的工作。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -1583,15 +1586,15 @@ d_1 = {d'} \circ {r_5}
 \sectionnewpage
 \section{小结及深入阅读}
-\parinterval 自基于规则的方法开始，如何句法信息就是机器翻译研究人员关注的热点。在统计机器翻译时代，句法信息与机器翻译的结合成为了最具时态特色的研究方向之一。句法结构具有高度的抽象性，因此可以缓解基于词串方法不善于处理句子上层结构的问题。
+\parinterval 自基于规则的方法开始，如何使用句法信息就是机器翻译研究人员关注的热点。在统计机器翻译时代，句法信息与机器翻译的结合成为了最具时代特色的研究方向之一。句法结构具有高度的抽象性，因此可以缓解基于词串方法不善于处理句子上层结构的问题。
 \parinterval 本章对基于句法的机器翻译模型进行了介绍，并重点讨论了相关的建模、翻译规则抽取以及解码问题。从某种意义上说，基于句法的模型与基于短语的模型都同属一类模型，因为二者都假设：两种语言间存在由短语或者规则构成的翻译推导，而机器翻译的目标就是找到最优的翻译推导。但是，由于句法信息有其独特的性质，因此也给机器翻译带来了新的问题。有几方面问题值得关注：
 \begin{itemize}
 \vspace{0.5em}
-\item 从建模的角度看，早期的统计机器翻译模型已经涉及到了树结构的表示问题\upcite{DBLP:conf/acl/AlshawiBX97,DBLP:conf/acl/WangW98}。不过，基于句法的翻译模型的真正崛起还源自同步文法的提出。初期的工作大多集中在反向转录文法和括号转录文法方面\upcite{DBLP:conf/acl-vlc/Wu95,wu1997stochastic,DBLP:conf/acl/WuW98}，这类方法也被用于短语获取\upcite{ja2006obtaining,DBLP:conf/acl/ZhangQMG08}。进一步，研究者提出了更加通用的层次模型来描述翻译过程\upcite{chiang2005a,DBLP:conf/coling/ZollmannVOP08,DBLP:conf/acl/WatanabeTI06}，本章介绍的层次短语模型就是其中典型的代表。之后，使用语言学句法的模型也逐渐兴起。最具代表性的是在单语言端使用语言学句法信息的模型\upcite{DBLP:conf/naacl/GalleyHKM04,galley2006scalable,marcu2006spmt,DBLP:conf/naacl/HuangK06,DBLP:conf/emnlp/DeNeefeKWM07,DBLP:conf/wmt/LiuG08,DBLP:conf/acl/LiuLL06}，即：树到串翻译模型和串到树翻译模型。值得注意的是，除了直接用句法信息定义翻译规则，也有研究者将句法信息作为软约束改进层次短语模型\upcite{zollmann2006syntax,DBLP:conf/acl/MartonR08}。这类方法具有很大的灵活性，既保留了层次短语模型比较健壮的特点，同时也兼顾了语言学句法对翻译的指导作用。在同一时期，也有研究者提出同时使用双语两端的语言学句法树对翻译进行建模，比较有代表性的工作是使用同步树插入文法（Synchronous Tree-Insertion Grammars）和同步树替换文法（Synchronous Tree-Substitution Grammars）进行树到树翻译的建模\upcite{Nesson06inductionof,Zhang07atree-to-tree,DBLP:conf/acl/LiuLL09}。不过，树到树翻译假设两种语言间的句法结构能够相互转换，而这个假设并不总是成立。因此树到树翻译系统往往要配合一些技术，如树二叉化，来提升系统的健壮性。
+\item 从建模的角度看，早期的统计机器翻译模型已经涉及到了树结构的表示问题\upcite{DBLP:conf/acl/AlshawiBX97,DBLP:conf/acl/WangW98}。不过，基于句法的翻译模型的真正崛起还源自同步文法的提出。初期的工作大多集中在反向转录文法和括号转录文法方面\upcite{DBLP:conf/acl-vlc/Wu95,wu1997stochastic,DBLP:conf/acl/WuW98}，这类方法也被用于短语获取\upcite{ja2006obtaining,DBLP:conf/acl/ZhangQMG08}。进一步，研究者提出了更加通用的层次模型来描述翻译过程\upcite{chiang2005a,DBLP:conf/coling/ZollmannVOP08,DBLP:conf/acl/WatanabeTI06}，本章介绍的层次短语模型就是其中典型的代表。之后，使用语言学句法的模型也逐渐兴起。最具代表性的是在单语言端使用语言学句法信息的模型\upcite{DBLP:conf/naacl/GalleyHKM04,galley2006scalable,marcu2006spmt,DBLP:conf/naacl/HuangK06,DBLP:conf/emnlp/DeNeefeKWM07,DBLP:conf/wmt/LiuG08,liu2006tree}，即：树到串翻译模型和串到树翻译模型。值得注意的是，除了直接用句法信息定义翻译规则，也有研究者将句法信息作为软约束改进层次短语模型\upcite{DBLP:conf/wmt/ZollmannV06,DBLP:conf/acl/MartonR08}。这类方法具有很大的灵活性，既保留了层次短语模型比较健壮的特点，同时也兼顾了语言学句法对翻译的指导作用。在同一时期，也有研究者提出同时使用双语两端的语言学句法树对翻译进行建模，比较有代表性的工作是使用同步树插入文法（Synchronous Tree-Insertion Grammars）和同步树替换文法（Synchronous Tree-Substitution Grammars）进行树到树翻译的建模\upcite{Nesson06inductionof,Zhang07atree-to-tree,liu2009improving}。不过，树到树翻译假设两种语言间的句法结构能够相互转换，而这个假设并不总是成立。因此树到树翻译系统往往要配合一些技术，如树二叉化，来提升系统的健壮性。
 \vspace{0.5em}
-\item 在基于句法的模型中，常常会使用句法分析器完成句法分析树的生成。由于句法分析器会产生错误，因此这些错误会对机器翻译系统产生影响。对于这个问题，一种解决办法是同时考虑更多的句法树，这样增加正确句法分析结果被使用到的概率。其中，比较典型的方式基于句法森林的方法\upcite{DBLP:conf/acl/MiHL08,DBLP:conf/emnlp/MiH08}，比如，在规则抽取或者解码阶段使用句法森林，而不是仅仅使用一棵单独的句法树。另一种思路是，对句法结构进行松弛操作，即在翻译的过程中并不严格遵循句法结构\upcite{DBLP:conf/acl/ZhuX11,DBLP:conf/emnlp/ZhangZZ11}。实际上，前面提到的基于句法软约束的模型也是这类方法的一种体现\upcite{DBLP:conf/wmt/ZollmannV06,DBLP:conf/acl/MartonR08}。实际上，机器翻译领域的长期存在一个问题：使用什么样的句法结构是最适合机器翻译？因此，有研究者尝试对比不同的句法分析结果对机器翻译系统的影响\upcite{DBLP:conf/wmt/PopelMGZ11,DBLP:conf/coling/XiaoZZZ10}。也有研究者面向机器翻译任务自动归纳句法结构\upcite{DBLP:journals/tacl/ZhaiZZZ13}，而不是直接使用从单语小规模树库学习到的句法分析器，这样可以提高系统的健壮性。
+\item 在基于句法的模型中，常常会使用句法分析器完成句法分析树的生成。由于句法分析器会产生错误，因此这些错误会对机器翻译系统产生影响。对于这个问题，一种解决办法是同时考虑更多的句法树，这样可以增加正确句法分析结果被使用到的概率。其中，比较典型的方式是基于句法森林的方法\upcite{DBLP:conf/acl/MiHL08,DBLP:conf/emnlp/MiH08}，比如，在规则抽取或者解码阶段使用句法森林，而不是仅仅使用一棵单独的句法树。另一种思路是，对句法结构进行松弛操作，即在翻译的过程中并不严格遵循句法结构\upcite{zhu2011improving,DBLP:conf/emnlp/ZhangZZ11}。实际上，前面提到的基于句法软约束的模型也是这类方法的一种体现\upcite{DBLP:conf/wmt/ZollmannV06,DBLP:conf/acl/MartonR08}。此外，机器翻译领域长期存在一个问题：使用什么样的句法结构最适合机器翻译？因此，有研究者尝试对比不同的句法分析结果对机器翻译系统的影响\upcite{DBLP:conf/wmt/PopelMGZ11,DBLP:conf/coling/XiaoZZZ10}。也有研究者面向机器翻译任务自动归纳句法结构\upcite{DBLP:journals/tacl/ZhaiZZZ13}，而不是直接使用从单语小规模树库学习到的句法分析器，这样可以提高系统的健壮性。
 \vspace{0.5em}
 \item 本章所讨论的模型大多基于短语结构树。另一个重要的方向是使用依存树进行翻译建模\upcite{DBLP:journals/mt/QuirkM06,DBLP:conf/wmt/XiongLL07,DBLP:conf/coling/Lin04}。依存树比短语结构树有更简单的结构，而且依存关系本身也是对“语义”的表征，因此也可以捕捉到短语结构树所无法涵盖的信息。同其它基于句法的模型类似，基于依存树的模型大多也需要进行规则抽取、解码等步骤，因此这方面的研究工作大多涉及翻译规则的抽取、基于依存树的解码等\upcite{DBLP:conf/acl/DingP05,DBLP:conf/coling/ChenXMJL14,DBLP:conf/coling/SuLMZLL10,DBLP:conf/coling/XieXL14,DBLP:conf/emnlp/LiWL15}。此外，基于依存树的模型也可以与句法森林结构相结合，对系统性能进行进一步提升\upcite{DBLP:conf/acl/MiL10,DBLP:conf/coling/TuLHLL10}。
 \vspace{0.5em}

--- a/Chapter9/Figures/fig-back-propagation-output2.tex
+++ b/Chapter9/Figures/fig-back-propagation-output2.tex
@@ -10,11 +10,11 @@
 \node [anchor=south west,inner sep=2pt] (step100) at ([xshift=0.5em,yshift=-0.8em]h.north east) {\scriptsize{$\textbf{s}^K = \textbf{h}^{K-1} \textbf{w}^K$}};
-\node [anchor=south west] (slabel) at ([yshift=1em,xshift=0.3em]s.north) {\scriptsize{\red{\textbf{{已经得到：$\pi^K = \frac{\partial L}{\partial \textbf{s}^K}$}}}}};
+\node [anchor=south west] (slabel) at ([yshift=1em,xshift=0.3em]s.north) {\scriptsize{\textbf{{已经得到：$\pi^K = \frac{\partial L}{\partial \textbf{s}^K}$}}}};
-\draw [->,red] ([yshift=0.3em]slabel.south) .. controls +(south:0.5) and +(north:0.5) .. ([xshift=0.5em]s.north);
+\draw [->] ([yshift=0.3em]slabel.south) .. controls +(south:0.5) and +(north:0.5) .. ([xshift=0.5em,yshift=0.1em]s.north);
 {
-\draw [->,very thick,red] ([yshift=1em,xshift=-0.1em]s.north) -- ([yshift=1em,xshift=0.1em]h.north) node [pos=0.5,above] {\scriptsize{{$\frac{\partial L}{\partial \textbf{w}^K} = ?$, $\frac{\partial L}{\partial \textbf{h}^{K-1}} = ?$}}};
+\draw [->,very thick,red] ([yshift=1em,xshift=-0.1em]s.north) -- ([yshift=1.0em,xshift=0.1em]h.north) node [pos=0.5,above] {\scriptsize{{$\frac{\partial L}{\partial \textbf{w}^K} = ?$, $\frac{\partial L}{\partial \textbf{h}^{K-1}} = ?$}}};
 \draw [-,very thick,red] ([yshift=0.5em]h.north) -- ([yshift=1.5em]h.north);
 \draw [-,very thick,red] ([yshift=0.5em]s.north) -- ([yshift=1.5em]s.north);
 }
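
Note on the hunk above: the figure asks for $\frac{\partial L}{\partial \textbf{w}^K}$ and $\frac{\partial L}{\partial \textbf{h}^{K-1}}$ given $\textbf{s}^K = \textbf{h}^{K-1} \textbf{w}^K$ and $\pi^K = \frac{\partial L}{\partial \textbf{s}^K}$. The following is only a sketch of how those two gradients could be worked out with the chain rule. It assumes (this is not stated in the diff) that $\textbf{h}^{K-1}$ is a $1 \times m$ row vector, that $\textbf{w}^K$ is an $m \times n$ matrix, and that each gradient has the same shape as the variable it is taken with respect to. The element indices $i$ and $j$ are introduced here for illustration.

```latex
% Assumption: h^{K-1} is 1 x m, w^K is m x n, hence s^K and pi^K are 1 x n.
% Indices i (over m) and j (over n) are illustrative, not from the source.
\begin{eqnarray*}
s^K_j & = & \sum_{i} h^{K-1}_i \, w^K_{ij} \\
\frac{\partial L}{\partial w^K_{ij}} & = & \frac{\partial L}{\partial s^K_j} \cdot \frac{\partial s^K_j}{\partial w^K_{ij}} \; = \; h^{K-1}_i \, \pi^K_j
  \quad \Rightarrow \quad
  \frac{\partial L}{\partial \textbf{w}^K} \; = \; [\textbf{h}^{K-1}]^{\textrm{T}} \, \pi^K \\
\frac{\partial L}{\partial h^{K-1}_i} & = & \sum_{j} \frac{\partial L}{\partial s^K_j} \cdot \frac{\partial s^K_j}{\partial h^{K-1}_i} \; = \; \sum_{j} \pi^K_j \, w^K_{ij}
  \quad \Rightarrow \quad
  \frac{\partial L}{\partial \textbf{h}^{K-1}} \; = \; \pi^K \, [\textbf{w}^K]^{\textrm{T}}
\end{eqnarray*}
```

Under these assumptions the shapes are consistent: $[\textbf{h}^{K-1}]^{\textrm{T}} \pi^K$ is $m \times n$, matching $\textbf{w}^K$, and $\pi^K [\textbf{w}^K]^{\textrm{T}}$ is $1 \times m$, matching $\textbf{h}^{K-1}$. With a column-vector convention the factors would appear transposed and in the opposite order.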

--- a/Chapter9/Figures/fig-code-back-propagation-1.tex
+++ b/Chapter9/Figures/fig-code-back-propagation-1.tex
@@ -51,15 +51,15 @@
 \node [anchor=south,draw,rounded corners,inner sep=2pt,minimum width=8em,minimum height=1.2em,fill=green!30!white,blur shadow={shadow xshift=1pt,shadow yshift=-1pt}] (h3) at ([yshift=1.5em]h2.north) {\scriptsize{h2 = Relu(h1 * w2)}};
 \node [anchor=south,draw,rounded corners,inner sep=2pt,minimum width=8em,minimum height=1.2em,fill=green!30!white,blur shadow={shadow xshift=1pt,shadow yshift=-1pt}] (h4) at ([yshift=1.5em]h3.north) {\scriptsize{h3 = h2 + h1}};
-{\draw [<-,very thick,red] (h1.north) -- (h2.south);}
+{\draw [->,very thick] (h1.north) -- (h2.south);}
-{\draw [<-,very thick,red] (h2.north) -- (h3.south);}
+{\draw [->,very thick] (h2.north) -- (h3.south);}
-{\draw [<-,very thick,red] (h3.north) -- (h4.south);}
+{\draw [->,very thick] (h3.north) -- (h4.south);}
-{\draw [<-,very thick,red,rounded corners] (h2.east) -- ([xshift=0.5em]h2.east) -- ([xshift=0.5em,yshift=0.5em]h3.north east) -- ([xshift=-2em,yshift=0.5em]h3.north east) -- ([xshift=-2em,yshift=1.5em]h3.north east);}
+{\draw [->,very thick,rounded corners] (h2.east) -- ([xshift=0.5em]h2.east) -- ([xshift=0.5em,yshift=0.5em]h3.north east) -- ([xshift=-2em,yshift=0.5em]h3.north east) -- ([xshift=-2em,yshift=1.5em]h3.north east);}
 \node [anchor=south,draw,rounded corners,inner sep=2pt,minimum width=8.0em,minimum height=1.2em,fill=red!30!white,blur shadow={shadow xshift=1pt,shadow yshift=-1pt}] (slayer) at ([yshift=1.5em]h4.north) {\tiny{h4 = Softmax(h3 * w4) (output)}};
 \node [anchor=south] (losslabel) at (slayer.north) {\scriptsize{\textbf{Cross Entropy Loss}}};
-{\draw [<-,very thick,red] (h4.north) -- (slayer.south);}
+{\draw [->,very thick] (h4.north) -- (slayer.south);}
 \end{tikzpicture}
 \end{center}

--- a/Chapter9/Figures/fig-corresponence-between-matrix-element-and-output.tex
+++ b/Chapter9/Figures/fig-corresponence-between-matrix-element-and-output.tex
@@ -28,10 +28,10 @@
 \node [anchor=west] (y2) at ([xshift=4em]neuron02.east) {$y_2$：\scriptsize{风力}};
-\draw [->,purple!40,line width=0.4mm] (x0.east) -- (neuron02.140) node [pos=0.1,below,yshift=-0.2em] {\tiny{$w_{02}$}};
+\draw [->,ugreen!50,line width=0.4mm] (x0.east) -- (neuron02.140) node [pos=0.1,below,yshift=-0.2em] {\tiny{$w_{02}$}};
-\draw [->,purple!40,line width=0.4mm] (x1.east) -- (neuron02.160) node [pos=0.1,below] {\tiny{$w_{12}$}};
+\draw [->,ugreen!50,line width=0.4mm] (x1.east) -- (neuron02.160) node [pos=0.1,below] {\tiny{$w_{12}$}};
-\draw [->,purple!40,line width=0.4mm] (x2.east) -- (neuron02.180) node [pos=0.3,below] {\tiny{$b_{2}$}};
+\draw [->,ugreen!50,line width=0.4mm] (x2.east) -- (neuron02.180) node [pos=0.3,below] {\tiny{$b_{2}$}};
-\draw [->,purple!30,line width=0.4mm] (neuron02.east) -- (y2.west);
+\draw [->,ugreen!30,line width=0.4mm] (neuron02.east) -- (y2.west);
 \end{scope}
 \end{tikzpicture}

--- a/Chapter9/Figures/fig-fit.tex
+++ b/Chapter9/Figures/fig-fit.tex
@@ -42,10 +42,10 @@
 %% sigmoid box
 \begin{scope}
 {
-\node [anchor=west] (flabel) at ([xshift=0.5in]y.east) {\scriptsize{sigmoid:}};
+\node [anchor=west] (flabel) at ([xshift=0.5in]y.east) {\scriptsize{Sigmoid:}};
-\node [anchor=north east] (slabel) at ([xshift=0]flabel.south east) {\scriptsize{sum:}};
+\node [anchor=north east] (slabel) at ([xshift=0]flabel.south east) {\scriptsize{Sum:}};
-\node [anchor=west,inner sep=2pt] (flabel2) at (flabel.east) {\scriptsize{$f(s)=1/(1+e^{-s})$}};
+\node [anchor=west,inner sep=2pt] (flabel2) at (flabel.east) {\scriptsize{$f(s_2)=1/(1+e^{-s_2})$}};
-\node [anchor=west,inner sep=2pt] (flabel3) at (slabel.east) {\scriptsize{$s=x_1 \cdot w + b$}};
+\node [anchor=west,inner sep=2pt] (flabel3) at (slabel.east) {\scriptsize{$s_2=x_1 \cdot w_2 + b$}};
 \draw [->,thick,dotted] ([yshift=-0.3em,xshift=-0.1em]n11.60)  .. controls +(east:1) and +(west:2) ..  ([xshift=-0.2em]flabel.west) ;
 \begin{pgfonlayer}{background}
@@ -136,10 +136,10 @@
 %% sigmoid box
 \begin{scope}
 {
-\node [anchor=west] (flabel) at ([xshift=0.8in]y.east) {\scriptsize{sigmoid:}};
+\node [anchor=west] (flabel) at ([xshift=0.8in]y.east) {\scriptsize{Sigmoid:}};
-\node [anchor=north east] (slabel) at ([xshift=0]flabel.south east) {\scriptsize{sum:}};
+\node [anchor=north east] (slabel) at ([xshift=0]flabel.south east) {\scriptsize{Sum:}};
-\node [anchor=west,inner sep=2pt] (flabel2) at (flabel.east) {\scriptsize{$f(s)=1/(1+e^{-s})$}};
+\node [anchor=west,inner sep=2pt] (flabel2) at (flabel.east) {\scriptsize{$f(s_2)=1/(1+e^{-s_2})$}};
-\node [anchor=west,inner sep=2pt] (flabel3) at (slabel.east) {\scriptsize{$s=x_1 \cdot w + b$}};
+\node [anchor=west,inner sep=2pt] (flabel3) at (slabel.east) {\scriptsize{$s_2=x_1 \cdot w_2 + b$}};
 \draw [->,thick,dotted] ([yshift=-0.3em,xshift=-0.1em]n11.60)  .. controls +(east:1) and +(west:2) ..  ([xshift=-0.2em]flabel.west) ;
 \begin{pgfonlayer}{background}
 {

--- a/Chapter9/Figures/fig-four-layers-of-neural-network.tex
+++ b/Chapter9/Figures/fig-four-layers-of-neural-network.tex
@@ -26,12 +26,12 @@
 \end{pgfonlayer}
 \node [anchor=west] (layer00label) at ([xshift=1.3em]x5.east) {\footnotesize{第0层}};
-\node [anchor=west] (layer00label2) at (layer00label.east) {\footnotesize{\red{(输入层)}}};
+\node [anchor=west] (layer00label2) at (layer00label.east) {\footnotesize{(输入层)}};
 {
 \node [anchor=west] (layer01label) at ([xshift=1em]layer01.east) {\footnotesize{第1层}};
 }
 {
-\node [anchor=west] (layer01label2) at (layer01label.east) {\footnotesize{\red{({隐层})}}};
+\node [anchor=west] (layer01label2) at (layer01label.east) {\footnotesize{({隐层})}};
 }
 %%% layer 2
@@ -57,7 +57,7 @@
 \node [anchor=west] (layer02label) at ([xshift=4.4em]layer02.east) {\footnotesize{第2层}};
 {
-\node [anchor=west] (layer02label2) at (layer02label.east) {\footnotesize{\red{({隐层})}}};
+\node [anchor=west] (layer02label2) at (layer02label.east) {\footnotesize{({隐层})}};
 }
 }
@@ -87,7 +87,7 @@
 \node [anchor=west] (layer03label) at ([xshift=1em]layer03.east) {\footnotesize{第3层}};
 {
-\node [anchor=west] (layer03label2) at (layer03label.east) {\footnotesize{\red{({输出层})}}};
+\node [anchor=west] (layer03label2) at (layer03label.east) {\footnotesize{({输出层})}};
 }
 }

--- a/Chapter9/Figures/fig-linear-transformation.tex
+++ b/Chapter9/Figures/fig-linear-transformation.tex
@@ -4,21 +4,21 @@ $$
 \begin{smallmatrix}  \underbrace{
    \left\{
        \begin{smallmatrix}
-            \left[
+            \left(
            \begin{array}{cccc}
             1& 0 &0 \\
             0& 1 &0 \\
             0& 0 &1
            \end{array}
-            \right ]
+            \right )
            \cdots
-            \left[
+            \left(
            \begin{array}{cccc}
                1& 0 &0 \\
                0& 1 &0 \\
                0& 0 &1
            \end{array}
-            \right]
+            \right)
        \end{smallmatrix}
        \right\}
     }\\5
@@ -37,21 +37,21 @@ $$
 \begin{smallmatrix}  \underbrace{
    \left\{
        \begin{smallmatrix}
-            \left[
+            \left(
            \begin{array}{cccc}
             1 \\
             1 \\
             1
            \end{array}
-            \right ]
+            \right)
            \cdots
-            \left[
+            \left(
            \begin{array}{cccc}
                1 \\
                1 \\
                1
            \end{array}
-            \right]
+            \right)
        \end{smallmatrix}
        \right\}
     }\\5

--- a/Chapter9/Figures/fig-model-training.tex
+++ b/Chapter9/Figures/fig-model-training.tex
@@ -12,7 +12,7 @@
 \node [anchor=north] (data) at ([yshift=-1em]system.south) {\scriptsize{\textbf{目标任务有标注数据}}};
 \draw [->,thick] (data.north) -- ([yshift=-0.1em]system.south);
-\node [anchor=north] (label) at ([yshift=-0em]data.south) {\scriptsize{(a) standard method}};
+\node [anchor=north] (label) at ([yshift=-0em]data.south) {\scriptsize{(a) 标准方法}};
 \end{scope}
@@ -31,7 +31,7 @@
 \draw [->,thick] (data.north) -- ([yshift=-0.1em]system.south);
 \node [anchor=north] (data2) at ([yshift=-1em,xshift=-7em]system.south) {\scriptsize{\textbf{大规模无标注数据}}};
 \draw [->,thick] (data2.north) -- ([yshift=-0.1em]encoderpre.south);
-\node [anchor=north] (label) at ([yshift=-0em,xshift=-4em]data.south) {\scriptsize{(b) pre-training + fine-tuning}};
+\node [anchor=north] (label) at ([yshift=-0em,xshift=-4em]data.south) {\scriptsize{(b) 预训练 + 微调}};
 \end{scope}

--- a/Chapter9/Figures/fig-parallel.tex
+++ b/Chapter9/Figures/fig-parallel.tex
@@ -13,7 +13,7 @@
 \node[parametershard,anchor=west,fill=yellow!10] (param1) at (0,0) {$W_o$};
 \node (param2) at ([xshift=1em]param1.east) {};
 \node[parametershard,anchor=west,fill=red!10] (param3) at ([xshift=1em]param2.east) {$W_h$};
-\node[anchor=south,inner sep=1pt] (serverlabel) at ([yshift=0.2em]param2.north) {\footnotesize{\textbf{parameter server}: $\mathbf w_{new} = \mathbf w - \alpha\cdot \frac{\partial L}{\partial \mathbf w}$}};
+\node[anchor=south,inner sep=1pt] (serverlabel) at ([yshift=0.2em]param2.north) {\footnotesize{\textbf{parameter server}: $\mathbf w_{\textrm{new}} = \mathbf w - \alpha\cdot \frac{\partial L}{\partial \mathbf w}$}};
 }
 \begin{pgfonlayer}{background}
@@ -33,7 +33,7 @@
 {
 \draw[->,very thick,red] ([xshift=-0.5em,yshift=2pt]processor2.north) -- ([xshift=-0.5em,yshift=-2pt]serverbox.south) node [pos=0.5,align=right,xshift=-2em] (pushlabel) {\scriptsize{$\frac{\partial L}{\partial \mathbf w}$}};;
-\draw[<-,very thick,blue] ([xshift=0.5em,yshift=2pt]processor2.north) -- ([xshift=0.5em,yshift=-2pt]serverbox.south) node [pos=0.5,align=left,xshift=2.2em] (fetchlabel) {\scriptsize{$\mathbf w_{new}$}};;;
+\draw[<-,very thick,blue] ([xshift=0.5em,yshift=2pt]processor2.north) -- ([xshift=0.5em,yshift=-2pt]serverbox.south) node [pos=0.5,align=left,xshift=2.2em] (fetchlabel) {\scriptsize{$\mathbf w_{\textrm{new}}$}};
 \draw[->,very thick,red] ([xshift=-0.5em,yshift=2pt]processor3.north) --
 ([xshift=3em,yshift=-2pt]serverbox.south);
 \draw[<-,very thick,blue] ([xshift=0.5em,yshift=2pt]processor3.north) -- ([xshift=4em,yshift=-2pt]serverbox.south) node [pos=0.5,align=left,xshift=2.2em] (fetchlabel) {\scriptsize{fetch (F)}};

--- a/Chapter9/Figures/fig-perceptron-mode.tex
+++ b/Chapter9/Figures/fig-perceptron-mode.tex
@@ -6,7 +6,7 @@
 \node [anchor=center] (x0) at ([yshift=3em]x1.center) {\Large{$x_0$}};
 \node [anchor=center] (x2) at ([yshift=-3em]x1.center) {\Large{$x_2$}};
 \node [anchor=west] (y) at ([xshift=6em]neuron.east) {\Large{$y$}};
-\node [anchor=center] (neuronmath) at (neuron.center) {\red{\small{$\sum \ge \sigma$}}};
+\node [anchor=center] (neuronmath) at (neuron.center) {\small{$\sum \ge \sigma$}};
 \draw [->,thick] (x0.east) -- (neuron.150) node [pos=0.5,above] {$w_0$};
 \draw [->,thick] (x1.east) -- (neuron.180) node [pos=0.5,above] {$w_1$};

--- a/Chapter9/Figures/fig-perceptron-to-predict-2.tex
+++ b/Chapter9/Figures/fig-perceptron-to-predict-2.tex
@@ -14,7 +14,7 @@
 \draw [->,thick] (neuron.east) -- (y.west);
 \node [anchor=center] (neuronmath) at (neuron.center) {\small{$\sum \ge \sigma$}};
-\node [anchor=south] (ylabel) at (y.north) {\textbf{不去了！}};
+\node [anchor=south] (ylabel) at (y.north) {\textbf{}};
 \end{scope}

--- a/Chapter9/Figures/fig-softmax.tex
+++ b/Chapter9/Figures/fig-softmax.tex
+\definecolor{ublue}{rgb}{0.152,0.250,0.545}
+\begin{tikzpicture}
+\begin{axis}[  
+  width=8cm, height=5cm, 
+  xtick={-6,-4,...,6},
+  ytick={0,0.5,1},
+  xlabel={\small{$x$}},
+  ylabel={\small{Softmax($x$)}},
+  xlabel style={xshift=3.0cm,yshift=1cm},
+  axis y line=middle,
+  ylabel style={xshift=-2.4cm,yshift=-0.2cm},
+  x axis line style={->},
+  axis line style={very thick},
+ % ymajorgrids,
+  %xmajorgrids,
+ axis x line*=bottom,
+  xmin=-6,
+  xmax=6,
+  ymin=0,
+  ymax=1]
+\addplot[draw=ublue,very thick]{(tanh(x/2) + 1)/2};
+\end{axis}
+\end{tikzpicture}
+%---------------------------------------------------------------------
\ No newline at end of file
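
Note on the new file above: the curve is plotted as `(tanh(x/2) + 1)/2` while the axis is labelled Softmax($x$). The short derivation below shows that this expression is the logistic function $1/(1+e^{-x})$. Reading it as a Softmax is an assumption made here, not something the diff states: it holds if the plot is meant as the two-class Softmax probability of the first class with scores $(x, 0)$.

```latex
% Identity behind the plotted expression (tanh(x/2) + 1)/2.
% Assumption (not in the source): the "Softmax" label refers to the
% two-class case with scores (x, 0).
\begin{eqnarray*}
\tanh\left(\frac{x}{2}\right) & = & \frac{e^{x/2} - e^{-x/2}}{e^{x/2} + e^{-x/2}} \; = \; \frac{e^{x} - 1}{e^{x} + 1} \\
\frac{\tanh(x/2) + 1}{2} & = & \frac{(e^{x} - 1) + (e^{x} + 1)}{2\,(e^{x} + 1)} \; = \; \frac{e^{x}}{e^{x} + 1} \; = \; \frac{1}{1 + e^{-x}} \\
\frac{e^{x}}{e^{x} + e^{0}} & = & \frac{1}{1 + e^{-x}}
\end{eqnarray*}
```

The last line is the two-class Softmax with the second score fixed at $0$, which is why the curve passes through $0.5$ at $x = 0$ and stays inside the plotted range $[0, 1]$.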
--- a/Chapter9/Figures/fig-tensor-sample.tex
+++ b/Chapter9/Figures/fig-tensor-sample.tex
@@ -7,7 +7,7 @@
 \begin{tikzpicture}
 \begin{scope}[yshift=6.5em,xshift=1em]
 \setcounter{mycount1}{1}
-\draw[step=0.5cm,color=orange,line width=0.2mm] (-2,-2) grid (1,1);
+\draw[step=0.5cm,color=orange,line width=0.4mm] (-2,-2) grid (1,1);
 \foreach \y in {+0.5,-0.5,-1.5}
  \foreach \x in {-1.5,-0.5,0.5}{
    \node [fill=orange!20,inner sep=0pt,minimum height=0.98cm,minimum width=0.98cm] at (\x,\y) {\number\value{mycount1}};
@@ -17,7 +17,7 @@
 \begin{scope}[yshift=5.5em,xshift=0em]
 \setcounter{mycount2}{2}
-\draw[step=0.5cm,color=blue,line width=0.2mm] (-2,-2) grid (1,1);
+\draw[step=0.5cm,color=blue,line width=0.4mm] (-2,-2) grid (1,1);
 \foreach \y in {+0.5,-0.5,-1.5}
  \foreach \x in {-1.5,-0.5,0.5}{
    \node [fill=blue!20,inner sep=0pt,minimum height=0.98cm,minimum width=0.98cm] at (\x,\y) {\number\value{mycount2}};
@@ -27,7 +27,7 @@
 \begin{scope}[yshift=4.5em,xshift=-1em]
 \setcounter{mycount3}{3}
-\draw[step=0.5cm,color=ugreen,line width=0.2mm] (-2,-2) grid (1,1);
+\draw[step=0.5cm,color=ugreen,line width=0.4mm] (-2,-2) grid (1,1);
 \foreach \y in {+0.5,-0.5,-1.5}
  \foreach \x in {-1.5,-0.5,0.5}{
    \node [fill=green!20,inner sep=0pt,minimum height=0.98cm,minimum width=0.98cm] at (\x,\y) {\number\value{mycount3}};
@@ -37,7 +37,7 @@
 \begin{scope}[yshift=3.5em,xshift=-2em]
 \setcounter{mycount4}{4}
-\draw[step=0.5cm,color=red,line width=0.2mm] (-2,-2) grid (1,1);
+\draw[step=0.5cm,color=red,line width=0.4mm] (-2,-2) grid (1,1);
 \foreach \y in {+0.5,-0.5,-1.5}
  \foreach \x in {-1.5,-0.5,0.5}{
    \node [fill=red!20,inner sep=0pt,minimum height=0.98cm,minimum width=0.98cm] at (\x,\y) {\number\value{mycount4}};
@@ -45,4 +45,4 @@
  }
 \end{scope}
 \end{tikzpicture}
 %%%------------------------------------------------------------------------------------------------------------
\ No newline at end of file
--- a/Chapter9/Figures/fig-the-amount-of-data-in-a-bilingual-corpus.tex
+++ b/Chapter9/Figures/fig-the-amount-of-data-in-a-bilingual-corpus.tex
@@ -7,7 +7,7 @@
    yticklabel style={/pgf/number format/precision=1,/pgf/number format/fixed zerofill},
    xticklabel style={/pgf/number format/1000 sep=},
    xlabel style={yshift=0.5em},
-    xlabel={\footnotesize{Year}},ylabel={\footnotesize{\# of sents.}},
+    xlabel={\footnotesize{Year}},ylabel={\footnotesize{句子数量}},
    ymin=1,ymax=1000000000000,
    xmin=1999,xmax=2020,xtick={2000,2005,2010,2015,2020},
    legend style={yshift=-5em,xshift=0em,legend cell align=left,legend plot pos=right}

--- a/Chapter9/Figures/fig-translation.tex
+++ b/Chapter9/Figures/fig-translation.tex
@@ -19,11 +19,11 @@
 \node[above] at ([xshift=2em,yshift=1em]a2.west){1};
 \node[below] at ([xshift=-0.5em,yshift=0em]a2.west){-1};
 \node [anchor=west] (x) at ([xshift=-3.5cm,yshift=2em]a2.north) {\scriptsize{
-    $w=\begin{bmatrix}
+    $\mathbf{w}=\begin{pmatrix}
    1&0&0\\
    0&-1&0\\
    0&0&1
-    \end{bmatrix}$}
+    \end{pmatrix}$}
    };
 \node [anchor=west,rotate = 180] (x) at ([xshift=0.7em,yshift=1em]a2.south) {\Large{$\textbf{F}$}};
@@ -44,11 +44,11 @@
 \node [anchor=west] (x) at ([xshift=-4cm,yshift=2em]a3.north) {\scriptsize{
-    $b=\begin{bmatrix}
+    $\mathbf{b}=\begin{pmatrix}
    0.5&0&0\\
    0&0&0\\
    0&0&0
-    \end{bmatrix}$}
+    \end{pmatrix}$}
    };
 \draw[-stealth, line width=2pt,dashed] ([xshift=3em,yshift=1em]a2.east) to ([xshift=-3em,yshift=1em]a3.west);
 }

--- a/Chapter9/Figures/fig-two-layer-neural-network.tex
+++ b/Chapter9/Figures/fig-two-layer-neural-network.tex
@@ -44,10 +44,10 @@
 %% sigmoid box
 \begin{scope}
 {
-\node [anchor=west] (flabel) at ([xshift=1in]y.east) {\footnotesize{sigmoid:}};
+\node [anchor=west] (flabel) at ([xshift=1in]y.east) {\footnotesize{Sigmoid:}};
-\node [anchor=north east] (slabel) at ([xshift=0]flabel.south east) {\footnotesize{sum:}};
+\node [anchor=north east] (slabel) at ([xshift=0]flabel.south east) {\footnotesize{Sum:}};
-\node [anchor=west,inner sep=2pt] (flabel2) at (flabel.east) {\footnotesize{$f(s)=1/(1+e^{-s})$}};
+\node [anchor=west,inner sep=2pt] (flabel2) at (flabel.east) {\footnotesize{$f(s_2)=1/(1+e^{-s_2})$}};
-\node [anchor=west,inner sep=2pt] (flabel3) at (slabel.east) {\footnotesize{$s=x_1 \cdot w + b$}};
+\node [anchor=west,inner sep=2pt] (flabel3) at (slabel.east) {\footnotesize{$s_2=x_1 \cdot w_2 + b$}};
 \draw [->,thick,dotted] ([yshift=-0.3em,xshift=-0.1em]n11.60)  .. controls +(east:1) and +(west:2) ..  ([xshift=-0.2em]flabel.west) ;
 \begin{pgfonlayer}{background}

--- a/Chapter9/chapter9.tex
+++ b/Chapter9/chapter9.tex
--- a/bibliography.bib
+++ b/bibliography.bib
--- a/mt-book-xelatex.tex
+++ b/mt-book-xelatex.tex
@@ -136,11 +136,11 @@
 %\include{Chapter3/chapter3}
 %\include{Chapter4/chapter4}
 %\include{Chapter5/chapter5}
-\include{Chapter6/chapter6}
+%\include{Chapter6/chapter6}
 %\include{Chapter7/chapter7}
 %\include{Chapter8/chapter8}
 %\include{Chapter9/chapter9}
-%\include{Chapter10/chapter10}
+\include{Chapter10/chapter10}
 %\include{Chapter11/chapter11}
 %\include{Chapter12/chapter12}
 %\include{Chapter13/chapter13}

--- a/structure.tex
+++ b/structure.tex
@@ -686,5 +686,5 @@ addtohook={%
 \newcommand\chaptereighteen{第十八章}%*
 \newcommand\funp{}%函数P等使用，空是斜体，textrm是加粗
-\newcommand\vectorn{\mathbf}%向量N等使用
+\newcommand\vectorn{\textbf}%向量N等使用
-\newcommand\seq{\mathbf}%序列N等使用
+\newcommand\seq{}%序列N等使用