合并分支 'caorunzhe' 到 'zengxin'

Caorunzhe 查看合并请求 !654

合并分支 'caorunzhe' 到 'zengxin'
Caorunzhe 查看合并请求 !654
74e1894b · zengxin · 13959b0a · 29da486c · 74e1894b · 74e1894b
Commit 74e1894b authored Dec 21, 2020 by zengxin
--- a/Chapter1/chapter1.tex
+++ b/Chapter1/chapter1.tex
@@ -186,7 +186,7 @@
 \includegraphics[scale=0.3]{./Chapter1/Figures/figure-wmt-participation.jpg}
 \includegraphics[scale=0.3]{./Chapter1/Figures/figure-wmt-bestresults.jpg}
 \setlength{\belowcaptionskip}{-1.5em}
-    \caption{WMT\ 19国际机器翻译大赛（左：WMT\ 19参赛队伍；右：WMT\ 19各项目的最好分数结果）}
+    \caption{WMT\ 19国际机器翻译大赛（左：WMT\ 19参赛队伍；右：WMT\ 19各项目的最好分数）}
    \label{fig:1-5}
 \end{figure}
 %-------------------------------------------
@@ -200,7 +200,7 @@
 \sectionnewpage
 \section{机器翻译现状及挑战}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\parinterval 机器翻译技术发展到今天已经过无数次迭代，技术范式也经过若干次更替，近些年机器翻译的应用也如雨后春笋相继浮现。今天的机器翻译的质量究竟如何呢？乐观地说，在很多特定的条件下，机器翻译的译文结果是非常不错的，甚至可以接近人工翻译的结果。然而，在开放式翻译任务中，机器翻译的结果还并不完美。更严格来说，机器翻译的质量远没有达到人们所期望的程度。对于有些人提到的“机器翻译代替人工翻译”也并不是事实。比如，在高精度同声传译任务中，机器翻译仍需要更多打磨；再比如，针对于小说的翻译，机器翻译还无法做到与人工翻译媲美；甚至有人尝试用机器翻译系统翻译中国古代诗词，这里更多的是娱乐的味道。但是毫无疑问的是，机器翻译可以帮助人类，甚至有朝一日可以代替一些低端的人工翻译工作。
+\parinterval 机器翻译技术发展到今天已经过无数次迭代，技术范式也经过若干次更替，近些年机器翻译的应用也如雨后春笋相继浮现。今天的机器翻译的质量究竟如何呢？乐观地说，在很多特定的条件下，机器翻译的译文结果是非常不错的，甚至可以接近人工翻译的结果。然而，在开放式翻译任务中，机器翻译的结果还并不完美。更严格来说，机器翻译的质量远没有达到人们所期望的程度。对于有些人提到的“机器翻译将代替人工翻译”也并不是事实。比如，在高精度同声传译任务中，机器翻译仍需要更多打磨；再比如，针对于小说的翻译，机器翻译还无法做到与人工翻译媲美；甚至有人尝试用机器翻译系统翻译中国古代诗词，这里更多的是娱乐的味道。但是毫无疑问的是，机器翻译可以帮助人类，甚至有朝一日可以代替一些低端的人工翻译工作。
 \parinterval 图\ref{fig:1-6}展示了机器翻译和人工翻译质量的一个对比结果。在汉语到英语的新闻翻译任务中，如果对译文进行人工评价（五分制），那么机器翻译的译文得分为3.9分，人工译文得分为4.7分（人的翻译也不是完美的）。可见，在这个任务中机器翻译表现不错，但是与人还有一定差距。如果换一种方式评价，把人的译文作为参考答案，用机器翻译的译文与其进行比对（百分制），会发现机器翻译的得分只有47分。当然，这个结果并不是说机器翻译的译文质量很差，它更多的是表明机器翻译系统可以生成一些与人工翻译不同的译文，机器翻译也具有一定的创造性。这也类似于，很多围棋选手都想向AlphaGo学习，因为智能围棋系统也可以走出一些人类从未走过的妙招。
@@ -549,7 +549,7 @@
 \vspace{0.5em}
 \item EMNLP，全称Conference on Empirical Methods in Natural Language Processing，自然语言处理另一个顶级会议之一，由ACL当中对语言数据和经验方法有特殊兴趣的团体主办，始于1996年。会议比较偏重于方法和经验性结果。
 \vspace{0.5em}
-\item MT Summit，全称Machine Translation Summit，是机器翻译领域的重要峰会。该会议的特色是与产业结合，在探讨机器翻译技术问题的同时，更多的关注机器翻译的应用落地工作，因此备受产业界关注。该会议每两年举办一次，通常由欧洲机器翻译协会（The European Association for Machine Translation，EAMT）、美国机器翻译协会（The Association for Machine Translation in the Americas，AMTA）、亚洲-太平洋地区机器翻译协会（Asia-Pacific Association for Machine Translation，AAMT）。
+\item MT Summit，全称Machine Translation Summit，是机器翻译领域的重要峰会。该会议的特色是与产业结合，在探讨机器翻译技术问题的同时，更多的关注机器翻译的应用落地工作，因此备受产业界关注。该会议每两年举办一次，通常由欧洲机器翻译协会（The European Association for Machine Translation，EAMT）、美国机器翻译协会（The Association for Machine Translation in the Americas，AMTA）、亚洲-太平洋地区机器翻译协会（Asia-Pacific Association for Machine Translation，AAMT）举办。
 \vspace{0.5em}
 \item NAACL，全称Annual Conference of the North American Chapter of the Association for Computational Linguistics，为ACL北美分会，在自然语言处理领域也属于顶级会议，每年会选择一个北美城市召开会议。
 \vspace{0.5em}

--- a/Chapter13/Figures/figure-example-of-neural-machine-translation.png
+++ b/Chapter13/Figures/figure-example-of-neural-machine-translation.png
--- a/Chapter13/Figures/figure-true-case-of-adversarial-examples.jpg
+++ b/Chapter13/Figures/figure-true-case-of-adversarial-examples.jpg
--- a/Chapter13/chapter13.tex
+++ b/Chapter13/chapter13.tex
@@ -46,7 +46,7 @@
 \sectionnewpage
 \section{开放词表}
-\parinterval 人类表达语言的方式是十分多样的，这也体现在单词的构成上，甚至我们都无法想象数据中存在的不同单词的数量。比如，如果使用简单的分词策略，WMT、CCMT等评测数据的英文词表大小都会在100万以上。当然，这里面也包括很多的数字和字母的混合，还有一些组合词。不过，如果不加限制，机器翻译所面对的词表确实很``大''。这也会导致系统速度变慢，模型变大。更严重的问题是，测试数据中的一些单词根本就没有在训练数据中出现过，这时会出现OOV翻译问题，即系统无法对未见单词进行翻译。在神经机器翻译中，通常会考虑使用更小的翻译单元来缓解以上问题。
+\parinterval 人类表达语言的方式是十分多样的，这也体现在单词的构成上，甚至我们都无法想象数据中存在的不同单词的数量。即便使用分词策略，在WMT、CCMT等评测数据上，英文词表大小都会在100万以上。当然，这里面也包括很多的数字和字母的混合，还有一些组合词。不过，如果不加限制，机器翻译所面对的词表将会很“大”。这也会导致系统速度变慢，模型变大。更严重的问题是，测试数据中的一些单词根本就没有在训练数据中出现过，这时会出现OOV翻译问题，即系统无法对未见单词进行翻译。在神经机器翻译中，通常会考虑使用更小的翻译单元来缓解以上问题。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -54,18 +54,18 @@
 \subsection{大词表和OOV问题}
-\parinterval 首先来具体看一看神经机器翻译的大词表问题。神经机器翻译模型训练和解码都依赖于源语言和目标语言的词表。在建模中，词表中的每一个单词都会被转换为分布式（向量）表示，即词嵌入。这些向量会作为模型的输入（见第六章）。如果每个单词都对应一个向量，那么单词的各种变形（时态、语态等）都会导致词表和相应的向量数量的增加。图\ref{fig:7-7}展示了一些英语单词的时态语态变化。
+\parinterval 首先来具体看一看神经机器翻译的大词表问题。神经机器翻译模型训练和解码都依赖于源语言和目标语言的词表。在建模中，词表中的每一个单词都会被转换为分布式（向量）表示，即词嵌入。这些向量会作为模型的输入（见{\chapterten}）。如果每个单词都对应一个向量，那么单词的各种变形（时态、语态等）都会导致词表和相应的向量数量的增加。图\ref{fig:13-1}展示了一些英语单词的时态语态变化。
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter13/Figures/figure-word-change}
 \caption{单词时态、语态、单复数的变化}
-\label{fig:7-7}
+\label{fig:13-1}
 \end{figure}
 %----------------------------------------------
-\parinterval 如果要覆盖更多的翻译现象，词表会不断膨胀，并带来两个问题：
+\parinterval 如果要覆盖更多的翻译现象，词表会不断膨胀，并带来下述两个问题：
 \begin{itemize}
 \vspace{0.5em}
@@ -75,7 +75,7 @@
 \vspace{0.5em}
 \end{itemize}
-\parinterval 理想情况下，机器翻译应该是一个{\small\bfnew{开放词表}}\index{开放词表}（Open-Vocabulary）\index{Open-Vocabulary}的翻译任务。也就是，不论测试数据中包含什么样的词，机器翻译系统都应该能够正常翻译。但是，现实的情况是，即使不断扩充词表，也不可能覆盖所有可能的单词。这时就会出现OOV问题（集外词问题）。这个问题在使用受限词表时会更加严重，因为低频词和未见过的词都会被看作OOV单词。这时会将这些单词用<UNK>代替。通常，数据中<UNK>的数量会直接影响翻译性能，过多的<UNK>会造成欠翻译、结构混乱等问题。因此神经机器翻译需要额外的机制解决大词表和OOV问题。
+\parinterval 理想情况下，机器翻译应该是一个{\small\bfnew{开放词表}}\index{开放词表}（Open-Vocabulary）\index{Open-Vocabulary}的翻译任务。也就是，无论测试数据中包含什么样的词，机器翻译系统都应该能够正常翻译。但是，现实的情况是即使不断扩充词表，也不可能覆盖所有可能的单词，即OOV问题，或被称作集外词问题。这个问题在使用受限词表时会更加严重，因为低频词和未见过的词都会被看作OOV单词。这时会将这些单词用<UNK>代替。通常，数据中<UNK>的数量会直接影响翻译性能，过多的<UNK>会造成欠翻译、结构混乱等问题。因此神经机器翻译需要额外的机制解决大词表和OOV问题。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -83,20 +83,20 @@
 \subsection{子词}
-\parinterval 一种解决开放词表翻译问题的方法是改造输出层结构\upcite{garcia-martinez2016factored,DBLP:conf/acl/JeanCMB15}，比如，替换原始的Softmax层，用更加高效的神经网络结构进行超大规模词表上的预测。不过这类方法往往需要对系统进行修改，由于模型结构和训练方法的调整使得系统开发与调试的工作量增加。而且这类方法仍然无法解决OOV问题。因此在实用系统中并不常用。
+\parinterval 一种解决开放词表翻译问题的方法是改造输出层结构\upcite{garcia-martinez2016factored,DBLP:conf/acl/JeanCMB15}，比如，替换原始的Softmax层，用更加高效的神经网络结构进行超大规模词表上的预测。不过，由于模型结构和训练方法的调整使得系统开发与调试的工作量增加，因此使用这类方法时往往需要对系统进行修改，并且这类方法仍然无法解决OOV问题，因此在实用系统中并不常用。
-\parinterval 另一种思路是不改变机器翻译系统，而是从数据处理的角度来缓解OOV问题。既然使用单词会带来数据稀疏问题，那么自然会想到使用更小的单元。比如，把字符作为最小的翻译单元 \footnote{中文中字符可以被看作是汉字。} \ \dash \ 也就是基于字符的翻译模型\upcite{DBLP:journals/tacl/LeeCH17}。以英文为例，只需要构造一个包含26个英文字母、数字和一些特殊符号的字符表，便可以表示所有的单词。
+\parinterval 另一种思路是不改变机器翻译系统，而是从数据处理的角度来缓解OOV问题。既然使用单词会带来数据稀疏问题，那么自然会想到使用更小的单元。比如，把字符作为最小的翻译单元 \footnote{中文里的字符可以被看作是汉字。} \ \dash \ 也就是基于字符的翻译模型\upcite{DBLP:journals/tacl/LeeCH17}。以英文为例，只需要构造一个包含26个英文字母、数字和一些特殊符号的字符表，便可以表示所有的单词。
-\parinterval 但是字符级翻译也面临着新的问题\ \dash\ 使用字符增加了系统捕捉不同语言单元之间搭配的难度。假设平均一个单词由5个字符组成，所处理的序列长度便增大5倍。这使得具有独立意义的不同语言单元需要跨越更远的距离才能产生联系。此外，基于字符的方法也破坏了单词中天然存在的构词规律，或者说破坏了单词内字符的局部依赖。比如，英文单词``telephone''中的``tele''和``phone''都是有具体意义的词缀，但是如果把它们打散为字符就失去了这些含义。
+\parinterval 但是字符级翻译也面临着新的问题\ \dash\ 使用字符增加了系统捕捉不同语言单元之间搭配的难度。假设平均一个单词由5个字符组成，系统所处理的序列长度便增大5倍。这使得具有独立意义的不同语言单元需要跨越更远的距离才能产生联系。此外，基于字符的方法也破坏了单词中天然存在的构词规律，或者说破坏了单词内字符的局部依赖。比如，英文单词“telephone”中的“tele”和“phone”都是有具体意义的词缀，但是如果把它们打散为字符就失去了这些含义。
-\parinterval 那么有没有一种方式能够兼顾基于单词和基于字符方法的优点呢？常用的手段包括两种，一种是采用字词融合的方式构建词表，将未知单词转换为字符的序列并通过特殊的标记将其与普通的单词区分开来\upcite{luong2016acl_hybrid}。而另一种方式是将单词切分为{\small\bfnew{子词}}\index{子词}（Sub-word）\index{Sub-word}，它是介于单词和字符中间的一种语言单元表示形式。比如，将英文单词``doing''切分为``do''+``ing''。对于形态学丰富的语言来说，子词体现了一种具有独立意义的构词基本单元。比如，如图\ref{fig:7-8}，子词``do''，和``new''在可以用于组成其他不同形态的单词。
+\parinterval 那么有没有一种方式能够兼顾基于单词和基于字符方法的优点呢？常用的手段包括两种，一种是采用字词融合的方式构建词表，将未知单词转换为字符的序列并通过特殊的标记将其与普通的单词区分开来\upcite{luong2016acl_hybrid}。而另一种方式是将单词切分为{\small\bfnew{子词}}\index{子词}（Sub-word）\index{Sub-word}，它是介于单词和字符中间的一种语言单元表示形式。比如，将英文单词“doing”切分为“do”+“ing”。对于形态学丰富的语言来说，子词体现了一种具有独立意义的构词基本单元。比如，如图\ref{fig:13-2}，子词“do”，和“new”在可以用于组成其他不同形态的单词。
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter13/Figures/figure-word-root}
 \caption{不同单词共享相同的子词（前缀）}
-\label{fig:7-8}
+\label{fig:13-2}
 \end{figure}
 %----------------------------------------------
@@ -106,13 +106,13 @@
 \vspace{0.5em}
 \item 对原始数据进行分词操作；
 \vspace{0.5em}
-\item 构建子词词表；
+\item 构建符号合并表；
 \vspace{0.5em}
-\item 通过子词词表重新对数据中的单词进行切分。
+\item 通过合并表，将按字符切分的单词合并为字词组合。
 \vspace{0.5em}
 \end{itemize}
-\parinterval 这里面的核心是如何构建子词词表，下面对一些典型方法进行介绍。
+\parinterval 这里面的核心是如何构建符号合并表，下面对一些典型方法进行介绍。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -120,7 +120,7 @@
 \subsection{双字节编码（BPE）}
-\parinterval {\small\bfnew{字节对编码}}\index{字节对编码}或{\small\bfnew{双字节编码}}\index{双字节编码}（Byte Pair Encoding，BPE）\index{Byte Pair Encoding，BPE}是一种常用的子词词表构建方法\upcite{DBLP:conf/acl/SennrichHB16a}。BPE方法最早用于数据压缩，该方法将数据中常见的连续字符串替换为一个不存在的字符，之后通过构建一个替换关系的对应表，对压缩后的数据进行还原。机器翻译借用了这种思想，把子词切分看作是学习对自然语言句子进行压缩编码表示的问题\upcite{Gage1994ANA}。其目的是，保证编码后的结果（即子词切分）占用的字节尽可能少。这样，子词单元会尽可能被不同单词复用，同时又不会因为使用过小的单元造成子词切分序列过长。使用BPE算法构建子词词表可以分为如下几个步骤：
+\parinterval {\small\bfnew{字节对编码}}\index{字节对编码}或{\small\bfnew{双字节编码}}\index{双字节编码}（Byte Pair Encoding\index{Byte Pair Encoding}，BPE）是一种常用的子词词表构建方法\upcite{DBLP:conf/acl/SennrichHB16a}。BPE方法最早用于数据压缩，该方法将数据中常见的连续字符串替换为一个不存在的字符，之后通过构建一个替换关系的对应表，对压缩后的数据进行还原。机器翻译借用了这种思想，把子词切分看作是学习对自然语言句子进行压缩编码表示的问题\upcite{Gage1994ANA}。其目的是，保证编码后的结果（即子词切分）占用的字节尽可能少。这样，子词单元会尽可能被不同单词复用，同时又不会因为使用过小的单元造成子词切分序列过长。使用BPE算法构建符号合并表可以分为如下几个步骤：
 \begin{itemize}
 \vspace{0.5em}
@@ -128,7 +128,7 @@
 \vspace{0.5em}
 \item 将分词后的每个单词进行进一步切分，划分为字符序列。同时，在每个单词结尾添加结束符<e>用于标记单词的边界。之后，统计该单词在数据中出现的次数。例如单词low在数据中出现了5次，可以将其记为`l o w <e>:'5。
 \vspace{0.5em}
-\item 对得到的字符集合进行统计，统计每个单词中2-gram符号出现的频次 \footnote{发生合并前，一个字符便是一个符号}。之后，选择最高频的2-gram符号，将其合并为新的符号，即新的子词。例如``A''和``B''连续出现的频次最高，则以``AB''替换所有单词内连续出现的``A''和``B''并将其加入子词词表。这样，``AB''会被作为一个整体，在之后的过程中可以与其他符号进一步合并。需要注意的是替换和合并不会跨越单词的边界，即只对单个单词进行替换和合并。
+\item 对得到的字符集合进行统计，统计每个单词中2-gram符号出现的频次 \footnote{发生合并前，一个字符便是一个符号}。之后，选择最高频的2-gram符号，将其合并为新的符号，即新的子词。例如“A”和“B”连续出现的频次最高，则以“AB”替换所有单词内连续出现的“A”和“B”并将其加入子词词表。这样，“AB”会被作为一个整体，在之后的过程中可以与其他符号进一步合并。需要注意的是替换和合并不会跨越单词的边界，即只对单个单词进行替换和合并。
 \vspace{0.5em}
 \item 不断重复上一步骤，直到子词词表大小达到预定的大小或者下一个最高频的2-gram字符的频次为1。子词词表大小是BPE的唯一的参数，它用来控制上述子词合并的规模。
 \vspace{0.5em}
@@ -143,19 +143,9 @@
 %----------------------------------------------
 \end{itemize}
-\parinterval 图\ref{fig:7-9}给出了BPE算法执行的实例。在执行合并操作时，需要考虑不同的情况。假设词表中存在子词``ab''和``cd''，此时要加入子词``abcd''。可能会出现如下的情况：
+\parinterval 图\ref{fig:7-9}给出了BPE算法执行的实例。其中预先设定的合并表的大小为10
-\begin{itemize}
+\parinterval 在得到了符号合并表后，便需要对用字符表示的单词进行合并，得到以子词形式表示的文本。首先，将单词切分为以字符表示的符号序列，并在尾部加上终结符。然后按照符号合并表的顺序依次遍历，如果存在相同的2-gram符号组合，则对其进行合并，直至遍历结束。图1.4给出了一个使用字符合并表对单词进行子词切分的实例。红色单元为每次合并后得到的新符号，直至无法合并，或遍历结束，得到最终的合并结果。其中每一个单元为一个子词，如图\ref{fig:7-10} 。{\red{图有问题}}
-\vspace{0.5em}
-\item 若``ab''、``cd''、``abcd''完全独立，彼此的出现互不影响，将``abcd''加入词表，词表数目$+1$；
-\vspace{0.5em}
-\item 若``ab''和``cd''必同时出现则词表中加入``abcd''，去除``ab''和``cd''，词表数目$-1$。这个操作是为了较少词表中的冗余；
-\vspace{0.5em}
-\item 若出现``ab''，其后必出现``cd''，但是``cd''却可以作为独立的子词出现，则将``abcd''加入词表，去除``ab''，反之亦然，词表数目不变。
-\vspace{0.5em}
-\end{itemize}
-\parinterval 在得到了子词词表后，便需要对单词进行切分。BPE要求从较长的子词开始替换。首先，对子词词表按照字符长度从大到小进行排序。然后，对于每个单词，遍历子词词表，判断每个子词是不是当前词的子串，若是则进行替换切分。将单词中所有的子串替换为子词后，如果仍有子串未被替换，则将其用<UNK>代替，如图\ref{fig:7-10} 。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -168,9 +158,9 @@
 \parinterval 由于模型的输出也是子词序列，因此需要对最终得到的翻译结果进行子词还原，即将子词形式表达的单元重新组合为原本的单词。这一步操作也十分简单，只需要不断的将每个子词向后合并，直至遇到表示单词边界的结束符<e>，便得到了一个完整的单词。
-\parinterval 使用BPE方法的策略有很多。不仅可以单独对源语言和目标语言进行子词的切分，也可以联合源语言和目标语言，共同进行子词切分，被称作Joint-BPE\upcite{DBLP:conf/acl/SennrichHB16a}。单语BPE比较简单直接，而Joint-BPE则可以增加两种语言子词切分的一致性。对于相似语系中的语言，如英语和德语，常使用Joint-BPE的方法联合构建词表。而对于中英这些差异比较大的语种，则需要独立的进行子词切分。
+\parinterval 使用BPE方法的策略有很多。不仅可以单独对源语言和目标语言进行子词的切分，也可以联合源语言和目标语言，共同进行子词切分，被称作Joint-BPE\upcite{DBLP:conf/acl/SennrichHB16a}。单语BPE比较简单直接，而Joint-BPE则可以增加两种语言子词切分的一致性。对于相似语系中的语言，如英语和德语，常使用Joint-BPE的方法联合构建词表。而对于中英这些差异比较大的语种，则需要独立的进行子词切分。使用子词表示句子的方法可以有效的平衡词汇量，增大对未见单词的覆盖度。像英译德、汉译英任务，使用16k或者32k的子词词表大小便能取得很好的效果。
-\parinterval BPE还有很多变种方法。在进行子词切分时，BPE从最长的子词开始进行切分。这个启发性规则可以保证切分结果的唯一性，实际上，在对一个单词用同一个子词词表切分时，可能存在多种切分方式，如hello，我们可以分割为``hell''和``o''，也可以分割为``h''和``ello''。这种切分的多样性可以来提高神经机器翻译系统的健壮性\upcite{DBLP:conf/acl/Kudo18}。而在T5等预训练模型中\upcite{DBLP:journals/jmlr/RaffelSRLNMZLL20}则使用了基于字符级别的BPE。此外，尽管BPE被命名为字节对编码，实际上一般处理的是Unicode编码，而不是字节。在预训练模型GPT2中，也探索了字节级别的BPE，在机器翻译、问答等任务中取得了很好的效果\upcite{radford2019language}。
+\parinterval BPE还有很多变种方法。在进行子词切分时，BPE从最长的子词开始进行切分。这个启发性规则可以保证切分结果的唯一性，实际上，在对一个单词用同一个子词词表切分时，可能存在多种切分方式，如hello，我们可以分割为“hell”和“o”，也可以分割为“h”和“ello”。这种切分的多样性可以用来提高神经机器翻译系统的健壮性\upcite{DBLP:conf/acl/Kudo18}。{\red 而在T5等预训练模型中\upcite{DBLP:journals/jmlr/RaffelSRLNMZLL20}，则使用了基于字符级别的BPE。此外，尽管BPE被命名为字节对编码，实际上一般处理的是Unicode编码，而不是字节。在预训练模型GPT2中，也探索了字节级别的BPE，在机器翻译、问答等任务中取得了很好的效果\upcite{radford2019language}}。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -178,11 +168,21 @@
 \subsection{其他方法}
-\parinterval 与基于统计的BPE算法不同，基于Word Piece和1-gram Language Model（ULM）的方法则是利用语言模型进行子词词表的构造\upcite{DBLP:conf/acl/Kudo18}。本质上，基于语言模型的方法和基于BPE的方法的思路是一样的，即通过合并字符和子词不断生成新的子词。它们的区别仅在于合并子词的方式不同。基于BPE的方法选择出现频次最高的连续字符2-gram合并为新的子词，而基于语言模型的方法则是根据语言模型输出的概率选择要合并哪些子词。
+\parinterval 与基于统计的BPE算法不同，基于Word Piece的子词切分方法则是利用语言模型进行子词词表的构造\upcite{DBLP:conf/icassp/SchusterN12}。本质上，基于语言模型的方法和基于BPE的方法的思路是一样的，即通过合并字符和子词不断生成新的子词。它们的区别仅在于合并子词的方式不同。{\red 基于BPE的方法选择出现频次最高的连续字符2-gram合并为新的子词}，而基于语言模型的方法则是根据语言模型输出的概率选择要合并哪些子词。具体来说，基于Word Piece的方法首先将句子切割为字符表示的形式\upcite{DBLP:conf/icassp/SchusterN12}，并利用该数据训练一个1-gram语言模型，记为$\textrm{log}\funp{P}(\cdot)$。假设两个相邻的子词单元$a$和$b$被合并为新的子词$c$，则整个句子的语言模型得分的变化为$\triangle=\textrm{log}\funp{P}(c)-\textrm{log}\funp{P}(a)-\textrm{log}\funp{P}(b)$。这样，可以不断的选择使$\triangle$最大的两个子词单元进行合并，直到达到预设的词表大小或者句子概率的增量低于某个阈值。
-\parinterval 具体来说，基于Word Piece的方法首先将句子切割为字符表示的形式\upcite{DBLP:conf/icassp/SchusterN12}，并利用该数据训练一个1-gram语言模型，记为$\textrm{logP}(\cdot)$。假设两个相邻的子词单元$a$和$b$被合并为新的子词$c$，则整个句子的语言模型得分的变化为$\triangle=\textrm{logP}(c)-\textrm{logP}(a)-\textrm{logP}(b)$。这样，可以不断的选择使$\triangle$最大的两个子词单元进行合并，直到达到预设的词表大小或者句子概率的增量低于某个阈值。而ULM方法以最大化整个句子的概率为目标构建词表\upcite{DBLP:conf/acl/Kudo18}，具体实现上也不同于基于Word Piece的方法，这里不做详细介绍。
+\parinterval 目前比较主流的子词切分方法都是作用于分词后的序列，对一些没有明显词边界且资源稀缺的语种并不友好。相比之下，SentencePiece可以作用于未经过分词处理的输入序列\upcite{kudo2018sentencepiece}，同时囊括了双字节编码和语言模型的子词切分方法，更加灵活易用。
-\parinterval 使用子词表示句子的方法可以有效的平衡词汇量，增大对未见单词的覆盖度。像英译德、汉译英任务，使用16k或者32k的子词词表大小便能取得很好的效果。
+\parinterval 通过上述子词切分方法，可以缓解OOV的问题，允许模型利用到一些词法上的信息。然而主流的BPE子词切分方法中，每个单词都对应一种唯一的子词切分方式，因此输入的数据经过子词切分后的序列表示也是唯一的。在给定词表的情况下，每句话仍然存在多种切分方式。而经过现有BPE处理后的序列，模型只能接收到单一的表示，可能会阻止模型更好地学习词的组成，不能充分利用单词中的形态学特征。此外，针对切分错误的输入数据表现不够鲁棒，常常会导致整句话的翻译效果极差。为此，研究人员提出一些正则化方法\upcite{DBLP:conf/acl/Kudo18,provilkov2020bpe}。
+\begin{itemize}
+\vspace{0.5em}
+\item 子词正则化方法\upcite{DBLP:conf/acl/Kudo18}。其思想是在训练过程中扰乱确定的子词边界，根据1-gram Language Model{\red （ULM）}采样出多种子词切分候选。通过最大化整个句子的概率为目标构建词表。在实现上，与上述基于Word Piece的方法略有不同，这里不做详细介绍。
+\vspace{0.5em}
+\item BPE-Dropout\upcite{provilkov2020bpe}。在训练时，通过在合并过程中按照一定概率$p${\red（这个p能不能改成P）}（介于0与1之间）随机丢弃一些可行的合并操作，从而产生不同的子词切分结果，进而增强模型健壮性。而在推断阶段，将p设置为0，等同于标准的BPE。总的来说，上述方法相当于在子词的粒度上对输入的序列进行扰动，进而达到鲁棒性训练的目的。在之后的小节中同样会针对鲁棒性训练进行详细介绍。
+\vspace{0.5em}
+\item DPE\upcite{he2020dynamic}。引入了混合字符-子词的切分方式，将句子的子词分割方式看作一种潜变量，该结构能够利用{\small\bfnew{动态规划}}\index{动态规划}（Dynamic Programming）\index{Dynamic Programming}的思想精确地将潜在的子字片段边缘化。解码端的输入是基于字符表示的目标语序列，推断时将每个时间步的输出映射到预先设定好的子词词表之上，得到当前最可能得子词结果。若当前子词长度为$m$，则接下来的$m$个时间步的输入为该子词，并在$m$个时间步后得到下一个切分的子词。
+\vspace{0.5em}
+\end{itemize}
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -193,7 +193,7 @@
 \parinterval {\small\bfnew{正则化}}\index{正则化}（Regularization）\index{Regularization}是机器学习中的经典技术，通常用于缓解{\small\bfnew{过拟合问题}}\index{过拟合问题}（The Overfitting Problem）\index{Overfitting Problem}。正则化的概念源自线性代数和代数几何。在实践中，它更多的是指对{\small\bfnew{反问题}}\index{反问题}（The Inverse Problem）\index{Inverse Problem}的一种求解方式。假设输入$x$和输出$y$之间存在一种映射$f$
 \begin{eqnarray}
-y = f(x)
+y &=& f(x)
 \label{eq:13-1}
 \end{eqnarray}
@@ -224,7 +224,7 @@ y = f(x)
 \parinterval 正则化的一种实现是在训练目标中引入一个正则项。在神经机器翻译中，引入正则项的训练目标为：
 \begin{eqnarray}
-\widehat{\mathbf{w}}=\argmax_{\mathbf{w}}L(\mathbf{w}) + \lambda R(\mathbf{w})
+\widehat{\mathbf{w}} &=& \argmax_{\mathbf{w}}L(\mathbf{w}) + \lambda R(\mathbf{w})
 \label{eq:13-2}
 \end{eqnarray}
@@ -254,7 +254,7 @@ R(\mathbf{w}) & = & (\big| |\mathbf{w}| {\big|}_2)^2 \\
 \parinterval 从几何的角度看，L1和L2正则项都是有物理意义的。二者都可以被看作是空间上的一个区域，比如，在二维平面上，L1范数表示一个以0点为中心的矩形，L2范数表示一个以0点为中心的圆。因此，优化问题可以被看作是在两个区域（$L(\mathbf{w})$和$R(\mathbf{w})$）叠加在一起所形成的区域上进行优化。由于L1和L2正则项都是在0点（坐标原点）附近形成的区域，因此优化的过程可以确保参数不会偏离0点太多。也就是说，L1和L2正则项引入了一个先验：模型的解不应该离0点太远。而L1和L2正则项实际上是在度量这个距离。
-\parinterval 那为什么要用L1和L2正则项惩罚离0点远的解呢？这还要从模型复杂度谈起。实际上，对于神经机器翻译这样的模型来说，模型的容量是足够的。所谓容量可以被简单的理解为独立参数的个数 \footnote{关于模型容量，在\ref{section-13.2}节会有进一步讨论。}。也就是说，理论上存在一种模型可以完美的描述问题。但是，从目标函数拟合的角度来看，如果一个模型可以拟合很复杂的目标函数，那模型所表示的函数形态也会很复杂。这往往体现在模型中参数的值``偏大''。比如，用一个多项式函数拟合一些空间中的点，如果希望拟合得很好，各个项的系数往往是非零的。而且为了对每个点进行拟合，通常需要多项式中的某些项具有较大的系数，以获得函数在局部有较大的斜率。显然，这样的模型是很复杂的。而模型的复杂度可以用函数中的参数（比如多项式中各项的系数）的``值''进行度量，体现出来就是模型参数的范数。
+\parinterval 那为什么要用L1和L2正则项惩罚离0点远的解呢？这还要从模型复杂度谈起。实际上，对于神经机器翻译这样的模型来说，模型的容量是足够的。所谓容量可以被简单的理解为独立参数的个数 \footnote{关于模型容量，在\ref{section-13.2}节会有进一步讨论。}。也就是说，理论上存在一种模型可以完美的描述问题。但是，从目标函数拟合的角度来看，如果一个模型可以拟合很复杂的目标函数，那模型所表示的函数形态也会很复杂。这往往体现在模型中参数的值“偏大”。比如，用一个多项式函数拟合一些空间中的点，如果希望拟合得很好，各个项的系数往往是非零的。而且为了对每个点进行拟合，通常需要多项式中的某些项具有较大的系数，以获得函数在局部有较大的斜率。显然，这样的模型是很复杂的。而模型的复杂度可以用函数中的参数（比如多项式中各项的系数）的“值”进行度量，体现出来就是模型参数的范数。
 \parinterval 因此，L1和L2正则项的目的是防止模型为了匹配少数（噪声）样本而导致模型的参数过大。反过来说，L1和L2正则项会鼓励那些参数值在0点附近的情况。从实践的角度看，这种方法可以很好的对统计模型的训练进行校正，得到泛化能力更强的模型。
@@ -266,15 +266,15 @@ R(\mathbf{w}) & = & (\big| |\mathbf{w}| {\big|}_2)^2 \\
 \parinterval 神经机器翻译在每个目标语位置$j$会输出一个分布$y_j$，这个分布描述了每个目标语言单词出现的可能性。在训练时，每个目标语言位置上的答案是一个单词，也就对应了One-hot分布$\tilde{y}_j$，它仅仅在正确答案那一维为1，其它维均为0。模型训练可以被看作是一个调整模型参数让$y_j$逼近$\tilde{y}_j$的过程。但是，$\tilde{y}_j$的每一个维度是一个非0即1的目标，这样也就无法考虑类别之间的相关性。具体来说，除非模型在答案那一维输出1，否则都会得到惩罚。即使模型把一部分概率分配给与答案相近的单词（比如同义词），这个相近的单词仍被视为完全错误的预测。
-\parinterval {\small\bfnew{标签平滑}}\index{标签平滑}（Label Smoothing）\index{Label Smoothing}的思想很简单\upcite{Szegedy_2016_CVPR}：答案所对应的单词不应该``独享''所有的概率，其它单词应该有机会作为答案。这个观点与第二章中语言模型的平滑非常类似。在复杂模型的参数估计中，往往需要给未见或者低频事件分配一些概率，以保证模型具有更好的泛化能力。具体实现时，标签平滑使用了一个额外的分布$q$，它是在词汇表$V$ 上的一个均匀分布，即$q(k)=\frac{1}{|V|}$，其中$q(k)$表示分布的第$k$维。然后，答案分布被重新定义为$\tilde{y}_j$和$q$的线性插值：
+\parinterval {\small\bfnew{标签平滑}}\index{标签平滑}（Label Smoothing）\index{Label Smoothing}的思想很简单\upcite{Szegedy_2016_CVPR}：答案所对应的单词不应该“独享”所有的概率，其它单词应该有机会作为答案。这个观点与第二章中语言模型的平滑非常类似。在复杂模型的参数估计中，往往需要给未见或者低频事件分配一些概率，以保证模型具有更好的泛化能力。具体实现时，标签平滑使用了一个额外的分布$q$，它是在词汇表$V$ 上的一个均匀分布，即$q(k)=\frac{1}{|V|}$，其中$q(k)$表示分布的第$k$维。然后，答案分布被重新定义为$\tilde{y}_j$和$q$的线性插值：
 \begin{eqnarray}
-y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
+y_{j}^{ls} &=& (1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \label{eq:13-5}
 \end{eqnarray}
 \noindent 这里$\alpha$表示一个系数，用于控制分布$q$的重要性。$y_{j}^{ls}$会被作为最终的答案分布用于模型的训练。
-\parinterval 标签平滑实际上定义了一种``软''标签，使得所有标签都可以分到一些概率。一方面可以缓解数据中噪声的影响，另一方面目标分布会更合理（显然，真实的分布不应该是One-hot分布）。图\ref{fig:13-12}展示了标签平滑前后的损失函数计算结果的对比。
+\parinterval 标签平滑实际上定义了一种“软”标签，使得所有标签都可以分到一些概率。一方面可以缓解数据中噪声的影响，另一方面目标分布会更合理（显然，真实的分布不应该是One-hot分布）。图\ref{fig:13-12}展示了标签平滑前后的损失函数计算结果的对比。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -293,9 +293,9 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \subsection{Dropout}
-\parinterval 神经机器翻译模型是一种典型的多层神经网络模型。每一层网络都包含若干神经元，负责接收前一层所有神经元的输出，并进行诸如乘法、加法等变换，并有选择的使用非线性的激活函数，最终得到当前层每个神经元的输出。从模型最终预测的角度看，每个神经元都在参与最终的预测。理想的情况下，我们希望每个神经元都能相互独立的做出``贡献''。这样的模型会更加健壮，因为即使一部分神经元不能正常工作，其它神经元仍然可以独立做出合理的预测。但是，随着每一层神经元数量的增加以及网络结构的复杂化，研究者发现神经元之间会出现{\small\bfnew{相互适应}}\index{相互适应}（Co-Adaptation）\index{Co-Adaptation}的现象。所谓相互适应是指，一个神经元对输出的贡献与同一层其它神经元的行为是相关的，也就是说这个神经元已经适应到它周围的``环境''中。
+\parinterval 神经机器翻译模型是一种典型的多层神经网络模型。每一层网络都包含若干神经元，负责接收前一层所有神经元的输出，并进行诸如乘法、加法等变换，并有选择的使用非线性的激活函数，最终得到当前层每个神经元的输出。从模型最终预测的角度看，每个神经元都在参与最终的预测。理想的情况下，我们希望每个神经元都能相互独立的做出“贡献”。这样的模型会更加健壮，因为即使一部分神经元不能正常工作，其它神经元仍然可以独立做出合理的预测。但是，随着每一层神经元数量的增加以及网络结构的复杂化，研究者发现神经元之间会出现{\small\bfnew{相互适应}}\index{相互适应}（Co-Adaptation）\index{Co-Adaptation}的现象。所谓相互适应是指，一个神经元对输出的贡献与同一层其它神经元的行为是相关的，也就是说这个神经元已经适应到它周围的“环境”中。
-\parinterval 相互适应的好处在于神经网络可以处理更加复杂的问题，因为联合使用两个神经元要比单独使用每个神经元的表示能力强。这也类似于传统机器学习任务中往往会设计一些高阶特征，比如自然语言序列标注中对bi-gram和tri-gram的使用。不过另一方面，相互适应会导致模型变得更加``脆弱''。因为相互适应的神经元可以更好的描述训练数据中的现象，但是在测试数据上，由于很多现象是未见的，细微的扰动会导致神经元无法适应。具体体现出来就是过拟合问题。
+\parinterval 相互适应的好处在于神经网络可以处理更加复杂的问题，因为联合使用两个神经元要比单独使用每个神经元的表示能力强。这也类似于传统机器学习任务中往往会设计一些高阶特征，比如自然语言序列标注中对bi-gram和tri-gram的使用。不过另一方面，相互适应会导致模型变得更加“脆弱”。因为相互适应的神经元可以更好的描述训练数据中的现象，但是在测试数据上，由于很多现象是未见的，细微的扰动会导致神经元无法适应。具体体现出来就是过拟合问题。
 \parinterval Dropout是解决这个问题的一种常用方法\upcite{DBLP:journals/corr/abs-1207-0580}。方法很简单，在训练时随机让一部分神经元停止工作，这样每次参数更新中每个神经元周围的环境都在变化，它就不会过分适应到环境中。图\ref{fig:13-13}中给出了某一次参数训练中使用Dropout之前和之后的状态对比。
@@ -308,7 +308,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \end{figure}
 %----------------------------------------------
-\parinterval 具体实现时，可以设置一个参数$p\in (0,1)$。在每次参数更新所使用的前向和反向计算中，每个神经元都以概率$p$停止工作。相当于每层神经网络会有以$p$为比例的神经元被``屏蔽''掉。每一次参数更新中会随机屏蔽不同的神经元。图\ref{fig:13-14}给出了Dropout方法和传统方法计算方式的对比。
+\parinterval 具体实现时，可以设置一个参数$p\in (0,1)$。在每次参数更新所使用的前向和反向计算中，每个神经元都以概率$p$停止工作。相当于每层神经网络会有以$p$为比例的神经元被“屏蔽”掉。每一次参数更新中会随机屏蔽不同的神经元。图\ref{fig:13-14}给出了Dropout方法和传统方法计算方式的对比。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -329,7 +329,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \subsection{Layer Dropout}
-\parinterval 随时网络层数的增多，相互适应也会出现在不同层之间。特别是在引入残差网络之后，不同层的输出可以进行线性组合，因此不同层之间的相互影响会更加直接。对于这个问题，也可以使用Dropout的思想对不同层进行屏蔽。比如，可以使用一个开关来控制一个层能否发挥作用，这个开关以概率$p$被随机关闭，即该层有为$p$的可能性不工作。图\ref{fig:13-15}展示了Transformer多层网络引入Layer Dropout 前后的情况。可以看到，使用Layer Dropout后，开关M会被随机打开或者关闭，以达到屏蔽某一层计算的目的。由于使用了残差网络，关闭每一层相当于``跳过''这一层网络，因此Layer Dropout并不会影响神经网络中数据流的传递。
+\parinterval 随时网络层数的增多，相互适应也会出现在不同层之间。特别是在引入残差网络之后，不同层的输出可以进行线性组合，因此不同层之间的相互影响会更加直接。对于这个问题，也可以使用Dropout的思想对不同层进行屏蔽。比如，可以使用一个开关来控制一个层能否发挥作用，这个开关以概率$p$被随机关闭，即该层有为$p$的可能性不工作。图\ref{fig:13-15}展示了Transformer多层网络引入Layer Dropout 前后的情况。可以看到，使用Layer Dropout后，开关M会被随机打开或者关闭，以达到屏蔽某一层计算的目的。由于使用了残差网络，关闭每一层相当于“跳过”这一层网络，因此Layer Dropout并不会影响神经网络中数据流的传递。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -340,7 +340,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \end{figure}
 %----------------------------------------------
-\parinterval Layer Dropout可以被理解为在一个深网络（即原始网络）中随机采样出一个由若干层网络构成的``浅''网络。不同``浅''网络所对应的同一层的模型参数是共享的。这也达到了对指数级子网络高效训练的目的。需要注意的是，在推断阶段，每层的输出需要乘以$1-p$，确保训练时每层输出的期望和解码是一致的。Layer Dropout可以非常有效的缓解深层网路中的过拟合问题。在\ref{subsection-13.2}节还会看到Layer Dropout可以成功地帮助我们训练Deep Transformer模型。
+\parinterval Layer Dropout可以被理解为在一个深网络（即原始网络）中随机采样出一个由若干层网络构成的“浅”网络。不同“浅”网络所对应的同一层的模型参数是共享的。这也达到了对指数级子网络高效训练的目的。需要注意的是，在推断阶段，每层的输出需要乘以$1-p$，确保训练时每层输出的期望和解码是一致的。Layer Dropout可以非常有效的缓解深层网路中的过拟合问题。在\ref{subsection-13.2}节还会看到Layer Dropout可以成功地帮助我们训练Deep Transformer模型。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -349,11 +349,11 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \sectionnewpage
 \section{增大模型容量}\label{section-13.2}
-\parinterval 神经机器翻译是一种典型的多层神经网络。一方面，可以通过设计合适的网络连接方式和激活函数来捕捉复杂的翻译现象；另一方面，越来越多的可用数据让模型能够得到更有效的训练。在训练数据较为充分的情况下，设计更加``复杂''的模型成为了提升系统性能的有效手段。比如，Transformer模型有两个常用配置Transformer-Base和Transformer-Big。其中，Transformer-Big比Transformer-Base使用了更多的神经元，相应的翻译品质更优\upcite{NIPS2017_7181}。
+\parinterval 神经机器翻译是一种典型的多层神经网络。一方面，可以通过设计合适的网络连接方式和激活函数来捕捉复杂的翻译现象；另一方面，越来越多的可用数据让模型能够得到更有效的训练。在训练数据较为充分的情况下，设计更加“复杂”的模型成为了提升系统性能的有效手段。比如，Transformer模型有两个常用配置Transformer-Base和Transformer-Big。其中，Transformer-Big比Transformer-Base使用了更多的神经元，相应的翻译品质更优\upcite{NIPS2017_7181}。
 \parinterval 那么是否还有类似的方法可以改善系统性能呢？答案显然是肯定的。这里，把这类方法统称为基于大容量模型的方法。在传统机器学习的观点中，神经网络的性能不仅依赖于架构设计，同样与容量密切相关。那么什么是模型的{\small\bfnew{容量}}\index{容量}（Capacity）\index{Capacity}？简单理解，容量是指神经网络的参数量，即神经元之间连接权重的个数。另一种定义是把容量看作神经网络所能表示的假设空间大小\upcite{DBLP:journals/nature/LeCunBH15}，也就是神经网络能表示的不同函数所构成的空间。
-\parinterval 而学习一个神经网络就是要找到一个``最优''的函数，它可以准确地拟合数据。当假设空间变大时，训练系统有机会找到更好的函数，但是同时也需要依赖更多的训练样本才能完成最优函数的搜索。相反，当假设空间变小时，训练系统会更容易完成函数搜索，但是很多优质的函数可能都没有被包含在假设空间里。这也体现了一种简单的辩证思想：如果训练（搜索）的代价高，会有更大的机会找到更好的解；另一方面，如果想少花力气进行训练（搜索），那就设计一个小一些的假设空间，在小一些规模的样本集上进行训练，当然搜索到的解可能不是最好的。
+\parinterval 而学习一个神经网络就是要找到一个“最优”的函数，它可以准确地拟合数据。当假设空间变大时，训练系统有机会找到更好的函数，但是同时也需要依赖更多的训练样本才能完成最优函数的搜索。相反，当假设空间变小时，训练系统会更容易完成函数搜索，但是很多优质的函数可能都没有被包含在假设空间里。这也体现了一种简单的辩证思想：如果训练（搜索）的代价高，会有更大的机会找到更好的解；另一方面，如果想少花力气进行训练（搜索），那就设计一个小一些的假设空间，在小一些规模的样本集上进行训练，当然搜索到的解可能不是最好的。
 \parinterval 在很多机器翻译任务中，训练数据是相对充分的。这时增加模型容量是提升性能的一种很好的选择。常见的方法有三种：
@@ -401,7 +401,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \subsection{深网络}
-\parinterval 虽然，理论上宽网络有能力拟合任意的函数，但是获得这种能力的代价是非常高的。在实践中，往往需要增加相当的宽度，以极大的训练代价才能换来少量的性能提升。当神经网络达到一定宽度后这种现象更为严重。``无限''增加宽度显然是不现实的。
+\parinterval 虽然，理论上宽网络有能力拟合任意的函数，但是获得这种能力的代价是非常高的。在实践中，往往需要增加相当的宽度，以极大的训练代价才能换来少量的性能提升。当神经网络达到一定宽度后这种现象更为严重。“无限”增加宽度显然是不现实的。
 \parinterval 因此，另一种思路是使用更深的网络以增加模型的容量。深网络是指包含更多层的神经网络。相比宽网络的参数量随着宽度呈平方增长，深网络的参数量随着深度呈线性增长。这带给深网络一个优点：在同样参数量下可以通过更多的非线性变换来对问题进行描述。这也赋予了深网络对复杂问题建模的能力。比如，在图像识别领域，很多先进的系统都是基于很深的神经网络，甚至在一些任务上最好的的结果需要1000 层以上的神经网络。
@@ -432,7 +432,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \end{figure}
 %----------------------------------------------
-\parinterval 不过，深网络容易发生梯度消失和梯度爆炸问题。因此在使用深网络时，训练策略的选择是至关重要的。实际上，标准的Transformer模型已经是不太``浅''的神经网络，因此里面使用了残差连接来缓解梯度消失等问题。此外，为了避免过拟合，深层网络的训练也要与Dropout等正则化策略相配合，并且需要设计恰当的参数初始化方法和学习率调整策略。关于构建深层神经机器翻译的方法，本章\ref{subsection-7.5.1}节还会有进一步讨论。
+\parinterval 不过，深网络容易发生梯度消失和梯度爆炸问题。因此在使用深网络时，训练策略的选择是至关重要的。实际上，标准的Transformer模型已经是不太“浅”的神经网络，因此里面使用了残差连接来缓解梯度消失等问题。此外，为了避免过拟合，深层网络的训练也要与Dropout等正则化策略相配合，并且需要设计恰当的参数初始化方法和学习率调整策略。关于构建深层神经机器翻译的方法，本章\ref{subsection-7.5.1}节还会有进一步讨论。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -442,7 +442,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \parinterval 如前所述，神经机器翻译的原始输入是单词序列，包括源语言端和目标语言端。模型中的输入层将这种离散的单词表示转换成实数向量的表示，也就是常说的{\small\bfnew{词嵌入}}\index{词嵌入}（Embedding）\index{Embedding}。从实现的角度来看，输入层其实就是从一个词嵌入矩阵中提取对应的词向量表示，这个矩阵两个维度大小分别对应着词表大小和词嵌入的维度。词嵌入的维度也代表着模型对单词刻画的能力。因此适当增加词嵌入的维度也是一种增加模型容量的手段。通常，词嵌入和隐藏层的维度是一致的，这种设计也是为了便于系统实现。
-\parinterval 当然，并不是说词嵌入的维度一定越大就越好。本质上，词嵌入是要在一个多维空间上有效的区分含有不同语义的单词。如果词表较大，更大的词嵌入维度会更有意义，因为需要更多的``特征''描述更多的语义。当词表较小时，增大词嵌入维度可能不会带来增益，相反会增加系统计算的负担。另一种策略是，动态选择词嵌入维度，比如，对于高频词使用较大的词嵌入维度，而对于低频词则使用较小的词嵌入维度\upcite{DBLP:conf/iclr/BaevskiA19}。这种方法可以用同样的参数量处理更大的词表。
+\parinterval 当然，并不是说词嵌入的维度一定越大就越好。本质上，词嵌入是要在一个多维空间上有效的区分含有不同语义的单词。如果词表较大，更大的词嵌入维度会更有意义，因为需要更多的“特征”描述更多的语义。当词表较小时，增大词嵌入维度可能不会带来增益，相反会增加系统计算的负担。另一种策略是，动态选择词嵌入维度，比如，对于高频词使用较大的词嵌入维度，而对于低频词则使用较小的词嵌入维度\upcite{DBLP:conf/iclr/BaevskiA19}。这种方法可以用同样的参数量处理更大的词表。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -458,7 +458,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \subsection{大批量训练}
-\parinterval 在第六章已经介绍了神经机器翻译模型需要使用梯度下降方法进行训练。其中，一项非常重要的技术就是{\small\bfnew{小批量训练}}\index{小批量训练}（Mini-batch Training）\index{Mini-batch Training}，即每次使用多个样本来获取梯度并对模型参数进行更新。这里将每次参数更新使用的多个样本集合称为批次，将样本的数量称作批次的大小。在机器翻译中，通常用批次中的源语言/目标语言单词数或者句子数来表示批次大小。理论上，过小的批次会带来训练的不稳定，而且参数更新次数会大大增加。因此，很多研究者尝试增加批次大小来提高训练的稳定性。在Transformer模型中，使用更大的批次已经被验证是有效的。这种方法也被称作大批量训练。不过，这里所谓`` 大''批量是一个相对的概念。下面就一起看一看如何使用合适的批次大小来训练神经机器翻译模型。
+\parinterval 在第六章已经介绍了神经机器翻译模型需要使用梯度下降方法进行训练。其中，一项非常重要的技术就是{\small\bfnew{小批量训练}}\index{小批量训练}（Mini-batch Training）\index{Mini-batch Training}，即每次使用多个样本来获取梯度并对模型参数进行更新。这里将每次参数更新使用的多个样本集合称为批次，将样本的数量称作批次的大小。在机器翻译中，通常用批次中的源语言/目标语言单词数或者句子数来表示批次大小。理论上，过小的批次会带来训练的不稳定，而且参数更新次数会大大增加。因此，很多研究者尝试增加批次大小来提高训练的稳定性。在Transformer模型中，使用更大的批次已经被验证是有效的。这种方法也被称作大批量训练。不过，这里所谓“ 大”批量是一个相对的概念。下面就一起看一看如何使用合适的批次大小来训练神经机器翻译模型。
 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -524,7 +524,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \item 按词数构建批次：对比按照句长生成批次，按词数生成批次可以防止某些批次中句子整体长度特别长或者特别短的情况，保证不同批次之间整体的词数处于大致相同的范围，这样所得到的梯度也是可比较的。通常的做法是根据源语言词数、目标语言词数，或者源语言词数与目标语言词数的最大值等指标生成批次。
 \vspace{0.5em}
-\item 按课程学习的方式：考虑样本的``难度''也是生成批次的一种策略。比如，可以使用{\small\bfnew{课程学习}}\index{课程学习}（Curriculum Learning）\index{Curriculum Learning} 的思想\upcite{DBLP:conf/icml/BengioLCW09}，让系统先学习``简单''的样本，之后逐渐增加样本的难度，达到循序渐进的学习。具体来说，可以利用句子长度、词频等指标计算每个批次的``难度''，记为$d$。 之后，选择满足$d \leq c$的样本构建一个批次。这里，$c$表示难度的阈值，它可以随着训练的执行不断增大。
+\item 按课程学习的方式：考虑样本的“难度”也是生成批次的一种策略。比如，可以使用{\small\bfnew{课程学习}}\index{课程学习}（Curriculum Learning）\index{Curriculum Learning} 的思想\upcite{DBLP:conf/icml/BengioLCW09}，让系统先学习“简单”的样本，之后逐渐增加样本的难度，达到循序渐进的学习。具体来说，可以利用句子长度、词频等指标计算每个批次的“难度”，记为$d$。 之后，选择满足$d \leq c$的样本构建一个批次。这里，$c$表示难度的阈值，它可以随着训练的执行不断增大。
 \vspace{0.5em}
 \end{itemize}
@@ -535,24 +535,133 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \sectionnewpage
 \section{对抗样本训练}
+\parinterval 同其它基于神经网络的方法一样，提高{\small\bfnew{鲁棒性}}\index{鲁棒性}（Robustness）\index{Robustness}也是神经机器翻译研发中需要关注的。比如，大容量模型可以很好的拟合训练数据，但是当测试样本与训练样本差异较大时，会导致很糟糕的翻译结果\upcite{JMLR:v15:srivastava14a,DBLP:conf/amta/MullerRS20}。另一方面，实践中也发现，有些情况下即使输入中有微小的扰动，神经网络模型的输出也会产生巨大变化。或者说，神经网络模型在输入样本上容易受到{\small\bfnew{攻击}}\index{攻击}（Attack）\index{Attack}\upcite{DBLP:conf/sp/Carlini017,DBLP:conf/cvpr/Moosavi-Dezfooli16,DBLP:conf/acl/ChengJM19}。图\ref{fig:13-19}展示了一个神经机器翻译系统的结果，可以看到，输入中把单词“他”换成“她”会造成完全不同的译文。这时神经机器翻译系统就存在鲁棒性问题。
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.5]{./Chapter13/Figures/figure-example-of-neural-machine-translation.png}
+\caption{神经机器翻译实例}
+\label{fig:13-19}
+\end{figure}
+%----------------------------------------------
+\parinterval 决定神经网络模型鲁棒性的因素主要包括训练数据、网络结构、正则化方法等。仅仅从网络结构设计和训练算法优化的角度来改善鲁棒性一般是较为困难的，因为如果输入数据是“干净”的，模型就会学习在这样的数据上进行预测。无论模型的能力是强还是弱，当推断时输入数据出现扰动的时候，模型可能无法适应，因为它从未见过这种新的数据。因此，一种简单直接的方法是从训练样本出发，让模型在学习的过程中能对样本中的扰动进行处理，进而在推断时具有更强的鲁棒性。具体来说，可以在训练过程中构造有噪声的样本，即基于{\small\bfnew{对抗样本}}\index{对抗样本}（Adversarial Examples）\index{Adversarial Examples}进行{\small\bfnew{对抗训练}}\index{对抗训练}（Adversarial Training）\index{Adversarial Training}。
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{对抗样本及对抗攻击}
+\parinterval 在图像识别领域，研究人员就发现，对于输入图像的细小扰动，如像素变化等，会使模型以高置信度给出错误的预测\upcite{DBLP:conf/cvpr/NguyenYC15,DBLP:journals/corr/SzegedyZSBEGF13,DBLP:journals/corr/GoodfellowSS14}，但是这种扰动并不会影响人类的判断。也就是说，样本中的微小变化“欺骗”了图像识别系统，但是“欺骗”不了人类。这种现象背后的原因有很多，例如，一种可能的原因是：系统并没有理解图像，而是在拟合数据，因此拟合能力越强，反而对数据中的微小变化更加敏感。从统计学习的角度看，既然新的数据中可能会有扰动，那更好的学习方式就是在训练中显性地把这种扰动建模出来，让模型对输入的细微变化更加鲁棒。
+\parinterval 这种对原样本上增加一些难以察觉的扰动从而使模型的到错误判断的样本，被称为对抗样本。对于输入$\mathbi{x}$和输出$\mathbi{y}$，对抗样本形式上可以被描述为：{\red s.t.的形式再确认一下}
+\begin{eqnarray}
+\funp{C}(\mathbi{x}) &=& \mathbi{y} 
+\label{eq:13-6}\\
+\funp{C}(\mathbi{x}) &\neq& \mathbi{y} 
+\label{eq:13-7}\\
+\textrm{s.t.} \quad \funp{R}(\mathbi{x},\mathbi{x}') &<& \varepsilon
+\label{eq:13-8}
+\end{eqnarray}
+\noindent 其中，$(\mathbi{x},\mathbi{y})$为原样本，$(\mathbi{x}',\mathbi{y})$为输入中含有扰动的对抗样本，函数$\funp{C}(\cdot)$为模型。公式\eqref{eq:13-8}中$\funp{R}(\mathbi{x},\mathbi{x}')$表示扰动后的输入$\mathbi{x}'$和原输入$\mathbi{x}$之间的距离，$\varepsilon$表示扰动的受限范围当模型对包含噪声的数据容易给出较差的结果时，往往意味着模型的抗干扰能力差，因此可以利用对抗样本检测现有模型的鲁棒性\upcite{DBLP:conf/emnlp/JiaL17}。同时，采用类似数据增强的方式将对抗样本混合至训练数据中，能够帮助模型学习到更普适的特征使模型的到稳定的输出，这种方式也被称为对抗训练\upcite{DBLP:journals/corr/GoodfellowSS14,DBLP:conf/emnlp/BekoulisDDD18,DBLP:conf/naacl/YasunagaKR18}。
+\parinterval 通过对抗样本训练提升模型鲁棒性的首要问题是如何生成对抗样本。通过当前模型$\funp{C}$，和样本$(\mathbi{x},\mathbi{y})$，生成对抗样本的过程，被称为{\small\bfnew{对抗攻击}}\index{对抗攻击}（Adversarial Attack）\index{Adversarial Attack}。对抗攻击可以被分为两种，分别是黑盒攻击和白盒攻击。在白盒攻击中，攻击算法可以访问模型的完整信息，包括模型结构、网络参数、损失函数、激活函数、输入和输出数据等。而黑盒攻击不需要知道神经网络的详细信息，仅仅通过访问模型的输入和输出达到攻击目的，因此 通常依赖启发式方法来生成对抗样本。由于神经网络本身便是一个黑盒模型，研究人员对模型内部的参数干预度有限，因此黑盒攻击在许多实际应用中更加实用。
+\parinterval 在神经机器翻译中，输入中细小的扰动经常会使模型变得脆弱\upcite{DBLP:conf/iclr/BelinkovB18}。但是图像和文本之间存在着一定的差异，图像中的对抗攻击方法难以直接应用于自然语言处理领域，这是由于以像素值等表示的图像数据是连续的\upcite{DBLP:conf/naacl/MichelLNP19}{\red （如何理解连续？）}，而文本中的一个个单词本身离散的。因此简单替换这些离散的单词，可能会生成语法错误或者语义错误的句子。这时的扰动过大，模型很容易判别，因此无法涵盖原始问题。即使对词嵌入等连续表示部分进行扰动，也会产生无法与词嵌入空间中的任何词匹配的问题\upcite{Gong2018AdversarialTW}。针对这些问题，下面着重介绍神经机器翻译任务中有效的生成和利用对抗样本的方法。
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{基于黑盒攻击的方法}
+\parinterval 一个好的对抗样本应该具有一些性质，如：对文本做最少的修改，并最大程度地保留原文的语义。这些可以通过对文本加噪声的方式来来实现，分为自然噪声和人工噪声\upcite{DBLP:conf/iclr/BelinkovB18}。自然噪声一般是指人为的在语料库中收集自然出现的错误，如输入错误，拼写错误等，构建可用的词汇替换表，在文本中加入噪声。人为噪声则可以通过多种方式来处理文本。如可以通过在干净的数据中通过固定的规则或是使用噪声生成器以一定的概率引入不同类型的噪声，如：拼写、表情符号、语法错误等\upcite{DBLP:conf/naacl/VaibhavSSN19,DBLP:conf/naacl/AnastasopoulosL19,DBLP:conf/acl/SinghGR18}；此外，也可以在文本中加入精心设计，或毫无意义的单词序列，以此来分散模型的注意，通过不同的算法来确定插入的位置和内容，这种方式常用于阅读理解任务中\upcite{DBLP:conf/emnlp/JiaL17}。
+\parinterval 除了单纯的在文本中引入各种扰动外，还可以通过文本编辑的方式构建对抗样本，在不改变语义的情况下尽可能的修改文本\upcite{DBLP:journals/corr/SamantaM17,DBLP:conf/ijcai/0002LSBLS18}，从而生成对抗样本。文本的编辑方式主要包括交换，插入，替换和删除操作。图\ref{fig:13-20}给出了一些通过上述方式生成的对抗样本。
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.5]{./Chapter13/Figures/figure-true-case-of-adversarial-examples.jpg}
+\caption{对抗样本实例{\red（需要让读者真实的感受一下）}}
+\label{fig:13-20}
+\end{figure}
+%----------------------------------------------
+\parinterval 形式上可以利用FGSM\upcite{DBLP:journals/corr/GoodfellowSS14}等算法验证文本中每一个单词对语义的贡献度，同时为每一个单词构建其候选池，包括单词的近义词，拼写错误词，同音词等。对于贡献度较低的词如语气词，副词等，可以通过插入，删除操作进行扰动。对于文本序列中其他的单词可以进行在候选池中选择相应的单词进行替换。对于交换操作可以基于词级别交换序列中的单词，也可以是基于字符级的交换单词中的字符\upcite{DBLP:conf/coling/EbrahimiLD18}。重复的进行不同的编辑操作，直至误导模型做出错误的判断。
+\parinterval 基于语义的方法除了通过不同的算法进行修改输入生成对抗样本外，也可以通过神经网络模型增加扰动，例如在机器翻译中常用的回译技术也是生成对抗样本的一种有效方式，通过反向模型将目标语言翻译成源语言，并再次应用于神经机器翻译系统的训练。除了翻译模型，语言模型也可以用于生成对抗样本。前面也已经介绍过语言模型可以用于检测句子流畅度，根据上文预测当前位置可能出现的单词，因此可以通过语言模型根据上文预测出当前位置最可能出现的多个单词，与序列中的原本的单词进行替换。在机器翻译任务中，可以通过与神经机器翻译系统联合训练，共享词向量矩阵的方式得到语言模型。{\red （引用）}
+\parinterval 此外，生成{\small\bfnew{对抗网络}}\index{对抗网络}（Generative Adversarial Networks\index{Generative Adversarial Networks}, GANs）也可以被用来生成对抗样本\upcite{DBLP:conf/iclr/ZhaoDS18}。与回译方法类似，GAN的方法将原始的输入映射为潜在分布$\funp{P}$，并在其中搜索出服从相同分布的文本构成对抗样本。一些研究正也对这种方法进行了优化\upcite{DBLP:conf/iclr/ZhaoDS18}，在稠密的向量空间$z$中进行搜索，也就是说在定义$\funp{P}$的基础稠密向量空间中找到对抗性表示$\mathbi{z}'$，然后利用生成模型将其映射回$\mathbi{x}'$，使最终生成的对抗样本在语义上接近原始输入。{\red（既然GAN不是主流，可以考虑把这部分放到拓展阅读中）}
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{基于白盒攻击的方法}
+\parinterval 除了在离散的词汇级别的增加扰动，还可以在模型内部增加扰动。这里简单介绍一下利用白盒攻击方法增加模型鲁棒性的方法：
+\begin{itemize}
+\vspace{0.5em}
+\item 与利用词向量的余弦相似度选择近义词对当前词进行替换类似，可以对于每一个词都在其词嵌入表示的基础上累加了一个从正太分布，之后将其作为模型的最终输入。同时，可以在训练目标中增加额外的训练目标，比如，迫使模型在接收到被扰动的输入后，编码端生成与正常输入类似的表示，解码端输出正确的翻译结果\upcite{DBLP:conf/acl/LiuTMCZ18}。
+\vspace{0.5em}
+\item 除了引入标准的噪声外，还可以根据模型所存在的具体问题， 构建不同的扰动。例如，针对输入中包含同音字错误导致模型输出误差较大的问题，可以将单词的发音转换为一个包含$n$个发音单元，如音素，音节等的发音序列，并训练相应的嵌入矩阵将每一个发音单元转换为对应的向量表示。对发音序列中的发音单元的嵌入表示进行平均后，得到当前单词的发音表示。最后将词嵌入与单词的发音表示，加权求和后的结果作为模型的输入\upcite{DBLP:conf/acl/LiuMHXH19}。通过这种方式可以提高模型对同音异形词的鲁棒性，得到更准确的翻译结果。此外除了在词嵌入层增加扰动，同样有研究人员证明了，在端到端模型的中的编码端输出中引入额外的噪声，能起到与在层输入中增加扰动生成对抗样本进行对抗训练类似的效果，增强了模型训练的鲁棒性\upcite{DBLP:conf/acl/LiLWJXZLL20}。
+\vspace{0.5em}
+\item 此外还有一类基于梯度的方法来生成对抗样本进行对抗性训练。例如，可以通过最大化替换词与原始单词词向量之间的差值和候选词的梯度向量之间的相似度来生成对抗样本\upcite{DBLP:conf/acl/ChengJM19}，具体的计算方式如下：{\red 下面的是sin还是sim，而且文字中是正弦把？下面三角是不是delta}
+\begin{eqnarray}
+{\mathbi{x}'}_i &=& \arg\max_{\mathbi{x}\in \nu_{\mathbi{x}}}\textrm{sim}(\funp{e}(\mathbi{x})-\funp{e}(\mathbi{x}_i),\mathbi{g}_{\mathbi{x}_i}) 
+\label{eq:13-9} \\
+\mathbi{g}_{\mathbi{x}_i} &=&  \Delta_{\funp{e}(\mathbi{x}_i)} - \log \funp{P}(\mathbi{y}|\mathbi{x};\theta)
+\label{eq:13-10}
+\end{eqnarray}
+\noindent 其中，$\mathbi{x}_i$为输入中第$i$个词，$\mathbi{g}_{\mathbi{x}_i}$为对应的梯度向量，$\funp{e}(\cdot)$用于获取词向量，$\textrm{sim}(\cdot,\cdot)$用于评估两个向量之间的余弦距离{\red（很多符号没有解释，$∇_(e(x_i))$是什么？等等）}。$\nu_{\mathbi{x}}$为源语的词表，但是由于对词表中所有单词进行枚举，计算成本较大，因此利用语言模型选择最可能的$n$个词作为候选，进而缩减匹配范围，通过此方式对采样出的源语词进行替换。同时为了保护模型不受解码器预测误差的影响，对模型目标端的输入同样做了调整，方法与源语端类似，不同的地方在于将公式\eqref{eq:13-10}中的损失替换为$- \log \funp{P}(\mathbi{y}|\mathbi{x}')$，利用语言模型选择候选和采样的方式也做出了相应的调整。在进行对抗性训练时，在原有的训练损失上增加了三个额外的损失，最终的训练目标为：
+\begin{eqnarray}
+Loss(\theta_{\textrm{mt}},\theta_{\textrm{lm}}^{\mathbi{x}},\theta_{\textrm{lm}}^{\mathbi{y}}) &=& Loss_{\textrm{clean}}(\theta_{\textrm{mt}}) + Loss_{\textrm{lm}}(\theta_{\textrm{lm}}^{\mathbi{x}}) + \nonumber \\
+& & Loss_{\textrm{robust}}(\theta_{\textrm{mt}}) + Loss_{\textrm{lm}}(\theta_{\textrm{lm}}^{\mathbi{y}})
+\label{eq:13-11}
+\end{eqnarray}
+\noindent 其中分别$Loss_{\textrm{clean}}(\theta_{\textrm{mt}})$为正常情况下的损失，$Loss_{\textrm{lm}}(\theta_{\textrm{lm}}^{\mathbi{x}})$和$Loss_{\textrm{lm}}(\theta_{\textrm{lm}}^{\mathbi{y}})$生成对抗样本所用到的源语言与目标语言语言模型的损失，$Loss_{\textrm{robust}}(\theta_{\textrm{mt}})$是以对源语言和目标端做出修改后得到的对抗样本作为输入，原始的目标语作为答案计算得到的损失，其具体形式如下：
+\begin{eqnarray}
+Loss_{\textrm{robust}}(\theta_{\textrm{mt}}) &=&  \frac{1}{\textrm{S}}\sum_{(\mathbi{x},\mathbi{y}\in \textrm{S})}-\log \funp{P}(\mathbi{y}|\mathbi{x}',\mathbi{z}';\theta_{\textrm{mt}})
+\label{eq:13-11}
+\end{eqnarray}
+\noindent 这里$\textrm{S}$代表整体的语料库。通过上述方式达到对抗训练的目的。
+\vspace{0.5em}
+\end{itemize}
+\parinterval 不论是黑盒方法还是白盒方法，本质上都是通过增加噪声使得模型训练更加健壮。类似的思想在很多机器学习方法都有体现，比如，最大熵模型中使用高斯噪声就是常用的增加模型健壮性的手段之一\upcite{chen1999gaussian}{\red 这篇文章再找一下}。而在深度学习时代下，对抗训练将问题定义为：有意识地构造系统容易出错的样本，进而更加有效的增加系统抗干扰的能力。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
 %----------------------------------------------------------------------------------------
 \sectionnewpage
-\section{最小风险训练}
+\section{高级模型训练技术}
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------
-\subsection{增强学习方法}
+\subsection{极大似然估计的问题}
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{非teacher-forcing方法}
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------
-\subsection{曝光偏置问题}
+\subsection{增强学习方法}
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -561,11 +670,11 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \sectionnewpage
 \section{知识精炼}\label{subsection-7.5.3}
-\parinterval 理想的机器翻译系统应该是品质好、速度块、存储占用少。不过现实的机器翻译系统往往需要用运行速度和存储空间来换取翻译品质，比如，\ref{subsection-7.3.2}节提到的增大模型容量的方法就是通过增加模型参数量来达到更好的函数拟合效果，但是这也导致系统变得更加笨拙。在很多场景下，这样的模型甚至无法使用。比如，Transformer-Big等``大''模型通常在专用GPU服务器上运行，在手机等受限环境下仍很难应用。
+\parinterval 理想的机器翻译系统应该是品质好、速度块、存储占用少。不过现实的机器翻译系统往往需要用运行速度和存储空间来换取翻译品质，比如，\ref{subsection-7.3.2}节提到的增大模型容量的方法就是通过增加模型参数量来达到更好的函数拟合效果，但是这也导致系统变得更加笨拙。在很多场景下，这样的模型甚至无法使用。比如，Transformer-Big等“大”模型通常在专用GPU服务器上运行，在手机等受限环境下仍很难应用。
-\parinterval 另一方面，直接训练``小''模型的效果往往并不理想，其翻译品质与``大''模型相比仍有比较明显的差距。比如，在Transformer中，使用一个48层的编码器要比传统的6层编码器在BLEU上高出1-2个点，而且两者翻译结果的人工评价的区别也十分明显。
+\parinterval 另一方面，直接训练“小”模型的效果往往并不理想，其翻译品质与“大”模型相比仍有比较明显的差距。比如，在Transformer中，使用一个48层的编码器要比传统的6层编码器在BLEU上高出1-2个点，而且两者翻译结果的人工评价的区别也十分明显。
-\parinterval 面对小模型难以训练的问题，一种有趣的想法是把``大''模型的知识传递给``小''模型，让``小''模型可以更好的进行学习。这类似于，教小孩子学习数学，是请一个权威数学家（数据中的标准答案），还是请一个小学数学教师（``大''模型）。这就是知识精炼的基本思想。
+\parinterval 面对小模型难以训练的问题，一种有趣的想法是把“大”模型的知识传递给“小”模型，让“小”模型可以更好的进行学习。这类似于，教小孩子学习数学，是请一个权威数学家（数据中的标准答案），还是请一个小学数学教师（“大”模型）。这就是知识精炼的基本思想。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -573,23 +682,23 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \subsection{什么是知识精炼}
-\parinterval 通常，知识精炼可以被看作是一种知识迁移的手段\upcite{Hinton2015Distilling}。如果把``大''模型的知识迁移到``小''模型，这种方法的直接结果就是{\small\bfnew{模型压缩}}\index{模型压缩}（Model Compression）\index{Model Compression}。当然，理论上也可以把``小''模型的知识迁移到``大''模型，比如，将迁移后得到的``大''模型作为初始状态，之后继续训练该模型，以期望取得加速收敛的效果。不过，在实践中更多是使用``大''模型到``小''模型的迁移，这也是本节讨论的重点。
+\parinterval 通常，知识精炼可以被看作是一种知识迁移的手段\upcite{Hinton2015Distilling}。如果把“大”模型的知识迁移到“小”模型，这种方法的直接结果就是{\small\bfnew{模型压缩}}\index{模型压缩}（Model Compression）\index{Model Compression}。当然，理论上也可以把“小”模型的知识迁移到“大”模型，比如，将迁移后得到的“大”模型作为初始状态，之后继续训练该模型，以期望取得加速收敛的效果。不过，在实践中更多是使用“大”模型到“小”模型的迁移，这也是本节讨论的重点。
 \parinterval 知识精炼基于两个假设：
 \begin{itemize}
 \vspace{0.5em}
-\item ``知识''在模型间是可迁移的。也就是说，一个模型中蕴含的规律可以被另一个模型使用。最典型的例子就是预训练模型（见\ref{subsection-7.2.6}）。使用单语数据学习到的表示模型，在双语的翻译任务中仍然可以发挥很好的作用。也就是，把单语语言模型学习到的知识迁移到双语翻译中对句子表示的任务中；
+\item “知识”在模型间是可迁移的。也就是说，一个模型中蕴含的规律可以被另一个模型使用。最典型的例子就是预训练模型（见\ref{subsection-7.2.6}）。使用单语数据学习到的表示模型，在双语的翻译任务中仍然可以发挥很好的作用。也就是，把单语语言模型学习到的知识迁移到双语翻译中对句子表示的任务中；
 \vspace{0.5em}
-\item 模型所蕴含的``知识''比原始数据中的``知识''更容易被学习到。比如，机器翻译中大量使用的回译（伪数据）方法，就把模型的输出作为数据让系统进行学习。
+\item 模型所蕴含的“知识”比原始数据中的“知识”更容易被学习到。比如，机器翻译中大量使用的回译（伪数据）方法，就把模型的输出作为数据让系统进行学习。
 \vspace{0.5em}
 \end{itemize}
-\parinterval 这里所说的第二个假设对应了机器学习中的一大类问题\ \dash \ {\small\bfnew{学习难度}}\index{学习难度}（Learning Difficulty）\index{Learning Difficulty}。所谓难度是指：在给定一个模型的情况下，需要花费多少代价对目标任务进行学习。如果目标任务很简单，同时模型与任务很匹配，那学习难度就会降低。如果目标任务很复杂，同时模型与其匹配程度很低，那学习难度就会很大。在自然语言处理任务中，这个问题的一种表现是：在很好的数据中学习的模型的翻译质量可能仍然很差。即使训练数据是完美的，但是模型仍然无法做到完美的学习。这可能是因为建模的不合理，导致模型无法描述目标任务中复杂的规律。也就是纵然数据很好，但是模型学不到其中的``知识''。在机器翻译中这个问题体现的尤为明显。比如，在机器翻译系统$n$-best结果中挑选最好的译文（成为Oracle）作为训练样本让系统重新学习，系统仍然达不到Oracle的水平。
+\parinterval 这里所说的第二个假设对应了机器学习中的一大类问题\ \dash \ {\small\bfnew{学习难度}}\index{学习难度}（Learning Difficulty）\index{Learning Difficulty}。所谓难度是指：在给定一个模型的情况下，需要花费多少代价对目标任务进行学习。如果目标任务很简单，同时模型与任务很匹配，那学习难度就会降低。如果目标任务很复杂，同时模型与其匹配程度很低，那学习难度就会很大。在自然语言处理任务中，这个问题的一种表现是：在很好的数据中学习的模型的翻译质量可能仍然很差。即使训练数据是完美的，但是模型仍然无法做到完美的学习。这可能是因为建模的不合理，导致模型无法描述目标任务中复杂的规律。也就是纵然数据很好，但是模型学不到其中的“知识”。在机器翻译中这个问题体现的尤为明显。比如，在机器翻译系统$n$-best结果中挑选最好的译文（成为Oracle）作为训练样本让系统重新学习，系统仍然达不到Oracle的水平。
-\parinterval 知识精炼本身也体现了一种``自学习''的思想。即利用模型（自己）的预测来教模型（自己）。这样既保证了知识可以向更轻量的模型迁移，同时也避免了模型从原始数据中学习难度大的问题。虽然``大''模型的预测中也会有错误，但是这种预测是更符合建模的假设的，因此``小''模型反倒更容易从不完美的信息中学习\footnote[15]{很多时候，``大''模型和``小''模型都是基于同一种架构，因此二者对问题的假设和模型结构都是相似的。}到更多的知识。类似于，刚开始学习围棋的人从职业九段身上可能什么也学不到，但是向一个业余初段的选手学习可能更容易入门。另外，也有研究表明：在机器翻译中，相比于``小''模型，``大''模型更容易进行优化，也更容易找到更好的模型收敛状态。因此在需要一个性能优越，存储较小的模型时，也会考虑将大模型压缩得到更轻量模型的手段\upcite{DBLP:journals/corr/abs-2002-11794}。
+\parinterval 知识精炼本身也体现了一种“自学习”的思想。即利用模型（自己）的预测来教模型（自己）。这样既保证了知识可以向更轻量的模型迁移，同时也避免了模型从原始数据中学习难度大的问题。虽然“大”模型的预测中也会有错误，但是这种预测是更符合建模的假设的，因此“小”模型反倒更容易从不完美的信息中学习\footnote[15]{很多时候，“大”模型和“小”模型都是基于同一种架构，因此二者对问题的假设和模型结构都是相似的。}到更多的知识。类似于，刚开始学习围棋的人从职业九段身上可能什么也学不到，但是向一个业余初段的选手学习可能更容易入门。另外，也有研究表明：在机器翻译中，相比于“小”模型，“大”模型更容易进行优化，也更容易找到更好的模型收敛状态。因此在需要一个性能优越，存储较小的模型时，也会考虑将大模型压缩得到更轻量模型的手段\upcite{DBLP:journals/corr/abs-2002-11794}。
-\parinterval 通常把``大''模型看作的传授知识的``教师''，被称作{\small\bfnew{教师模型}}\index{教师模型}（Teacher Model）\index{Teacher Model}；把``小''模型看作是接收知识的``学生''，被称作{\small\bfnew{学生模型}}\index{学生模型}（Student Model）\index{Student Model}。比如，可以把Transformer-Big看作是教师模型，把Transformer-Base看作是学生模型。
+\parinterval 通常把“大”模型看作的传授知识的“教师”，被称作{\small\bfnew{教师模型}}\index{教师模型}（Teacher Model）\index{Teacher Model}；把“小”模型看作是接收知识的“学生”，被称作{\small\bfnew{学生模型}}\index{学生模型}（Student Model）\index{Student Model}。比如，可以把Transformer-Big看作是教师模型，把Transformer-Base看作是学生模型。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -651,7 +760,7 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
 \begin{itemize}
 \vspace{0.5em}
-\item 固定教师模型，通过减少模型容量的方式设计学生模型。比如，可以使用容量较大的模型作为教师模型（如：Transformer-Big或Transformer-Deep），然后通过将神经网络变``窄''、变``浅''的方式得到学生模型。我们可以用Transformer-Big做教师模型，然后把Transformer-Big的解码器变为一层网络，作为学生模型。
+\item 固定教师模型，通过减少模型容量的方式设计学生模型。比如，可以使用容量较大的模型作为教师模型（如：Transformer-Big或Transformer-Deep），然后通过将神经网络变“窄”、变“浅”的方式得到学生模型。我们可以用Transformer-Big做教师模型，然后把Transformer-Big的解码器变为一层网络，作为学生模型。
 \vspace{0.5em}
 \item 固定学生模型，通过模型集成的方式设计教师模型。可以组合多个模型生成更高质量的译文（见\ref{subsection-7.4.3}节）。比如，融合多个Transformer-Big模型（不同参数初始化方式），之后学习一个Transformer-Base模型。
 \vspace{0.5em}
@@ -683,3 +792,7 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})
 \sectionnewpage
 \section{小结及深入阅读}
+\parinterval 对抗样本除了用于提高模型的鲁棒性之外，还有很多其他的应用场景。其中最主要的便是用于评估模型。通过构建由对抗样本构造的数据集，可以验证模型对于不同类型噪声鲁棒性\upcite{DBLP:conf/emnlp/MichelN18}。正是由于对抗样本在检测和提高模型鲁棒性具有明显的效果，因此很多的研究人员在针对不同的任务提出了很多有效的方法。但是在生成对抗样本时常常要注意或考虑很多问题，比如扰动是否足够细微，在人类难以察觉的同时做到欺骗模型的目的，对抗样本在不同的模型结构或数据集上是否具有足够的泛化能力。生成的方法是否足够高效等等。 
--- a/Chapter14/Figures/figure-comparison-of-different-attention-method.tex
+++ b/Chapter14/Figures/figure-comparison-of-different-attention-method.tex
@@ -21,7 +21,7 @@
 \node[layernode,anchor=north] (layer11) at ([yshift=-\hseg]layer01.south) {};
 \node[attnnode,anchor=south] (attn11) at ([yshift=0.1\hnode]layer11.south) {};
-\node[anchor=north west,inner sep=4pt,font=\small] () at (attn11.north west) {Attention};
+\node[anchor=north west,inner sep=4pt,font=\small] () at (attn11.north west) {注意力};
 \node[anchor=south,inner sep=0pt] (out11) at ([yshift=0.3\hseg]attn11.north) {$\cdots$};
 \node[thinnode,anchor=south west,thick,draw=dblue,text=black] (q11) at ([xshift=0.1\wseg,yshift=0.2\hseg]attn11.south west) {$Q^n$};
 \node[thinnode,anchor=south,thick,draw=orange,text=black] (k11) at ([yshift=0.2\hseg]attn11.south) {$K^n$};
@@ -41,7 +41,7 @@
 \node[layernode,anchor=north] (layer12) at ([yshift=-\hseg]layer02.south) {};
 \node[attnnode,anchor=south] (attn12) at ([yshift=0.1\hnode]layer12.south) {};
-\node[anchor=north west,inner sep=4pt,font=\small] () at (attn12.north west) {Attention};
+\node[anchor=north west,inner sep=4pt,font=\small] () at (attn12.north west) {注意力};
 \node[anchor=south,inner sep=0pt] (out12) at ([yshift=0.3\hseg]attn12.north) {$\cdots$};
 \node[thinnode,anchor=south west,thick,draw=dblue!40,text=black!40] (q12) at ([xshift=0.1\wseg,yshift=0.2\hseg]attn12.south west) {$Q^n$};
 \node[thinnode,anchor=south,thick,draw=orange!40,text=black!40] (k12) at ([yshift=0.2\hseg]attn12.south) {$K^n$};
@@ -61,7 +61,7 @@
 \node[layernode,anchor=north] (layer13) at ([yshift=-\hseg]layer03.south) {};
 \node[attnnode,anchor=south] (attn13) at ([yshift=0.1\hnode]layer13.south) {};
-\node[anchor=north west,inner sep=4pt,font=\small] () at (attn13.north west) {Attention};
+\node[anchor=north west,inner sep=4pt,font=\small] () at (attn13.north west) {注意力};
 \node[anchor=south,inner sep=0pt] (out13) at ([yshift=0.3\hseg]attn13.north) {$\cdots$};
 \node[thinnode,anchor=south west,thick,draw=dblue!40,text=black!40] (q13) at ([xshift=0.1\wseg,yshift=0.2\hseg]attn13.south west) {$Q^n$};
 \node[thinnode,anchor=south,thick,draw=orange!40,text=black!40] (k13) at ([yshift=0.2\hseg]attn13.south) {$K^n$};
@@ -84,7 +84,7 @@
    {
        \node[layernode,anchor=north] (layer\i\j) at ([yshift=-0.8\hseg]layer\k\j.south) {};
        \node[attnnode,anchor=south] (attn\i\j) at ([yshift=0.1\hnode]layer\i\j.south) {};
-        \node[anchor=north west,inner sep=4pt,font=\small] () at (attn\i\j.north west) {Attention};
+        \node[anchor=north west,inner sep=4pt,font=\small] () at (attn\i\j.north west) {注意力};
        \node[anchor=south,inner sep=0pt] (out\i\j) at ([yshift=0.3\hseg]attn\i\j.north) {$\cdots$};
        \node[thinnode,anchor=south west,thick,draw=dblue!\q,text=black] (q\i\j) at ([xshift=0.1\wseg,yshift=0.2\hseg]attn\i\j.south west) {$Q^m$};
@@ -119,9 +119,9 @@
    \draw[->,thick] ([yshift=-0.15em]dot3\i.north) -- ([yshift=-0.3em]attn2\i.south);
 }
-\node[anchor=north,align=left,inner sep=1pt,font=\footnotesize] () at (dot31.south) {(a) Standard Transformer Attention};
+\node[anchor=north,align=left,inner sep=1pt,font=\footnotesize] () at (dot31.south) {(a) 标准的多层自注意力};
-\node[anchor=north,align=left,inner sep=1pt,font=\footnotesize] () at (dot32.south) {(b) \textsc{San} Self-Attention};
+\node[anchor=north,align=left,inner sep=1pt,font=\footnotesize] () at (dot32.south) {(b) 共享自注意力};
-\node[anchor=north,align=left,inner sep=1pt,font=\footnotesize] () at (dot33.south) {(c) \textsc{San} Encoder-Decoder Attention};
+\node[anchor=north,align=left,inner sep=1pt,font=\footnotesize] () at (dot33.south) {(c) 共享编码-解码注意力};
 \end{scope}
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter14/Figures/figure-hypothesis-generation.tex
+++ b/Chapter14/Figures/figure-hypothesis-generation.tex
@@ -42,7 +42,7 @@
    \node [] () at ([yshift=0.4cm]output3.north) {$\vdots$};
 \end{scope}
-\node [align=center,anchor=north,font=\small] () at ([yshift=-0.3cm]MULTIPLE.south) {\footnotesize{(a) 多系统输出结构融合}};
+\node [align=center,anchor=north,font=\small] () at ([yshift=-0.3cm]MULTIPLE.south) {\footnotesize{(a) 多系统输出结果融合}};
 \node [align=center,anchor=north,font=\small] () at ([yshift=-0.3cm]SINGLE.south) {\footnotesize{(b) 单系统多输出结果融合}};
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter14/Figures/figure-main-module.tex
+++ b/Chapter14/Figures/figure-main-module.tex
 \begin{tikzpicture}
 %左
-\node [anchor=west,draw=black!70,rounded corners,drop shadow,very thick,minimum width=6em,minimum height=3.5em,fill=blue!15,align=center,text=black] (part1) at (0,0) {\scriptsize{预测模块} \\ \tiny{(RNN/Transsformer)}};
+\node [anchor=west,draw=black!70,rounded corners,drop shadow,very thick,minimum width=6em,minimum height=3.5em,fill=blue!15,align=center,text=black] (part1) at (0,0) {\scriptsize{预测模块}};
-\node [anchor=south] (text) at ([xshift=0.5em,yshift=-3.5em]part1.south) {\scriptsize{源语言句子（编码）}};
+\node [anchor=south] (text) at ([xshift=0.5em,yshift=-3.5em]part1.south) {\scriptsize{源语言句子（编码器输出）}};
 \node [anchor=east,draw=black!70,rounded corners,drop shadow,very thick,minimum width=6em,minimum height=3.5em,fill=blue!15,align=center,text=black] (part2) at ([xshift=10em]part1.east) {\scriptsize{搜索模块}};
 \node [anchor=south] (text1) at ([xshift=0.5em,yshift=2.2em]part1.north) {\scriptsize{已经生成的目标语单词}};

--- a/Chapter14/Figures/figure-multi-modality.tex
+++ b/Chapter14/Figures/figure-multi-modality.tex
@@ -46,7 +46,7 @@
 \node[p,anchor=south,minimum height=0.6em] (w3_7) at ([xshift=0.3em]w3_6.south east){};
 \node[p,anchor=south,minimum height=0.8em] (w3_8) at ([xshift=0.3em]w3_7.south east){};
-\node[inner sep=0pt,font=\scriptsize] at ([yshift=0.3em]w1_2.north){thanks};
+\node[inner sep=0pt,font=\scriptsize] at ([yshift=0.3em]w1_2.north){Thanks};
 \node[inner sep=0pt,font=\scriptsize] at ([yshift=0.3em]w2_2.north){a};
 \node[inner sep=0pt,font=\scriptsize] at ([yshift=0.3em]w2_5.north){to};
 \node[inner sep=0pt,font=\scriptsize] at ([yshift=0.3em]w3_2.north){lot};
@@ -74,10 +74,10 @@
 \draw[-latex, very thick, ublue] ([yshift=-1.2em]box2.south) -- (box2.south);
 \draw[-latex, very thick, ublue] ([yshift=-1.2em]box3.south) -- (box3.south);
-\node[tgt,anchor=west] (tgt1) at ([xshift=2em]box3.east) {thanks a lot};
+\node[tgt,anchor=west,align=left] (tgt1) at ([xshift=2em]box3.east) {Thanks a lot};
-\node[tgt,,anchor=north](tgt2) at ([yshift=-1em]tgt1.south) {thanks to you};
+\node[tgt,,anchor=north,align=left](tgt2) at ([yshift=-1em]tgt1.south) {Thanks to you};
-\node[tgt,,anchor=north] (tgt3) at ([yshift=-1em]tgt2.south) {thanks a you};
+\node[tgt,,anchor=north,align=left] (tgt3) at ([yshift=-1em]tgt2.south) {Thanks a you};
-\node[tgt,,anchor=north] (tgt4) at ([yshift=-1em]tgt3.south) {thanks to lot};
+\node[tgt,,anchor=north,align=left] (tgt4) at ([yshift=-1em]tgt3.south) {Thanks to lot};
 \node[text=ugreen] at ([xshift=1em]tgt1.east){\ding{51}};
 \node[text=ugreen] at ([xshift=1em]tgt2.east){\ding{51}};
 \node[text=red] at ([xshift=1em]tgt3.east){\ding{55}};

--- a/Chapter14/Figures/figure-non-autoregressive.tex
+++ b/Chapter14/Figures/figure-non-autoregressive.tex
@@ -50,7 +50,7 @@
 \draw[-latex, very thick, ublue] (encoder.east) -- (attention.west);
 \draw[-latex, very thick, ublue] (attention.east) -- (decoder.west);
-\draw[decorate,decoration={brace, mirror},ublue, very thick] ([yshift=-0.4em]de1.-135) -- node[font=\scriptsize,text=black,yshift=-1em]{predict traget length \& get position embedding}([yshift=-0.4em]de6.-45);
+\draw[decorate,decoration={brace, mirror},ublue, very thick] ([yshift=-0.4em]de1.-135) -- node[font=\scriptsize,text=black,yshift=-1em]{预测译文长度 \& 计算位置编码}([yshift=-0.4em]de6.-45);
 %\begin{pgfonlayer}{background}
 %{

--- a/Chapter14/chapter14.tex
+++ b/Chapter14/chapter14.tex
--- a/Chapter15/Figures/figure-attention-distribution-based-on-gaussian-distribution.tex
+++ b/Chapter15/Figures/figure-attention-distribution-based-on-gaussian-distribution.tex
+%%%------------------------------------------------------------------------------------------------------------
+%%% 调序模型1：基于距离的调序
+\begin{center}
+\begin{tikzpicture}
+\tikzstyle{cirnode}=[circle,inner sep=4pt,draw]
+\tikzstyle{colnode}=[fill=ugreen!30,inner sep=0.1pt]
+\tikzstyle{show}=[fill=red,circle,inner sep=0.5pt]
+\begin{scope}
+\node [anchor=north west,cirnode] (c1) at (0, 0) {};
+\draw[-,dotted] ([xshift=-1em,yshift=-0.5em]c1.south)--([xshift=9.3em,yshift=-0.5em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-2em]c1.south)--([xshift=9.3em,yshift=-2em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-3.5em]c1.south)--([xshift=9.3em,yshift=-3.5em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-5em]c1.south)--([xshift=9.3em,yshift=-5em]c1.south);
+\node [anchor=north,cirnode] (c2) at ([xshift=0em,yshift=-5.5em]c1.south) {};
+\node [anchor=west,cirnode] (c3) at ([xshift=0.6em,yshift=0em]c2.east) {};
+\node [anchor=west,cirnode] (c4) at ([xshift=0.6em,yshift=0em]c3.east) {};
+\node [anchor=west,cirnode] (c5) at ([xshift=0.6em,yshift=0em]c4.east) {};
+\node [anchor=west,cirnode] (c6) at ([xshift=0.6em,yshift=0em]c5.east) {};
+\node [anchor=west,cirnode] (c7) at ([xshift=0.6em,yshift=0em]c6.east) {};
+\node [anchor=south,colnode,minimum height=1.6em,minimum width=1em] (b1) at ([xshift=0em,yshift=0.5em]c2.north) {};
+\node [anchor=south,colnode,minimum height=4.1em,minimum width=1em] (b2) at ([xshift=0em,yshift=0.5em]c3.north) {};
+\node [anchor=south,colnode,minimum height=0.8em,minimum width=1em] (b3) at ([xshift=0em,yshift=0.5em]c4.north) {};
+\node [anchor=south,colnode,minimum height=0.4em,minimum width=1em] (b4) at ([xshift=0em,yshift=0.5em]c5.north) {};
+\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b5) at ([xshift=0em,yshift=0.5em]c6.north) {};
+\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b6) at ([xshift=0em,yshift=0.5em]c7.north) {};
+{\scriptsize
+\node [anchor=center] (n1) at ([xshift=0em,yshift=-1em]c2.south){\color{orange}Bush};
+\node [anchor=west] (n2) at ([xshift=-0.2em,yshift=0em]n1.east){\color{ugreen!30}held};
+\node [anchor=west] (n3) at ([xshift=0.35em,yshift=0em]n2.east){a};
+\node [anchor=west] (n4) at ([xshift=0.5em,yshift=0em]n3.east){talk};
+\node [anchor=west] (n5) at ([xshift=-0.3em,yshift=0em]n4.east){with};
+\node [anchor=west] (n6) at ([xshift=-0.3em,yshift=0em]n5.east){Sharon};
+}
+\node [anchor=north] (l1) at ([xshift=1em,yshift=-1em]n3.south){\small {(a)原始分布}};
+\draw[-,very thick] ([xshift=1.2em,yshift=2.4em]b6.north)--([xshift=1.7em,yshift=1.9em]b6.north);
+\draw[-,very thick] ([xshift=1.2em,yshift=1.9em]b6.north)--([xshift=1.7em,yshift=2.4em]b6.north);
+\end{scope}
+\begin{scope}[xshift=4.4cm,yshift=0em]
+\node [anchor=north west,circle,inner sep=4pt] (c1) at (0, 0) {};
+\draw[-,dotted] ([xshift=-1em,yshift=-0.5em]c1.south)--([xshift=9.3em,yshift=-0.5em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-2em]c1.south)--([xshift=9.3em,yshift=-2em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-3.5em]c1.south)--([xshift=9.3em,yshift=-3.5em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-5em]c1.south)--([xshift=9.3em,yshift=-5em]c1.south);
+\node [anchor=north,cirnode] (c2) at ([xshift=0em,yshift=-5.5em]c1.south) {};
+\node [anchor=west,cirnode] (c3) at ([xshift=0.6em,yshift=0em]c2.east) {};
+\node [anchor=west,cirnode] (c4) at ([xshift=0.6em,yshift=0em]c3.east) {};
+\node [anchor=west,cirnode] (c5) at ([xshift=0.6em,yshift=0em]c4.east) {};
+\node [anchor=west,cirnode] (c6) at ([xshift=0.6em,yshift=0em]c5.east) {};
+\node [anchor=west,cirnode] (c7) at ([xshift=0.6em,yshift=0em]c6.east) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=1.6em,minimum width=1em] (b1) at ([xshift=0em,yshift=0.5em]c2.north) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=4.1em,minimum width=1em] (b2) at ([xshift=0em,yshift=0.5em]c3.north) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=0.8em,minimum width=1em] (b3) at ([xshift=0em,yshift=0.5em]c4.north) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=0.4em,minimum width=1em] (b4) at ([xshift=0em,yshift=0.5em]c5.north) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=0.15em,minimum width=1em] (b5) at ([xshift=0em,yshift=0.5em]c6.north) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=0.15em,minimum width=1em] (b6) at ([xshift=0em,yshift=0.5em]c7.north) {};
+{\scriptsize
+\node [anchor=center] (n1) at ([xshift=0em,yshift=-1em]c2.south){\color{orange}Bush};
+\node [anchor=west] (n2) at ([xshift=-0.2em,yshift=0em]n1.east){held};
+\node [anchor=west] (n3) at ([xshift=0.35em,yshift=0em]n2.east){a};
+\node [anchor=west] (n4) at ([xshift=0.5em,yshift=0em]n3.east){\color{blue!60}talk};
+\node [anchor=west] (n5) at ([xshift=-0.3em,yshift=0em]n4.east){with};
+\node [anchor=west] (n6) at ([xshift=-0.3em,yshift=0em]n5.east){Sharon};
+}
+%\node [anchor=west,show] (s1) at (-0.5em,-5.7em){};
+%\node [anchor=west,show] (s2) at (-0.2em,-5.6em){};
+%\node [anchor=west,show] (s3) at (1.1em,-5.5em){};
+%\node [anchor=west,show] (s11) at (1.9em,-5em){};
+%\node [anchor=west,show] (s4) at (3em,-4em){};
+%\node [anchor=west,show] (s5) at (3.7em,-3em){};
+%\node [anchor=west,show] (s12) at (4.2em,-2.2em){};
+%\node [anchor=west,show] (s6) at (5.3em,-1.4em){};
+%\node [anchor=west,show] (s7) at (6.3em,-2em){};
+%\node [anchor=west,show] (s13) at (7.4em,-3em){};
+%\node [anchor=west,show] (s8) at (8.4em,-3.9em){};
+%\node [anchor=west,show] (s9) at (8.9em,-4.5em){};
+%\node [anchor=west,show] (s10) at (9.3em,-5em){};
+%\draw[-,blue!60,thick] (-0.5em,-5.7em)..controls (-0.2em,-5.6em) and (1.1em,-5.5em)..(1.9em,-5em)..controls (3em,-4em) and (3.7em,-3em)..(4.2em,-2.2em)..controls (5.3em,-1.4em) and (6.3em,-2em)..(7.4em,-3em)..controls (8.4em,-3.9em) and (8.9em,-4.5em)..(9.3em,-5em);
+%\draw[-,blue!60,thick] (-1em,-6em)..controls (0em,-5.7em) and (1em,-5em)..(1.6em,-4.3em)..controls (5.3em,1em) and (7.4em,-2em)..(9.3em,-5em);
+\draw[-,blue!60,thick] ([xshift=-1em,yshift=-4.7em]c1.south)..controls (3.8em,-6em) and (3.9em,3.6em)..([xshift=9.3em,yshift=-4.3em]c1.south);
+\node [anchor=north] (l1) at ([xshift=1em,yshift=-1em]n3.south){\small {(b)高斯分布}};
+\draw[-,very thick] ([xshift=1.2em,yshift=2.2em]b6.north)--([xshift=1.7em,yshift=2.2em]b6.north);
+\draw[-,very thick] ([xshift=1.2em,yshift=1.9em]b6.north)--([xshift=1.7em,yshift=1.9em]b6.north);
+\node [anchor=south] (t1) at ([xshift=0em,yshift=6.7em]n4.north){$D$};
+\draw[->] ([xshift=0em,yshift=0em]t1.west)--([xshift=-1em,yshift=0em]t1.west);
+\draw[->] ([xshift=0em,yshift=0em]t1.east)--([xshift=1em,yshift=0em]t1.east);
+\draw[-] ([xshift=1em,yshift=-0.5em]t1.east)--([xshift=1em,yshift=0.5em]t1.east);
+\draw[-] ([xshift=-1em,yshift=-0.5em]t1.west)--([xshift=-1em,yshift=0.5em]t1.west);
+\end{scope}
+\begin{scope}[xshift=8.8cm,yshift=0em]
+\node [anchor=north west,cirnode] (c1) at (0, 0) {};
+\draw[-,dotted] ([xshift=-1em,yshift=-0.5em]c1.south)--([xshift=9.3em,yshift=-0.5em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-2em]c1.south)--([xshift=9.3em,yshift=-2em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-3.5em]c1.south)--([xshift=9.3em,yshift=-3.5em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-5em]c1.south)--([xshift=9.3em,yshift=-5em]c1.south);
+\node [anchor=north,cirnode] (c2) at ([xshift=0em,yshift=-5.5em]c1.south) {};
+\node [anchor=west,cirnode] (c3) at ([xshift=0.6em,yshift=0em]c2.east) {};
+\node [anchor=west,cirnode] (c4) at ([xshift=0.6em,yshift=0em]c3.east) {};
+\node [anchor=west,cirnode] (c5) at ([xshift=0.6em,yshift=0em]c4.east) {};
+\node [anchor=west,cirnode] (c6) at ([xshift=0.6em,yshift=0em]c5.east) {};
+\node [anchor=west,cirnode] (c7) at ([xshift=0.6em,yshift=0em]c6.east) {};
+\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b1) at ([xshift=0em,yshift=0.5em]c2.north) {};
+\node [anchor=south,colnode,minimum height=4.2em,minimum width=1em] (b2) at ([xshift=0em,yshift=0.5em]c3.north) {};
+\node [anchor=south,colnode,minimum height=3.7em,minimum width=1em] (b3) at ([xshift=0em,yshift=0.5em]c4.north) {};
+\node [anchor=south,colnode,minimum height=4.2em,minimum width=1em] (b4) at ([xshift=0em,yshift=0.5em]c5.north) {};
+\node [anchor=south,colnode,minimum height=0.8em,minimum width=1em] (b5) at ([xshift=0em,yshift=0.5em]c6.north) {};
+\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b6) at ([xshift=0em,yshift=0.5em]c7.north) {};
+{\scriptsize
+\node [anchor=center] (n1) at ([xshift=0em,yshift=-1em]c2.south){\color{orange}Bush};
+\node [anchor=west] (n2) at ([xshift=-0.2em,yshift=0em]n1.east){\color{ugreen!30}held};
+\node [anchor=west] (n3) at ([xshift=0.35em,yshift=0em]n2.east){\color{ugreen!30}a};
+\node [anchor=west] (n4) at ([xshift=0.5em,yshift=0em]n3.east){\color{ugreen!30}talk};
+\node [anchor=west] (n5) at ([xshift=-0.3em,yshift=0em]n4.east){with};
+\node [anchor=west] (n6) at ([xshift=-0.3em,yshift=0em]n5.east){Sharon};
+}
+\node [anchor=north] (l1) at ([xshift=1em,yshift=-1em]n3.south){\small {(c)修改后的分布}};
+\end{scope}
+\end{tikzpicture}
+\end{center}
\ No newline at end of file
--- a/Chapter15/Figures/figure-convolutional-attention-network.tex
+++ b/Chapter15/Figures/figure-convolutional-attention-network.tex
+\begin{tikzpicture}
+\tikzstyle{elementnode} = [anchor=center,draw,minimum size=0.6em,inner sep=0.1pt,gray!80]
+\begin{scope}[scale=1.0]
+\foreach \i / \j in
+    {0/4, 1/4, 2/4, 3/4, 4/4, 5/4, 6/4, 7/4,
+    0/3, 1/3, 2/3, 3/3, 4/3, 5/3, 6/3, 7/3,
+    0/2, 1/2, 2/2, 3/2, 4/2, 5/2, 6/2, 7/2,
+    0/1, 1/1, 2/1, 3/1, 4/1, 5/1, 6/1, 7/1,
+    0/0, 1/0, 2/0, 3/0, 4/0, 5/0, 6/0, 7/0}
+    \node[elementnode] (a\i\j) at (0.6em*\i,0.6em*\j) {};
+\foreach \i / \j in
+    {0/4, 1/4, 2/4, 3/4, 4/4, 5/4, 6/4, 7/4,
+    0/3, 1/3, 2/3, 3/3, 4/3, 5/3, 6/3, 7/3,
+    0/2, 1/2, 2/2, 3/2, 4/2, 5/2, 6/2, 7/2,
+    0/1, 1/1, 2/1, 3/1, 4/1, 5/1, 6/1, 7/1,
+    0/0, 1/0, 2/0, 3/0, 4/0, 5/0, 6/0, 7/0}
+    \node[elementnode,fill=gray!50] (b\i\j) at (0.6em*\i+5.5em,0.6em*\j) {};
+\node [anchor=south west,minimum height=0.5em,minimum width=4.8em,inner sep=0.1pt,very thick,blue!60,draw] (n1) at ([xshift=0em,yshift=0em]a01.south west) {};
+\node [anchor=north west,minimum height=0.5em,minimum width=4.8em,inner sep=0.1pt,very thick,red!60,draw] (n2) at ([xshift=0em,yshift=0em]a02.north west) {};
+\node [anchor=west,minimum height=0.6em,minimum width=0.6em,inner sep=0.1pt,very thick,blue!60,draw] (n3) at ([xshift=0em,yshift=0em]b21.west) {};
+\node [anchor=west,minimum height=0.6em,minimum width=0.6em,inner sep=0.1pt,very thick,red!60,draw] (n4) at ([xshift=0em,yshift=0em]b42.west) {};
+\draw [-,very thick,dotted,blue!60] ([xshift=0em,yshift=0em]n1.south east) -- ([xshift=0em,yshift=0em]n3.south west);
+\draw [-,very thick,dotted,blue!60] ([xshift=0em,yshift=0em]n1.north east) -- ([xshift=0em,yshift=0em]n3.north west);
+\draw [-,very thick,dotted,red!60] ([xshift=0em,yshift=0em]n2.south east) -- ([xshift=0em,yshift=0em]n4.south west);
+\draw [-,very thick,dotted,red!60] ([xshift=0em,yshift=0em]n2.north east) -- ([xshift=0em,yshift=0em]n4.north west);
+\node [anchor=north] (l1) at ([xshift=0.5em,yshift=-1em]a70.south){\footnotesize {(a)标准自注意力模型}};
+\node [anchor=south,rotate=90] (l2) at ([xshift=0em,yshift=0em]a02.west){\scriptsize {注意力头}};
+\node [anchor=south] (l2) at ([xshift=0em,yshift=0em]a44.north){\scriptsize {句子长度}};
+\end{scope}
+\begin{scope}[scale=1.0,xshift=4.6cm]
+\foreach \i / \j in
+    {0/4, 1/4, 2/4, 3/4, 4/4, 5/4, 6/4, 7/4,
+    0/3, 1/3, 2/3, 3/3, 4/3, 5/3, 6/3, 7/3,
+    0/2, 1/2, 2/2, 3/2, 4/2, 5/2, 6/2, 7/2,
+    0/1, 1/1, 2/1, 3/1, 4/1, 5/1, 6/1, 7/1,
+    0/0, 1/0, 2/0, 3/0, 4/0, 5/0, 6/0, 7/0}
+    \node[elementnode] (a\i\j) at (0.6em*\i,0.6em*\j) {};
+\foreach \i / \j in
+    {0/4, 1/4, 2/4, 3/4, 4/4, 5/4, 6/4, 7/4,
+    0/3, 1/3, 2/3, 3/3, 4/3, 5/3, 6/3, 7/3,
+    0/2, 1/2, 2/2, 3/2, 4/2, 5/2, 6/2, 7/2,
+    0/1, 1/1, 2/1, 3/1, 4/1, 5/1, 6/1, 7/1,
+    0/0, 1/0, 2/0, 3/0, 4/0, 5/0, 6/0, 7/0}
+    \node[elementnode,fill=gray!50] (b\i\j) at (0.6em*\i+5.5em,0.6em*\j) {};
+\node [anchor=south west,minimum height=0.5em,minimum width=3em,inner sep=0.1pt,very thick,blue!60,draw] (n1) at ([xshift=0em,yshift=0em]a01.south west) {};
+\node [anchor=north west,minimum height=0.5em,minimum width=3em,inner sep=0.1pt,very thick,red!60,draw] (n2) at ([xshift=0em,yshift=0em]a22.north west) {};
+\node [anchor=west,minimum height=0.6em,minimum width=0.6em,inner sep=0.1pt,very thick,blue!60,draw] (n3) at ([xshift=0em,yshift=0em]b21.west) {};
+\node [anchor=west,minimum height=0.6em,minimum width=0.6em,inner sep=0.1pt,very thick,red!60,draw] (n4) at ([xshift=0em,yshift=0em]b42.west) {};
+\draw [-,very thick,dotted,blue!60] ([xshift=0em,yshift=0em]n1.south east) -- ([xshift=0em,yshift=0em]n3.south west);
+\draw [-,very thick,dotted,blue!60] ([xshift=0em,yshift=0em]n1.north east) -- ([xshift=0em,yshift=0em]n3.north west);
+\draw [-,very thick,dotted,red!60] ([xshift=0em,yshift=0em]n2.south east) -- ([xshift=0em,yshift=0em]n4.south west);
+\draw [-,very thick,dotted,red!60] ([xshift=0em,yshift=0em]n2.north east) -- ([xshift=0em,yshift=0em]n4.north west);
+\node [anchor=north] (l1) at ([xshift=0.5em,yshift=-1em]a70.south){\footnotesize {(b)一维卷积注意力模型}};
+\node [anchor=south,rotate=90] (l2) at ([xshift=0em,yshift=0em]a02.west){\scriptsize {注意力头}};
+\node [anchor=south] (l2) at ([xshift=0em,yshift=0em]a44.north){\scriptsize {句子长度}};
+\end{scope}
+\begin{scope}[scale=1.0,xshift=9.2cm]
+\foreach \i / \j in
+    {0/4, 1/4, 2/4, 3/4, 4/4, 5/4, 6/4, 7/4,
+    0/3, 1/3, 2/3, 3/3, 4/3, 5/3, 6/3, 7/3,
+    0/2, 1/2, 2/2, 3/2, 4/2, 5/2, 6/2, 7/2,
+    0/1, 1/1, 2/1, 3/1, 4/1, 5/1, 6/1, 7/1,
+    0/0, 1/0, 2/0, 3/0, 4/0, 5/0, 6/0, 7/0}
+    \node[elementnode] (a\i\j) at (0.6em*\i,0.6em*\j) {};
+\foreach \i / \j in
+    {0/4, 1/4, 2/4, 3/4, 4/4, 5/4, 6/4, 7/4,
+    0/3, 1/3, 2/3, 3/3, 4/3, 5/3, 6/3, 7/3,
+    0/2, 1/2, 2/2, 3/2, 4/2, 5/2, 6/2, 7/2,
+    0/1, 1/1, 2/1, 3/1, 4/1, 5/1, 6/1, 7/1,
+    0/0, 1/0, 2/0, 3/0, 4/0, 5/0, 6/0, 7/0}
+    \node[elementnode,fill=gray!50] (b\i\j) at (0.6em*\i+5.5em,0.6em*\j) {};
+\node [anchor=south west,minimum height=1.8em,minimum width=3em,inner sep=0.1pt,very thick,blue!60,draw] (n1) at ([xshift=0em,yshift=0em]a00.south west) {};
+\node [anchor=north west,minimum height=1.8em,minimum width=3em,inner sep=0.1pt,very thick,red!60,draw] (n2) at ([xshift=0em,yshift=0em]a23.north west) {};
+\node [anchor=west,minimum height=0.6em,minimum width=0.6em,inner sep=0.1pt,very thick,blue!60,draw] (n3) at ([xshift=0em,yshift=0em]b21.west) {};
+\node [anchor=west,minimum height=0.6em,minimum width=0.6em,inner sep=0.1pt,very thick,red!60,draw] (n4) at ([xshift=0em,yshift=0em]b42.west) {};
+\draw [-,very thick,dotted,blue!60] ([xshift=0em,yshift=0em]n1.south east) -- ([xshift=0em,yshift=0em]n3.south west);
+\draw [-,very thick,dotted,blue!60] ([xshift=0em,yshift=0em]n1.north east) -- ([xshift=0em,yshift=0em]n3.north west);
+\draw [-,very thick,dotted,red!60] ([xshift=0em,yshift=0em]n2.south east) -- ([xshift=0em,yshift=0em]n4.south west);
+\draw [-,very thick,dotted,red!60] ([xshift=0em,yshift=0em]n2.north east) -- ([xshift=0em,yshift=0em]n4.north west);
+\node [anchor=north] (l1) at ([xshift=0.5em,yshift=-1em]a70.south){\footnotesize {(c)二维卷积注意力模型}};
+\node [anchor=south,rotate=90] (l2) at ([xshift=0em,yshift=0em]a02.west){\scriptsize {注意力头}};
+\node [anchor=south] (l2) at ([xshift=0em,yshift=0em]a44.north){\scriptsize {句子长度}};
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/Figures/figure-encoder-structure-of-transformer-model-optimized-by-nas.tex
+++ b/Chapter15/Figures/figure-encoder-structure-of-transformer-model-optimized-by-nas.tex
+\begin{tikzpicture}
+	\tikzstyle{unit}=[draw,rounded corners=2pt,drop shadow,font=\tiny]
+%left
+\begin{scope}
+\foreach \x/\d in {1/2em, 2/8em, 3/18em, 4/24em}
+	\node[unit,fill=yellow!20] at (0,\d) (ln_\x) {层正则};
+\foreach \x/\d in {1/4em, 2/20em}
+	\node[unit,fill=green!20] at (0,\d) (sa_\x) {8头自注意力：512};
+\foreach \x/\d in {1/6em, 2/16em, 3/22em, 4/32em}
+	\node[draw,circle,minimum size=1em,inner sep=1pt] at (0,\d) (add_\x) {\scriptsize\bfnew{+}};
+\foreach \x/\d in {2/14em, 4/30em}
+	\node[unit,fill=red!20] at (0,\d) (conv_\x) {卷积$1 \times 1$：512};
+\foreach \x/\d in {1/10em,3/26em}
+	\node[unit,fill=red!20] at (0,\d) (conv_\x) {卷积$1 \times 1$：2048};
+\foreach \x/\d in {1/12em, 2/28em}
+	\node[unit,fill=blue!20] at (0,\d) (relu_\x) {RELU};
+\draw[->,thick] ([yshift=-1.4em]ln_1.-90) -- ([yshift=-0.1em]ln_1.-90);
+\draw[->,thick] ([yshift=0.1em]ln_1.90) -- ([yshift=-0.1em]sa_1.-90);
+\draw[->,thick] ([yshift=0.1em]sa_1.90) -- ([yshift=-0.1em]add_1.-90);
+\draw[->,thick] ([yshift=0.1em]add_1.90) -- ([yshift=-0.1em]ln_2.-90);
+\draw[->,thick] ([yshift=0.1em]ln_2.90) -- ([yshift=-0.1em]conv_1.-90);
+\draw[->,thick] ([yshift=0.1em]conv_1.90) -- ([yshift=-0.1em]relu_1.-90);
+\draw[->,thick] ([yshift=0.1em]relu_1.90) -- ([yshift=-0.1em]conv_2.-90);
+\draw[->,thick] ([yshift=0.1em]conv_2.90) -- ([yshift=-0.1em]add_2.-90);
+\draw[->,thick] ([yshift=0.1em]add_2.90) -- ([yshift=-0.1em]ln_3.-90);
+\draw[->,thick] ([yshift=0.1em]ln_3.90) -- ([yshift=-0.1em]sa_2.-90);
+\draw[->,thick] ([yshift=0.1em]sa_2.90) -- ([yshift=-0.1em]add_3.-90);
+\draw[->,thick] ([yshift=0.1em]add_3.90) -- ([yshift=-0.1em]ln_4.-90);
+\draw[->,thick] ([yshift=0.1em]ln_4.90) -- ([yshift=-0.1em]conv_3.-90);
+\draw[->,thick] ([yshift=0.1em]conv_3.90) -- ([yshift=-0.1em]relu_2.-90);
+\draw[->,thick] ([yshift=0.1em]relu_2.90) -- ([yshift=-0.1em]conv_4.-90);
+\draw[->,thick] ([yshift=0.1em]conv_4.90) -- ([yshift=-0.1em]add_4.-90);
+\draw[->,thick] ([yshift=0.1em]add_4.90) -- ([yshift=1em]add_4.90);
+\draw[->,thick] ([yshift=-0.8em]ln_1.-90) .. controls ([xshift=5em,yshift=-0.8em]ln_1.-90) and ([xshift=5em]add_1.0) .. (add_1.0);
+\draw[->,thick] (add_1.0) .. controls ([xshift=5em]add_1.0) and ([xshift=5em]add_2.0) .. (add_2.0);
+\draw[->,thick] (add_2.0) .. controls ([xshift=5em]add_2.0) and ([xshift=5em]add_3.0) .. (add_3.0);
+\draw[->,thick] (add_3.0) .. controls ([xshift=5em]add_3.0) and ([xshift=5em]add_4.0) .. (add_4.0);
+\node[font=\scriptsize] at (0em, -1em){(a) Transformer编码器中若干块的结构};
+\end{scope}
+%right
+\begin{scope}[xshift=14em]
+\foreach \x/\d in {1/2em, 2/8em, 3/16em, 4/22em, 5/28em}
+	\node[unit,fill=yellow!20] at (0,\d) (ln_\x) {层正则};
+\node[unit,fill=green!20] at (0,24em) (sa_1) {8头自注意力：512};
+\foreach \x/\d in {1/6em, 2/14em, 3/20em, 4/26em, 5/36em}
+	\node[draw,circle,minimum size=1em,inner sep=1pt] at (0,\d) (add_\x) {\scriptsize\bfnew{+}};
+\node[unit,fill=red!20] at (0,30em) (conv_4) {卷积$1 \times 1$：2048};
+\node[unit,fill=red!20] at (0,34em) (conv_5) {卷积$1 \times 1$：512};
+\node[unit,fill=blue!20] at (0,32em) (relu_3) {RELU};
+\node[unit,fill=red!20] at (0,4em) (glu_1) {门控线性单元：512};
+\node[unit,fill=red!20] at (-3em,10em) (conv_1) {卷积$1 \times 1$：2048};
+\node[unit,fill=cyan!20] at (3em,10em) (conv_2) {卷积$3 \times 1$：256};
+\node[unit,fill=blue!20] at (-3em,12em) (relu_1) {RELU};
+\node[unit,fill=blue!20] at (3em,12em) (relu_2) {RELU};
+\node[unit,fill=cyan!20] at (0em,18em) (conv_3) {Sep卷积$9 \times 1$：256};
+\draw[->,thick] ([yshift=-1.4em]ln_1.-90) -- ([yshift=-0.1em]ln_1.-90);
+\draw[->,thick] ([yshift=0.1em]ln_1.90) -- ([yshift=-0.1em]glu_1.-90);
+\draw[->,thick] ([yshift=0.1em]glu_1.90) -- ([yshift=-0.1em]add_1.-90);
+\draw[->,thick] ([yshift=0.1em]add_1.90) -- ([yshift=-0.1em]ln_2.-90);
+\draw[->,thick] ([,yshift=0.1em]ln_2.135) -- ([yshift=-0.1em]conv_1.-90);
+\draw[->,thick] ([yshift=0.1em]ln_2.45) -- ([yshift=-0.1em]conv_2.-90);
+\draw[->,thick] ([yshift=0.1em]conv_1.90) -- ([yshift=-0.1em]relu_1.-90);
+\draw[->,thick] ([yshift=0.1em]conv_2.90) -- ([yshift=-0.1em]relu_2.-90);
+\draw[->,thick] ([yshift=0.1em]relu_1.90) -- ([yshift=-0.1em]add_2.-135);
+\draw[->,thick] ([yshift=0.1em]relu_2.90) -- ([yshift=-0.1em]add_2.-45);
+\draw[->,thick] ([yshift=0.1em]add_2.90) -- ([yshift=-0.1em]ln_3.-90);
+\draw[->,thick] ([yshift=0.1em]ln_3.90) -- ([yshift=-0.1em]conv_3.-90);
+\draw[->,thick] ([yshift=0.1em]conv_3.90) -- ([yshift=-0.1em]add_3.-90);
+\draw[->,thick] ([yshift=0.1em]add_3.90) -- ([yshift=-0.1em]ln_4.-90);
+\draw[->,thick] ([yshift=0.1em]ln_4.90) -- ([yshift=-0.1em]sa_1.-90);
+\draw[->,thick] ([yshift=0.1em]sa_1.90) -- ([yshift=-0.1em]add_4.-90);
+\draw[->,thick] ([yshift=0.1em]add_4.90) -- ([yshift=-0.1em]ln_5.-90);
+\draw[->,thick] ([yshift=0.1em]ln_5.90) -- ([yshift=-0.1em]conv_4.-90);
+\draw[->,thick] ([yshift=0.1em]conv_4.90) -- ([yshift=-0.1em]relu_3.-90);
+\draw[->,thick] ([yshift=0.1em]relu_3.90) -- ([yshift=-0.1em]conv_5.-90);
+\draw[->,thick] ([yshift=0.1em]conv_5.90) -- ([yshift=-0.1em]add_5.-90);
+\draw[->,thick] ([yshift=0.1em]add_5.90) -- ([yshift=1em]add_5.90);
+\draw[->,thick] ([yshift=-0.8em]ln_1.-90) .. controls ([xshift=5em,yshift=-0.8em]ln_1.-90) and ([xshift=5em]add_1.0) .. (add_1.0);
+\draw[->,thick] (add_1.0) .. controls ([xshift=8em]add_1.0) and ([xshift=8em]add_3.0) .. (add_3.0);
+\draw[->,thick] (add_3.0) .. controls ([xshift=5em]add_3.0) and ([xshift=5em]add_4.0) .. (add_4.0);
+\draw[->,thick] (add_4.0) .. controls ([xshift=5em]add_4.0) and ([xshift=5em]add_5.0) .. (add_5.0);
+\node[font=\scriptsize,align=center] at (0em, -1.5em){(b) 使用结构搜索方法优化后的 \\ Transformer编码器中若干块的结构};
+\node[minimum size=0.8em,inner sep=0pt,rounded corners=1pt,draw,fill=blue!20] (act) at (5.5em, 38em){};
+\node[anchor=west,font=\footnotesize] at ([xshift=0.1em]act.east){激活函数};
+\node[anchor=north,minimum size=0.8em,inner sep=0pt,rounded corners=1pt,draw,fill=yellow!20] (nor) at ([yshift=-0.6em]act.south){};
+\node[anchor=west,font=\footnotesize] at ([xshift=0.1em]nor.east){正则化};
+\node[anchor=north,minimum size=0.8em,inner sep=0pt,rounded corners=1pt,draw,fill=cyan!20] (wc) at ([yshift=-0.6em]nor.south){};
+\node[anchor=west,font=\footnotesize] at ([xshift=0.1em]wc.east){宽卷积};
+\node[anchor=north,minimum size=0.8em,inner sep=0pt,rounded corners=1pt,draw,fill=green!20] (at) at ([yshift=-0.6em]wc.south){};
+\node[anchor=west,font=\footnotesize] (tag) at ([xshift=0.1em]at.east){注意力机制};
+\node[anchor=north,minimum size=0.8em,inner sep=0pt,rounded corners=1pt,draw,fill=red!20] (nsl) at ([yshift=-0.6em]at.south){};
+\node[anchor=west,font=\footnotesize] at ([xshift=0.1em]nsl.east){非空间层};
+\begin{pgfonlayer}{background}
+\node[draw,drop shadow,fill=white][fit=(act)(nsl)(tag)]{};
+\end{pgfonlayer}
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/Figures/figure-encoder-tree-structure-modeling.tex
+++ b/Chapter15/Figures/figure-encoder-tree-structure-modeling.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{hnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=4.5em,rounded corners=5pt,fill=ugreen!30]
+\tikzstyle{tnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=4.5em,rounded corners=5pt,fill=red!30]
+\tikzstyle{wnode}=[inner sep=0mm,minimum height=1.4em,minimum width=4.4em]
+\node [anchor=south west,hnode] (n1) at (0,0) {$\mathbi{h}_1$};
+\node [anchor=west,hnode] (n2) at ([xshift=1em,yshift=0em]n1.east) {$\mathbi{h}_2$};
+\node [anchor=west,hnode] (n3) at ([xshift=1em,yshift=0em]n2.east) {$\mathbi{h}_3$};
+\node [anchor=west,hnode] (n4) at ([xshift=1em,yshift=0em]n3.east) {$\cdots$};
+\node [anchor=west,hnode] (n5) at ([xshift=1em,yshift=0em]n4.east) {$\mathbi{h}_n$};
+\node [anchor=south,tnode] (t1) at ([xshift=2.8em,yshift=1em]n1.north) {$\mathbi{h}_{n+1}$};
+\node [anchor=south,tnode] (t2) at ([xshift=2.8em,yshift=1em]t1.north) {$\mathbi{h}_{n+2}$};
+\node [anchor=south,tnode] (t3) at ([xshift=2.8em,yshift=1em]t2.north) {$\cdots$};
+\node [anchor=south,tnode] (t4) at ([xshift=2.8em,yshift=1em]t3.north) {$\mathbi{h}_{2n-1}$};
+\draw [->,thick] ([xshift=0em,yshift=0em]n1.east) -- ([xshift=0em,yshift=0em]n2.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n2.east) -- ([xshift=0em,yshift=0em]n3.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n3.east) -- ([xshift=0em,yshift=0em]n4.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.east) -- ([xshift=0em,yshift=0em]n5.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n1.north) -- ([xshift=0em,yshift=0em]t1.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n2.north) -- ([xshift=0em,yshift=0em]t1.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]t1.north) -- ([xshift=0em,yshift=0em]t2.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n3.north) -- ([xshift=0em,yshift=0em]t2.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]t2.north) -- ([xshift=0em,yshift=0em]t3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.north) -- ([xshift=0em,yshift=0em]t3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]t3.north) -- ([xshift=0em,yshift=0em]t4.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n5.north) -- ([xshift=0em,yshift=0em]t4.south);
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/Figures/figure-evolution-and-change-of-ml-methods.tex
+++ b/Chapter15/Figures/figure-evolution-and-change-of-ml-methods.tex
+\begin{tikzpicture}
+\node[rounded corners=4pt, minimum width=10.4em, minimum height=7em,fill=yellow!15!gray!15] (box1) at (0em,0em){};
+\node[anchor=west,rounded corners=4pt, minimum width=10.4em, minimum height=7em,fill=yellow!15!gray!15] (box2) at ([xshift=2.8em]box1.east){};
+\node[anchor=west,rounded corners=4pt, minimum width=10.4em, minimum height=7em,fill=yellow!15!gray!15] (box3) at ([xshift=2.8em]box2.east){};
+\draw[densely dotted,line width=1.2pt] ([xshift=0.8em]box1.90) -- ([xshift=0.8em]box1.-90);
+\draw[densely dotted,line width=1.2pt] ([xshift=0.8em]box2.90) -- ([xshift=0.8em]box2.-90);
+\draw[densely dotted,line width=1.2pt] ([xshift=0.8em]box3.90) -- ([xshift=0.8em]box3.-90);
+\node[anchor=west,draw,rounded corners=2pt, minimum width=5em, minimum height=3em,font=\scriptsize,align=center,inner sep=1pt,fill=yellow!10] (n1) at ([xshift=0.5em]box1.west){机器学习算法：\\决策树、支持 \\ 向量机$\cdots$};
+\node[anchor=west,draw,rounded corners=2pt, minimum width=5em, minimum height=3em,font=\scriptsize,align=center,inner sep=1pt,fill=yellow!10] (n2) at ([xshift=0.5em]box2.west){神经网络：\\RNN、CNN、 \\ Transformer$\cdots$};
+\node[anchor=west,draw,rounded corners=2pt, minimum width=5em, minimum height=3em,font=\scriptsize,align=center,inner sep=1pt,fill=ugreen!10] (n3) at ([xshift=0.5em]box3.west){神经网络：\\RNN、CNN、 \\ Transformer$\cdots$};
+\foreach \x/\c in {1/yellow,2/ugreen,3/ugreen}{
+\node[anchor=north,font=\scriptsize,inner ysep=0.1em] (output_\x)at ([xshift=-2.2em,yshift=-0.5em]box\x.north){输出};
+\node[anchor=north,inner ysep=0.1em] at ([xshift=3em,yshift=-0.5em]box\x.north){\scriptsize\bfnew{执行步骤}};
+\node[anchor=south,font=\scriptsize,inner ysep=0.1em,fill=\c!10,rounded corners=2pt] at ([xshift=-2.2em,yshift=0.5em]box\x.south)(input_\x){输入};
+\draw[->,thick] (input_\x.90) -- (n\x.-90);
+\draw[->,thick] (n\x.90) -- (output_\x.-90);
+}
+\node[anchor=east,font=\scriptsize,align=center,inner xsep=0pt] at ([xshift=-0.2em]box1.east){1.特征提取；\\2.模型设计； \\3.实验验证。};
+\node[anchor=east,font=\scriptsize,align=center,inner xsep=0pt] at ([xshift=-0.2em]box2.east){1.模型设计； \\2.实验验证。\\ };
+\node[anchor=east,font=\scriptsize,align=center,inner xsep=0pt] at ([xshift=-0.2em]box3.east){1.实验验证。 \\ \\};
+\node [draw,thick,anchor=west,single arrow,minimum height=1.6em,single arrow head extend=0.4em] at ([xshift=0.6em]box1.east) {};
+\node [draw,thick,anchor=west,single arrow,minimum height=1.6em,single arrow head extend=0.4em] at ([xshift=0.6em]box2.east) {};
+\node[font=\footnotesize, anchor=north] at ([yshift=-0.1em]box1.south){传统机器学习};
+\node[font=\footnotesize, anchor=north] at ([yshift=-0.1em]box2.south){深度学习};
+\node[font=\footnotesize, anchor=north] at ([yshift=-0.1em]box3.south){深度学习\&网络结构搜索};
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/Figures/figure-multi-scale-local-modeling.tex
+++ b/Chapter15/Figures/figure-multi-scale-local-modeling.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{cirnode}=[circle,minimum size=3.7em,draw]
+\tikzstyle{recnode}=[rectangle,rounded corners=2pt,inner sep=0mm,minimum height=1.5em,minimum width=4em,draw]
+\node [anchor=west,cirnode] (n1) at (0, 0) {$\mathbi{h}_{i-2}^l$};
+\node [anchor=west,cirnode] (n2) at ([xshift=1em,yshift=0em]n1.east) {$\mathbi{h}_{i-1}^l$};
+\node [anchor=west,cirnode] (n3) at ([xshift=1em,yshift=0em]n2.east) {$\mathbi{h}_{i}^l$};
+\node [anchor=west,cirnode] (n4) at ([xshift=1em,yshift=0em]n3.east) {$\mathbi{h}_{i+1}^l$};
+\node [anchor=west,cirnode] (n5) at ([xshift=1em,yshift=0em]n4.east) {$\mathbi{h}_{i+2}^l$};
+\node [anchor=center,blue!30,minimum height=4.2em,minimum width=4.5em,very thick,draw] (c1) at ([xshift=0em,yshift=0em]n3.center) {};
+\node [anchor=center,ugreen!30,minimum height=4.9em,minimum width=14.5em,very thick,draw] (c2) at ([xshift=0em,yshift=0em]n3.center) {};
+\node [anchor=center,red!30,minimum height=5.6em,minimum width=24.5em,very thick,draw] (c3) at ([xshift=0em,yshift=0em]n3.center) {};
+\node [anchor=south,recnode] (r1) at ([xshift=0em,yshift=2.5em]n2.north) {$\textrm{head}_1$};
+\node [anchor=south,recnode] (r2) at ([xshift=0em,yshift=2.5em]n3.north) {$\textrm{head}_2$};
+\node [anchor=south,recnode] (r3) at ([xshift=0em,yshift=2.5em]n4.north) {$\textrm{head}_3$};
+\node [anchor=south,cirnode] (n6) at ([xshift=0em,yshift=1em]r2.north) {$\mathbi{h}_{i}^{l+1}$};
+\draw [->,very thick,blue!30] ([xshift=0em,yshift=0em]c1.north) -- ([xshift=0em,yshift=0em]r2.south);
+\draw [->,very thick,ugreen!30] ([xshift=4.73em,yshift=0em]c2.north) -- ([xshift=0em,yshift=0em]r3.south);
+\draw [->,very thick,red!30] ([xshift=-4.73em,yshift=0em]c3.north) -- ([xshift=0em,yshift=0em]r1.south);
+\draw [->] ([xshift=0em,yshift=0em]r1.north) -- ([xshift=0em,yshift=0em]n6.south west);
+\draw [->] ([xshift=0em,yshift=0em]r2.north) -- ([xshift=0em,yshift=0em]n6.south);
+\draw [->] ([xshift=0em,yshift=0em]r3.north) -- ([xshift=0em,yshift=0em]n6.south east);
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/Figures/figure-transparent-attention-mechanism.tex
+++ b/Chapter15/Figures/figure-transparent-attention-mechanism.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{encnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=4.5em,rounded corners=5pt,thick]
+\tikzstyle{decnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=4.5em,rounded corners=5pt,thick]
+\tikzstyle{cnode}=[rectangle,draw=teal!80, inner sep=0mm,minimum height=1.4em,minimum width=4.4em,fill=teal!17,rounded corners=2pt,thick]
+\node [anchor=north,encnode] (n1) at (0, 0) {编码器};
+\node [anchor=north,rectangle,minimum height=1.5em,minimum width=2.5em,rounded corners=5pt] (n2) at ([xshift=0em,yshift=-0.2em]n1.south) {$\mathbi{X}$};
+\node [anchor=west,encnode,draw=red!60!black!80,fill=red!20] (n3) at ([xshift=1.5em,yshift=0em]n2.east) {$\mathbi{h}_0$};
+\node [anchor=west,encnode,draw=red!60!black!80,fill=red!20] (n4) at ([xshift=1.5em,yshift=0em]n3.east) {$\mathbi{h}_1$};
+\node [anchor=west,encnode,draw=red!60!black!80,fill=red!20] (n5) at ([xshift=1.5em,yshift=0em]n4.east) {$\mathbi{h}_2$};
+\node [anchor=west,rectangle,minimum height=1.5em,minimum width=2.5em,rounded corners=5pt] (n6) at ([xshift=1em,yshift=0em]n5.east) {$\ldots$};
+\node [anchor=west,encnode,draw=red!60!black!80,fill=red!20] (n7) at ([xshift=1em,yshift=0em]n6.east) {$\mathbi{h}_{N-1}$};
+\node [anchor=north,cnode] (cn1) at ([xshift=0em,yshift=-2.8em]n4.south) {\footnotesize{权重聚合$\mathbi{g}$}};
+\node [anchor=north,cnode,opacity=0.5] (cn2) at ([xshift=0em,yshift=-2.8em]n3.south) {\footnotesize{权重聚合$\mathbi{g}$}};
+\node [anchor=north,cnode,opacity=0.5] (cn3) at ([xshift=0em,yshift=-2.8em]n5.south) {\footnotesize{权重聚合$\mathbi{g}$}};
+\node [anchor=north,cnode,opacity=0.5] (cn4) at ([xshift=0em,yshift=-2.8em]n7.south) {\footnotesize{权重聚合$\mathbi{g}$}};
+\node [anchor=west,decnode] (n9) at ([xshift=0em,yshift=-7.2em]n1.west) {解码器};
+\node [anchor=north,rectangle,minimum height=1.5em,minimum width=2.5em,rounded corners=5pt] (n10) at ([xshift=0em,yshift=-0.2em]n9.south) {$\mathbi{y}_{<j}$};
+\node [anchor=west,decnode,draw=ublue,fill=blue!10] (n11) at ([xshift=1.5em,yshift=0em]n10.east) {$\mathbi{s}_j^0$};
+\node [anchor=west,decnode,draw=ublue,fill=blue!10] (n12) at ([xshift=1.5em,yshift=0em]n11.east) {$\mathbi{s}_j^1$};
+\node [anchor=west,decnode,draw=ublue,fill=blue!10] (n13) at ([xshift=1.5em,yshift=0em]n12.east) {$\mathbi{s}_j^2$};
+\node [anchor=west,rectangle,minimum height=1.5em,minimum width=2.5em,rounded corners=5pt] (n14) at ([xshift=1em,yshift=0em]n13.east) {$\ldots$};
+\node [anchor=west,decnode,draw=ublue,fill=blue!10] (n15) at ([xshift=1em,yshift=0em]n14.east) {$\mathbi{s}_j^{M-1}$};
+\node [anchor=west,rectangle,minimum height=1.5em,minimum width=2.5em,rounded corners=5pt] (n16) at ([xshift=1.5em,yshift=0em]n15.east) {$\mathbi{y}_{j}$};
+\node [anchor=south,minimum height=1.5em,minimum width=2.5em] (n17) at ([xshift=0em,yshift=6em]n16.north) {};
+\node [anchor=west,minimum height=0.5em,minimum width=4em] (n20) at ([xshift=0em,yshift=-2.5em]n2.east) {};
+\node [anchor=west,minimum height=0.5em,minimum width=4em] (n21) at ([xshift=0em,yshift=-3.9em]n2.east) {};
+\node [anchor=north,minimum height=0.5em,minimum width=4em] (n22) at ([xshift=0em,yshift=-0.7em]n11.south) {};
+\begin{pgfonlayer}{background}
+{
+\node[rectangle,inner sep=2pt,fill=blue!7] [fit = (n1) (n7) (n17) (n20)] (bg1) {};
+\node[rectangle,inner sep=2pt,fill=red!7] [fit = (n9) (n16) (n13) (n21) (n22)] (bg2) {};
+}
+\end{pgfonlayer}
+\draw [->,thick] ([xshift=0em,yshift=0em]n2.east) -- ([xshift=0em,yshift=0em]n3.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n3.east) -- ([xshift=0em,yshift=0em]n4.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.east) -- ([xshift=0em,yshift=0em]n5.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n5.east) -- ([xshift=0em,yshift=0em]n6.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n6.east) -- ([xshift=0em,yshift=0em]n7.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n10.east) -- ([xshift=0em,yshift=0em]n11.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n11.east) -- ([xshift=0em,yshift=0em]n12.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n12.east) -- ([xshift=0em,yshift=0em]n13.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n13.east) -- ([xshift=0em,yshift=0em]n14.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n14.east) -- ([xshift=0em,yshift=0em]n15.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n15.east) -- ([xshift=0em,yshift=0em]n16.west);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]cn2.south) -- ([xshift=0em,yshift=0em]n11.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]cn1.south) -- ([xshift=0em,yshift=0em]n12.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]cn3.south) -- ([xshift=0em,yshift=0em]n13.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]cn4.south) -- ([xshift=0em,yshift=0em]n15.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n2.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn2.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n3.south)..controls +(south:0.8em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn2.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n4.south)..controls +(south:0.6em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn2.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n5.south)..controls +(south:0.7em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn2.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n7.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn2.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n2.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn3.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n3.south)..controls +(south:0.8em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn3.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n4.south)..controls +(south:0.6em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn3.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n5.south)..controls +(south:0.7em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn3.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n7.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn3.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n2.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn4.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n3.south)..controls +(south:0.8em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn4.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n4.south)..controls +(south:0.6em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn4.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n5.south)..controls +(south:0.7em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn4.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n7.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn4.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n2.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn1.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n3.south)..controls +(south:0.8em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn1.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.south)..controls +(south:0.6em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn1.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n5.south)..controls +(south:0.7em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn1.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n7.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn1.north);
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/chapter15.tex
+++ b/Chapter15/chapter15.tex
--- a/Chapter16/Figures/figure-bilingual-dictionary-Induction.tex
+++ b/Chapter16/Figures/figure-bilingual-dictionary-Induction.tex
@@ -9,7 +9,7 @@
 \draw [-,thick] (-0.7,1.0)--(-0.7,-1.0);
 \node [anchor=center](c1) at (-0.1,0){\tiny{$\mathbi{Y}$}};
-\node [anchor=center](c2) at (-0.3,-0.7){\tiny{$\mathbi{W}\cdot \mathbi{X}$}};
+\node [anchor=center](c2) at (-0.3,-0.7){\tiny{$\mathbi{W} \mathbi{X}$}};
 \node [anchor=center,red!70](cr1) at (0.65,-0.65){\scriptsize{$\bullet$}}; 
 \node [anchor=center,ublue](cb1) at (0.6,-0.5){\scriptsize{$\bullet$}};
 \node [anchor=center,red!70](cr2) at (1.65,-0.65){\scriptsize{$\bullet$}}; 
@@ -30,7 +30,7 @@
 \draw [-,thick] (-0.7,1.0)--(-0.7,-1.0);
 \node [anchor=center](c1) at (-0.1,0){\tiny{$\mathbi{Y}$}};
-\node [anchor=center](c2) at (-0.3,-0.7){\tiny{$\mathbi{W}\cdot \mathbi{X}$}};
+\node [anchor=center](c2) at (-0.3,-0.7){\tiny{$\mathbi{W} \mathbi{X}$}};
 \node [anchor=center,red!70](cr1) at (0.65,-0.65){\scriptsize{$\bullet$}}; 
 \node [anchor=center,ublue](cb1) at (0.6,-0.5){\scriptsize{$\bullet$}};
 \node [anchor=center,red!70](cr2) at (1.65,-0.65){\scriptsize{$\bullet$}}; 
@@ -136,7 +136,7 @@
 \draw [-,thick] (-0.8,1.0)--(-0.8,-1.0);
 \node [anchor=center](c1) at (0.1,0.6){\tiny{$\mathbi{Y}$}};
-\node [anchor=center](c2) at (-0.45,-0.7){\tiny{$\mathbi{W}\cdot \mathbi{X}$}};
+\node [anchor=center](c2) at (-0.45,-0.7){\tiny{$\mathbi{W} \mathbi{X}$}};
 \node [anchor=center,red!70](cr1) at (0.2,-0.35){\scriptsize{$\bullet$}};
 \node [anchor=center,red!70](cr2) at (1.58,-0.78){\scriptsize{$\bullet$}};

--- a/Chapter16/Figures/figure-contrast-diagram-of-beam-search-topk-and-sampling.tex
+++ b/Chapter16/Figures/figure-contrast-diagram-of-beam-search-topk-and-sampling.tex
-% !Mode:: "TeX:UTF-8"
-% !TEX encoding = UTF-8 Unicode
-\begin{tikzpicture}
-\begin{scope}
-%%%%%%%%%%%左侧源语言
-{\small
-\node [anchor=north west] (dictionarylabel) at (0,0) {源语言};
-\node [anchor=north west] (entry1) at ([xshift=1.0em,yshift=-0.8em]dictionarylabel.south west) {今};
-\node [anchor=north west] (entry2) at ([yshift=0.5em]entry1.south west) {天};
-\node [anchor=north west] (entry3) at ([yshift=0.1em]entry2.south west) {是};
-\node [anchor=north west] (entry4) at ([yshift=0.1em]entry3.south west) {个};
-\node [anchor=north west] (entry5) at ([yshift=0.1em]entry4.south west) {好};
-\node [anchor=north west] (entry6) at ([yshift=0.1em]entry5.south west) {天};
-\node [anchor=north west] (entry7) at ([yshift=0.5em]entry6.south west) {气};
-\node [anchor=north west] (entry8) at ([xshift=0.2em,yshift=0.1em]entry7.south west) {！};
-\node [anchor=center] (pos00) at ([yshift=0.63em]entry1){};
-\node [anchor=center] (pos9) at ([yshift=-1.92em]entry8){};
-}
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,fill=yellow!20,inner sep=0.1em] [fit =(pos00) (entry1) (entry2) (entry3) (entry4) (entry5) (entry6)(entry7)(entry8)(pos9)]  {};
-}
-\end{pgfonlayer}
-\end{scope}
-%%%%%%%%%%%左侧源语言
-%%%%%%%%%%%%%%左侧模型
-\begin{scope}[xshift=4.0em,yshift=0.5em]
-\tikzstyle{neuronnode} = [minimum size=0.1em,circle,draw,black,thick,fill=white]
-\node [anchor=west] (dictionarylabel2) at ([xshift=1.2em]dictionarylabel.east) {\small{模型}};
-\node [anchor=center] (pos1) at ([yshift=-1.65em]dictionarylabel2) {};
-\node [anchor=center,neuronnode] (neuron00) at ([xshift=-1.5em,yshift=-4.7em]dictionarylabel2) {};
-\node [anchor=center,neuronnode] (neuron01) at ([yshift=-1.7em]neuron00) {};
-\node [anchor=center,neuronnode] (neuron02) at ([yshift=-1.7em]neuron01) {};
-\node [anchor=center,neuronnode] (neuron03) at ([yshift=-1.7em]neuron02) {};
-\node [anchor=center,neuronnode] (neuron10) at ([xshift=1.5em,yshift=-1.0em]neuron00) {};
-\node [anchor=center,neuronnode] (neuron11) at ([yshift=-1.7em]neuron10) {};
-\node [anchor=center,neuronnode] (neuron12) at ([yshift=-1.7em]neuron11) {};
-\node [anchor=center,neuronnode] (neuron20) at ([xshift=3em]neuron00) {};
-\node [anchor=center,neuronnode] (neuron21) at ([yshift=-1.7em]neuron20) {};
-\node [anchor=center,neuronnode] (neuron22) at ([yshift=-1.7em]neuron21) {};
-\node [anchor=center,neuronnode] (neuron23) at ([yshift=-1.7em]neuron22) {};
-\node [anchor=center] (pos2) at ([yshift=-3.72em]neuron12) {};
-\draw[-](neuron00.east) -- (neuron10.west);
-\draw[-](neuron00.east) -- (neuron11.west);
-\draw[-](neuron00.east) -- (neuron12.west);
-\draw[-](neuron01.east) -- (neuron10.west);
-\draw[-](neuron01.east) -- (neuron11.west);
-\draw[-](neuron01.east) -- (neuron12.west);
-\draw[-](neuron02.east) -- (neuron10.west);
-\draw[-](neuron02.east) -- (neuron11.west);
-\draw[-](neuron02.east) -- (neuron12.west);
-\draw[-](neuron03.east) -- (neuron10.west);
-\draw[-](neuron03.east) -- (neuron11.west);
-\draw[-](neuron03.east) -- (neuron12.west);
-\draw[-](neuron10.east) -- (neuron20.west);
-\draw[-](neuron11.east) -- (neuron20.west);
-\draw[-](neuron12.east) -- (neuron20.west);
-\draw[-](neuron10.east) -- (neuron21.west);
-\draw[-](neuron11.east) -- (neuron21.west);
-\draw[-](neuron12.east) -- (neuron21.west);
-\draw[-](neuron10.east) -- (neuron22.west);
-\draw[-](neuron11.east) -- (neuron22.west);
-\draw[-](neuron12.east) -- (neuron22.west);
-\draw[-](neuron10.east) -- (neuron23.west);
-\draw[-](neuron11.east) -- (neuron23.west);
-\draw[-](neuron12.east) -- (neuron23.west);
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,fill=gray!20,inner sep=0.1em] [fit = (neuron00) (neuron03) (neuron20) (neuron23)(pos1)(pos2)]  {};
-}
-\end{pgfonlayer}
-\end{scope}
-%%%%%%%%%%%%%%左侧模型
-%%%%%%%%%%%%%%%%%%%预测分布
-\begin{scope}[xshift=11.5em]
-\node [anchor=west] (dictionarylabel3) at ([xshift=9.0em]dictionarylabel.east) {\small{预测分布}};
-\node [anchor=center] (pos31) at ([xshift=-5.0em,yshift=-1.77em]dictionarylabel3) {};
-\node [anchor=center] (pos32) at ([yshift=-3.5em]pos31) {};
-\node [anchor=center] (pos33) at ([xshift=9.5em]pos31) {};
-\node [anchor=center] (pos34) at ([yshift=-3.5em]pos33) {};
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=black,inner sep=0.2em,fill=white,drop shadow] [fit =(pos31)(pos32)(pos33)(pos34)]  (remark1label3-1) {};
-}
-\end{pgfonlayer}
-\node [anchor=center] (pos3-21) at ([xshift=-5.0em,yshift=-8.7em]dictionarylabel3) {};
-\node [anchor=center] (pos3-22) at ([yshift=-4em]pos3-21) {};
-\node [anchor=center] (pos3-23) at ([xshift=9.5em]pos3-21) {};
-\node [anchor=center] (pos3-24) at ([yshift=-4em]pos3-23) {};
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=black,inner sep=0.2em,fill=white,drop shadow] [fit =(pos3-21)(pos3-22)(pos3-23)(pos3-24)]  (remark1label3-2) {};
-}
-\end{pgfonlayer}
-\node [anchor=center,minimum height=2.5em,minimum width=1.0em,fill=orange!30] (cy3-11) at ([xshift=-4.6em,yshift=-4.0em]dictionarylabel3) {};
-\node [anchor=center,minimum height=2.0em,minimum width=1.0em,fill=blue!30] (cy3-12) at ([xshift=1.5em,yshift=-0.25em]cy3-11) {};
-\node [anchor=center,minimum height=3.5em,minimum width=1.0em,fill=black!30] (cy3-13) at ([xshift=1.5em,yshift=0.75em]cy3-12) {};
-\node [anchor=center,minimum height=1.0em,minimum width=1.0em,fill=green!30] (cy3-14) at ([xshift=1.5em,yshift=-1.25em]cy3-13) {};
-\node [anchor=center,minimum height=1.5em,minimum width=1.0em,fill=yellow!30] (cy3-15) at ([xshift=1.5em,yshift=0.25em]cy3-14) {};
-\node [anchor=center,minimum height=0.0em,minimum width=1.0em,fill=gray!30] (cy3-16) at ([xshift=1.5em,yshift=-0.4em]cy3-15) {};
-\node [anchor=center] (cy3-17) at ([xshift=1.5em,yshift=1em]cy3-16) {\tiny{$\cdots$}};
-%%%%%%%%%%%%%%%%%%%%%%%下方图注
-\node [anchor=center,minimum height=0.7em,minimum width=1.0em,fill=orange!30] (cu3-11) at ([yshift=-5.2em]cy3-11) {};
-\node [anchor=west] (cu21) at ([xshift=0.0em]cu3-11.east) {\scriptsize{wonderful}};
-\node [anchor=center,minimum height=0.7em,minimum width=1.0em,fill=green!30] (cu3-12) at ([xshift=5.5em]cu3-11) {};
-\node [anchor=west] (cu22) at ([xshift=0.0em]cu3-12.east) {\scriptsize{brilliant}};
-\node [anchor=center,minimum height=0.7em,minimum width=1.0em,fill=blue!30] (cu3-13) at ([yshift=-1.5em]cu3-11) {};
-\node [anchor=west] (cu23) at ([xshift=0.0em]cu3-13.east) {\scriptsize{great}};
-\node [anchor=center,minimum height=0.7em,minimum width=1.0em,fill=black!30] (cu3-14) at ([yshift=-1.5em]cu3-13) {};
-\node [anchor=west] (cu24) at ([xshift=0.0em]cu3-14.east) {\scriptsize{good}};
-\node [anchor=center,minimum height=0.7em,minimum width=1.0em,fill=yellow!30] (cu3-15) at ([yshift=-1.5em]cu3-12) {};
-\node [anchor=west] (cu25) at ([xshift=0.0em]cu3-15.east) {\scriptsize{sunny}};
-\node [anchor=center,minimum height=0.7em,minimum width=1.0em,fill=gray!30] (cu3-16) at ([yshift=-1.5em]cu3-15) {};
-\node [anchor=west] (cu26) at ([xshift=0.0em]cu3-16.east) {\scriptsize{what}};
-\end{scope}
-%%%%%%%%%%%%%%%%%%%预测分布
-%%%%%%%%%%%%%%%%%%%%%%%解码策略
-\begin{scope}[xshift=22.5em]
-\node [anchor=west] (dictionarylabel4) at ([xshift=20.5em]dictionarylabel.east) {\small{解码策略}};
-%%%%%%%%%%%%%%%%%%%底框1
-\node [anchor=center] (pos4-11) at ([xshift=-4.5em,yshift=-1.77em]dictionarylabel4) {};
-\node [anchor=center] (pos4-12) at ([yshift=-2.5em]pos4-11) {};
-\node [anchor=center] (pos4-13) at ([xshift=9.5em]pos4-11) {};
-\node [anchor=center] (pos4-14) at ([yshift=-2.5em]pos4-13) {};
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=black,inner sep=0.2em,fill=white,drop shadow] [fit =(pos4-11)(pos4-12)(pos4-13)(pos4-14)]  (remark1label41) {};
-}
-\end{pgfonlayer}
-%%%%%%%%%%%%%%%%%%%底框2
-\node [anchor=center] (pos4-212) at ([xshift=-4.5em,yshift=-10.2em]dictionarylabel4) {};
-\node [anchor=center] (pos4-222) at ([yshift=-2.5em]pos4-212) {};
-\node [anchor=center] (pos4-232) at ([xshift=9.5em]pos4-212) {};
-\node [anchor=center] (pos4-242) at ([yshift=-2.5em]pos4-232) {};
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=black,inner sep=0.2em,fill=white,drop shadow] [fit =(pos4-212)(pos4-222)(pos4-232)(pos4-242)]  (remark1label42) {};
-}
-\end{pgfonlayer}
-%%%%%%%%%%%%%%%%%%%底框3
-\node [anchor=center] (pos4-313) at ([xshift=-4.5em,yshift=-6em]dictionarylabel4) {};
-\node [anchor=center] (pos4-323) at ([yshift=-2.5em]pos4-313) {};
-\node [anchor=center] (pos4-333) at ([xshift=9.5em]pos4-313) {};
-\node [anchor=center] (pos4-343) at ([yshift=-2.5em]pos4-333) {};
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=black,inner sep=0.2em,fill=white,drop shadow] [fit =(pos4-313)(pos4-323)(pos4-333)(pos4-343)]  (remark1label43) {};
-}
-\end{pgfonlayer}
-%%%%%%%%%%%%束搜索的虚线框1
-\node [anchor=center] (pos12red11) at ([xshift=0.2em,yshift=-0.65em]pos4-11) {};
-\node [anchor=center] (pos22red11) at ([yshift=-0.85em]pos12red11) {};
-\node [anchor=center] (pos32red11) at ([xshift=1.5em]pos12red11) {};
-\node [anchor=center] (pos42red11) at ([yshift=-0.85em]pos32red11) {};
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=red,inner sep=0.2em,fill=white,dashed] [fit =(pos12red11)(pos22red11)(pos32red11)(pos42red11)]  (remark1labe41-1) {};
-}
-\end{pgfonlayer}
-%%%%%%%%%%%%%%%%束搜索里面内容
-{\scriptsize
-\node [anchor=center] (cy00) at ([xshift=6.7em,yshift=0.2em]pos4-11) {\tiny{束搜索}};
-\node [anchor=center,minimum height=1.8em,minimum width=0.8em,fill=orange!30] (cy11) at ([xshift=-0.0em,yshift=-1.80em]pos4-11) {};
-\node [anchor=center,minimum height=1.5em,minimum width=0.8em,fill=blue!30] (cy12) at ([xshift=1.3em,yshift=-0.15em]cy11) {};
-\node [anchor=center,minimum height=2.5em,minimum width=0.8em,fill=black!30] (cy13) at ([xshift=1.3em,yshift=0.5em]cy12) {};
-\node [anchor=center,minimum height=0.8em,minimum width=0.8em,fill=green!30] (cy14) at ([xshift=1.3em,yshift=-0.85em]cy13) {};
-\node [anchor=center,minimum height=1.2em,minimum width=0.8em,fill=yellow!30] (cy15) at ([xshift=1.3em,yshift=0.2em]cy14) {};
-\node [anchor=center,minimum height=0.0em,minimum width=0.8em,fill=gray!30] (cy16) at ([xshift=1.3em,yshift=-0.25em]cy15) {};
-\node [anchor=center] (cy17) at ([xshift=1.5em,yshift=-0.3em]cy16) {$\cdots$};
-\node [anchor=center] (cy18) at ([xshift=1.1em,yshift=1.2em]cy17) {$\Rightarrow$};
-\node [anchor=center,minimum height=1.8em,minimum width=0.8em,fill=orange!30] (cy19) at ([xshift=1.1em,yshift=-0.35em]cy18) {};
-\node [anchor=center,minimum height=1.5em,minimum width=0.8em,fill=blue!30] (cy110) at ([xshift=1.3em,yshift=-0.15em]cy19) {};
-\node [anchor=center,minimum height=2.5em,minimum width=0.8em,fill=black!30] (cy111) at ([xshift=1.3em,yshift=0.5em]cy110) {};
-\node [anchor=center,color=red] (cy112) at ([xshift=1.35em,yshift=-1.23em]cy110) {\tiny{$\Uparrow$}};
-\node [anchor=center,color=red] (cy113) at ([yshift=-0.55em]cy112) {\tiny{good}};
-}
-%%%%%%%%%%%%束搜索的虚线框2
-\node [anchor=center] (pos12red12) at ([xshift=7.65em,yshift=-0.65em]pos4-11) {};
-\node [anchor=center] (pos22red12) at ([yshift=-0.85em]pos12red12) {};
-\node [anchor=center] (pos32red12) at ([xshift=1.5em]pos12red12) {};
-\node [anchor=center] (pos42red12) at ([yshift=-0.85em]pos32red12) {};
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=red,inner sep=0.2em,fill=white,dashed] [fit =(pos12red12)(pos22red12)(pos32red12)(pos42red12)]  (remark1label) {};
-}
-\end{pgfonlayer}
-%%%%%%%%%%%%topk的虚线框1
-\node [anchor=center] (pos12-2) at ([xshift=0.20em,yshift=-0.65em]pos4-212) {};
-\node [anchor=center] (pos22-2) at ([yshift=-0.85em]pos12-2) {};
-\node [anchor=center] (pos32-2) at ([xshift=1.5em]pos12-2) {};
-\node [anchor=center] (pos42-2) at ([yshift=-0.85em]pos32-2) {};
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=red,inner sep=0.2em,fill=white,dashed] [fit =(pos12-2)(pos22-2)(pos32-2)(pos42-2)]  (remark1label-2) {};
-}
-\end{pgfonlayer}
-{\scriptsize
-\node [anchor=center] (cy00-2) at ([xshift=6.7em,yshift=0.2em]pos4-212) {\tiny{$n$-best}};
-\node [anchor=center,minimum height=1.8em,minimum width=0.8em,fill=orange!30] (cy11-2) at ([xshift=0.0em,yshift=-1.8em]pos4-212) {};
-\node [anchor=center,minimum height=1.5em,minimum width=0.8em,fill=blue!30] (cy12-2) at ([xshift=1.3em,yshift=-0.15em]cy11-2) {};
-\node [anchor=center,minimum height=2.5em,minimum width=0.8em,fill=black!30] (cy13-2) at ([xshift=1.3em,yshift=0.5em]cy12-2) {};
-\node [anchor=center,minimum height=0.8em,minimum width=0.8em,fill=green!30] (cy14-2) at ([xshift=1.3em,yshift=-0.85em]cy13-2) {};
-\node [anchor=center,minimum height=1.2em,minimum width=0.8em,fill=yellow!30] (cy15-2) at ([xshift=1.3em,yshift=0.2em]cy14-2) {};
-\node [anchor=center,minimum height=0.0em,minimum width=0.8em,fill=gray!30] (cy16-2) at ([xshift=1.3em,yshift=-0.25em]cy15-2) {};
-\node [anchor=center] (cy17-2) at ([xshift=1.5em,yshift=-0.3em]cy16-2) {$\cdots$};
-\node [anchor=center] (cy18-2) at ([xshift=1.1em,yshift=1.2em]cy17-2) {$\Rightarrow$};
-\node [anchor=center,minimum height=1.8em,minimum width=0.8em,fill=orange!30] (cy19-2) at ([xshift=1.1em,yshift=-0.35em]cy18-2) {};
-\node [anchor=center,minimum height=1.5em,minimum width=0.8em,fill=blue!30] (cy110-2) at ([xshift=1.3em,yshift=-0.15em]cy19-2) {};
-\node [anchor=center,minimum height=2.5em,minimum width=0.8em,fill=black!30] (cy111-2) at ([xshift=1.3em,yshift=0.5em]cy110-2) {};
-\node [anchor=center,color=red] (cy112-2) at ([xshift=0.0em,yshift=-1.2em]cy110-2) {\tiny{$\Uparrow$}};
-\node [anchor=center,color=red] (cy113-2) at ([yshift=-0.65em]cy112-2) {\tiny{wonderful}};
-}
-%%%%%%%%%%%%topk的虚线框2
-\node [anchor=center] (pos12red22) at ([xshift=7.65em,yshift=-0.65em]pos4-212) {};
-\node [anchor=center] (pos22red22) at ([yshift=-0.85em]pos12red22) {};
-\node [anchor=center] (pos32red22) at ([xshift=1.5em]pos12red22) {};
-\node [anchor=center] (pos42red22) at ([yshift=-0.85em]pos32red22) {};
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=red,inner sep=0.2em,fill=white,dashed] [fit =(pos12red22)(pos22red22)(pos32red22)(pos42red22)]  (remark1label2) {};
-}
-\end{pgfonlayer}
-%%%%%%%%%%%%%%%%%采样里面内容
-{\scriptsize
-\node [anchor=center] (cy00-2) at ([xshift=6.7em,yshift=0.2em]pos4-313) {\tiny{采样}};
-\node [anchor=center,minimum height=1.8em,minimum width=0.8em,fill=orange!30] (cy11-3) at ([xshift=2.8em,yshift=-1.8em]pos4-313) {};
-\node [anchor=center,minimum height=1.5em,minimum width=0.8em,fill=blue!30] (cy12-3) at ([xshift=1.3em,yshift=-0.15em]cy11-3) {};
-\node [anchor=center,minimum height=2.5em,minimum width=0.8em,fill=black!30] (cy13-3) at ([xshift=1.3em,yshift=0.5em]cy12-3) {};
-\node [anchor=center,minimum height=0.8em,minimum width=0.8em,fill=green!30] (cy14-3) at ([xshift=1.3em,yshift=-0.85em]cy13-3) {};
-\node [anchor=center,minimum height=1.2em,minimum width=0.8em,fill=yellow!30] (cy15-3) at ([xshift=1.3em,yshift=0.2em]cy14-3) {};
-\node [anchor=center,minimum height=0.0em,minimum width=0.8em,fill=gray!30] (cy16-3) at ([xshift=1.3em,yshift=-0.25em]cy15-3) {};
-\node [anchor=center] (cy17-3) at ([xshift=1.5em,yshift=-0.3em]cy16-3) {$\cdots$};
-\node [anchor=center,color=red] (cy112-3) at ([xshift=0.0em,yshift=-1.1em]cy15-3) {\tiny$\Uparrow$};
-\node [anchor=center,color=red] (cy113-3) at ([yshift=-0.65em]cy112-3) {\tiny{sunny}};
-}
-\end{scope}
-{\small
-\node [anchor=west] (dictionarylabel) at ([xshift=32.0em]dictionarylabel.east) {目标语言};
-\node [anchor=north west] (entry5-1) at ([xshift=0.70em,yshift=-1.37em]dictionarylabel.south west) {Today};
-\node [anchor=north] (entry5-2) at ([yshift=0.5em]entry5-1.south) {is};
-\node [anchor=north] (entry5-3) at ([yshift=0.1em]entry5-2.south) {a};
-\node [anchor=center] (pos5-0) at ([yshift=1.2em]entry5-1){};
-\node [anchor=center] (pos5-4) at ([yshift=-1.2em]entry5-3){};
-\node [anchor=north,color=red,minimum height=1.2em,minimum width=3.41em,fill=blue!20,inner sep=0.1em] (entry5-5) at ([yshift=-0.4em]pos5-4.south) {？};
-\node [anchor=north,minimum height=1.2em,minimum width=3.15em] (entry5-6) at ([yshift=-0.4em]entry5-5.south) {};
-\node [anchor=north,minimum height=1.2em,minimum width=3.15em] (entry5-7) at ([yshift=-3.23em]entry5-6.south) {};
-}
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,fill=green!20,inner sep=0.1em] [fit =(entry5-1) (entry5-2) (entry5-3)(pos5-0)(pos5-4)]  {};
-}
-\end{pgfonlayer}
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,fill=blue!20,inner sep=0.1em] [fit =(entry5-6) (entry5-7)]  {};
-}
-\end{pgfonlayer}
-\draw [->,thick]([xshift=0.1em,yshift=0.3em]entry5.east) -- ([xshift=1.27em,yshift=0.3em]entry5.east);
-\draw [->,thick]([xshift=5.4859em,yshift=0.3em]entry5.east) --  ([xshift=12em,yshift=0.3em]entry5.east) -- ([yshift=-0.57em]cy3-14.south);
-\draw[-,thick] ([xshift=0.0em]cy3-17.east) ..controls+(east:1.2em) and + (west:1.2em)..([xshift=-0.19em]pos4-11.west);\draw[-,thick] ([xshift=0.0em]cy3-17.east) ..controls+(east:1.2em) and + (west:1.2em)..([yshift=-2.3em,xshift=-0.19em]pos4-212.west);
-\draw [->,thick]([xshift=-1.291em]entry5-5.west) -- ([xshift=0.02em]entry5-5.west);
-\draw [-,thick]([xshift=0.415em]pos4-13.east)--([xshift=0.9em]pos4-13.east)--([xshift=0.9em]pos4-242.east)--([xshift=0.415em]pos4-242.east);
-\end{tikzpicture}
-%---------------------------------------------------------------------
--- a/Chapter16/Figures/figure-cycle-consistency.tex
+++ b/Chapter16/Figures/figure-cycle-consistency.tex
-\begin{tikzpicture}
-\node [rectangle,inner sep=2pt,font=\scriptsize] (center) at (0,0) {};
-\node [rectangle,inner sep=2pt,font=\scriptsize] (top) at ([yshift=3em,xshift=0em]center.north) {
-\begin{tabular}{c}
-翻译模型 \\
-$\textrm{P}(\ \mathbi{y}|\ \mathbi{x})$
-\end{tabular}
-};
-\node [rectangle,inner sep=2pt,font=\scriptsize] (left) at ([yshift=0em,xshift=-4em]center.west) {
-\begin{tabular}{c}
-今天天气真好。
-\end{tabular}
-};
-\node [rectangle,inner sep=2pt,font=\scriptsize] (right) at ([yshift=0em,xshift=4em]center.east) {
-\begin{tabular}{c}
-The weather is \\so good today.
-\end{tabular}
-};
-\node [rectangle,inner sep=2pt,font=\scriptsize] (down) at ([yshift=-3em,xshift=0em]center.south) {
-\begin{tabular}{c}
-翻译模型 \\
-$\textrm{P}(\ \mathbi{x}|\ \mathbi{y})$
-\end{tabular}
-};
-\draw [->,line width=0.8pt] (left.north)  .. controls +(north:0.5) and +(west:0.5) .. (top.west);
-\draw [->,line width=0.8pt] (top.east)  .. controls +(east:0.5) and +(north:0.5) .. (right.north);
-\draw [->,line width=0.8pt] (down.west)  .. controls +(west:0.5) and +(south:0.5) .. (left.south);
-\draw [->,line width=0.8pt] (right.south)  .. controls +(south:0.5) and +(east:0.5) .. (down.east) ;
-\end{tikzpicture}
\ No newline at end of file
--- a/Chapter16/Figures/figure-elmo-model-structure.tex
+++ b/Chapter16/Figures/figure-elmo-model-structure.tex
-\begin{tikzpicture}
-\tikzstyle{embedding} = [line width=0.6pt,draw=black,minimum width=2.5em,minimum height=1.6em,fill=green!20]
-\tikzstyle{model} = [line width=0.6pt,draw=black,minimum width=3.0em,minimum height=1.6em,fill=blue!20,rounded corners=2pt]
-\node [anchor=center,model] (node1-1) at (0,0) {\footnotesize{LSTM}};
-\node [anchor=west,model] (node1-2) at ([xshift=1.8em]node1-1.east) {\footnotesize{LSTM}};
-\node [anchor=west,scale=1.8] (node1-3) at ([xshift=1.0em]node1-2.east) {...};
-\node [anchor=west,model] (node1-4) at ([xshift=1.0em]node1-3.east) {\footnotesize{LSTM}};
-\node [anchor=west,model] (node1-5) at ([xshift=2.0em]node1-4.east) {\footnotesize{LSTM}};
-\node [anchor=west,model] (node1-6) at ([xshift=1.8em]node1-5.east) {\footnotesize{LSTM}};
-\node [anchor=west,scale=1.8] (node1-7) at ([xshift=1.0em]node1-6.east) {...};
-\node [anchor=west,model] (node1-8) at ([xshift=1.0em]node1-7.east) {\footnotesize{LSTM}};
-\node [anchor=south,model](node2-1) at ([yshift=1.8em]node1-1.north){\footnotesize{LSTM}};
-\node [anchor=south,model](node2-2) at ([yshift=1.8em]node1-2.north){\footnotesize{LSTM}};
-\node [anchor=west,scale=1.8](node2-3) at ([xshift=1.0em]node2-2.east){...};
-\node [anchor=south,model](node2-4) at ([yshift=1.8em]node1-4.north){\footnotesize{LSTM}};
-\node [anchor=south,model](node2-5) at ([yshift=1.8em]node1-5.north){\footnotesize{LSTM}};
-\node [anchor=south,model](node2-6) at ([yshift=1.8em]node1-6.north){\footnotesize{LSTM}};
-\node [anchor=west,scale=1.8](node2-7) at ([xshift=1.0em]node2-6.east){...};
-\node [anchor=south,model](node2-8) at ([yshift=1.8em]node1-8.north){\footnotesize{LSTM}};
-\draw [->,thick](node1-1.east)--(node1-2.west);
-\draw [->,thick](node1-2.east)--([xshift=0.5em]node1-3.west);
-\draw [->,thick]([xshift=-0.5em]node1-3.east)--(node1-4.west);
-\draw [<-,thick](node1-5.east)--(node1-6.west);
-\draw [<-,thick](node1-6.east)--([xshift=0.5em]node1-7.west);
-\draw [<-,thick]([xshift=-0.5em]node1-7.east)--(node1-8.west);
-\draw [->,thick](node1-1.north)--(node2-1.south);
-\draw [->,thick](node1-2.north)--(node2-2.south);
-\draw [->,thick](node1-4.north)--(node2-4.south);
-\draw [->,thick](node1-5.north)--(node2-5.south);
-\draw [->,thick](node1-6.north)--(node2-6.south);
-\draw [->,thick](node1-8.north)--(node2-8.south);
-\draw [->,thick](node2-1.east)--(node2-2.west);
-\draw [->,thick](node2-2.east)--([xshift=0.5em]node2-3.west);
-\draw [->,thick]([xshift=-0.5em]node2-3.east)--(node2-4.west);
-\draw [<-,thick](node2-5.east)--(node2-6.west);
-\draw [<-,thick](node2-6.east)--([xshift=0.5em]node2-7.west);
-\draw [<-,thick]([xshift=-0.5em]node2-7.east)--(node2-8.west);
-\begin{pgfonlayer}{background}
-{
-\node[fill=white,inner sep=0.5em,draw=black,line width=0.6pt,minimum width=6.0em,rounded corners=2pt,dashed] [fit =(node1-1)(node1-2)(node1-3)(node1-4)(node2-1)] (remark1) {};
-}
-\end{pgfonlayer}
-\begin{pgfonlayer}{background}
-{
-\node[fill=white,inner sep=0.5em,draw=black,line width=0.6pt,minimum width=6.0em,rounded corners=2pt,dashed] [fit =(node1-5)(node1-6)(node1-7)(node1-8)(node2-8)] (remark1) {};
-}
-\end{pgfonlayer}
-\node [anchor=north,embedding] (node0-2) at ([yshift=-2em]node1-4.south){\footnotesize{$\mathbi{e}_2$}};
-\node [anchor=east,embedding] (node0-1) at ([xshift=-1.4em]node0-2.west){\footnotesize{$\mathbi{e}_1$}};
-\node [anchor=north,scale=1.8] (node0-3) at ([yshift=-2em]node1-5.south){...};
-\node [anchor=north,embedding] (node0-4) at ([yshift=-2em]node1-6.south){\footnotesize{$\mathbi{e}_n$}};
-\draw [->,thick](node0-1.north)--(node1-1.south);
-\draw [->,thick](node0-1.north)--(node1-5.south);
-\draw [->,thick](node0-2.north)--(node1-2.south);
-\draw [->,thick](node0-2.north)--(node1-6.south);
-\draw [->,thick](node0-4.north)--(node1-4.south);
-\draw [->,thick](node0-4.north)--(node1-8.south);
-\node [anchor=south,embedding,fill=yellow!20](node3-2) at ([yshift=2em]node2-4.north){\footnotesize{$\seq{P}_2$}};
-\node [anchor=east,embedding,fill=yellow!20] (node3-1) at ([xshift=-1.4em]node3-2.west){\footnotesize{$\seq{P}_1$}};
-\node [anchor=south,scale=1.8] (node3-3) at ([yshift=2em]node2-5.north){...};
-\node [anchor=south,embedding,fill=yellow!20](node3-4) at ([yshift=2em]node2-6.north){\footnotesize{$\seq{P}_n$}};
-\draw [<-,thick](node3-1.south)--(node2-1.north);
-\draw [<-,thick](node3-1.south)--(node2-5.north);
-\draw [<-,thick](node3-2.south)--(node2-2.north);
-\draw [<-,thick](node3-2.south)--(node2-6.north);
-\draw [<-,thick](node3-4.south)--(node2-4.north);
-\draw [<-,thick](node3-4.south)--(node2-8.north);
-\end{tikzpicture}
--- a/Chapter16/Figures/figure-example-of-iterative-back-translation.tex
+++ b/Chapter16/Figures/figure-example-of-iterative-back-translation.tex
@@ -14,6 +14,12 @@
 \draw [->,thick]([yshift=-0.75em]node1-1.west)--(remark1.north east);
 \draw [->,thick,dashed](remark1.south east)--([yshift=-0.75em]node2-1.west);
+\node [anchor=west,font=\tiny,circle,draw=black,inner sep=0.1em](node3-3) at ([xshift=1.0em,yshift=3em]remark1.east){1};
+\node [anchor=west,font=\tiny,circle,draw=black,inner sep=0.1em](node3-4) at ([xshift=1.0em,yshift=-1.8em]remark1.east){2};
+\node [anchor=west,font=\tiny,circle,draw=black,inner sep=0.1em](node3-5) at ([xshift=1.0em,yshift=-1.0em]node1-1.east){3};
+\node [anchor=west,font=\tiny,circle,draw=black,inner sep=0.1em](node3-6) at ([xshift=1.0em,yshift=0.6em]node2-1.east){3};
 \node [anchor=west] (node4-1) at ([xshift=4.0em,yshift=-3.5em]node1-1.east) {\small{反向}};
 \node [anchor=north] (node4-2) at ([yshift=0.5em]node4-1.south) {\small{翻译模型}};
 \begin{pgfonlayer}{background}
@@ -21,6 +27,8 @@
 \node[fill=blue!20,inner sep=0.3em,draw=black,line width=0.6pt,minimum width=3.0em,drop shadow,rounded corners=2pt] [fit =(node4-1)(node4-2)]  (remark2) {};
 }
 \end{pgfonlayer}
+\node [anchor=west,font=\tiny,circle,draw=black,inner sep=0.1em](node4-3) at ([xshift=1.0em,yshift=-1.8em]remark2.east){4};
 \draw [->,thick]([yshift=-0.75em]node1-1.east)--(remark2.north west);
 \draw [->,thick]([yshift=-0.75em]node2-1.east)--(remark2.south west);
@@ -39,6 +47,9 @@
 }
 \end{pgfonlayer}
+\node [anchor=west,font=\tiny,circle,draw=black,inner sep=0.1em](node6-3) at ([xshift=1.0em,yshift=-1.0em]node5-1.east){5};
+\node [anchor=west,font=\tiny,circle,draw=black,inner sep=0.1em](node6-4) at ([xshift=1.0em,yshift=0.6em]node6-1.east){5};
 \draw [->,thick]([yshift=-0.75em]node5-1.east)--(remark3.north west);
 \draw [->,thick]([yshift=-0.75em]node6-1.east)--(remark3.south west);

--- a/Chapter16/Figures/figure-multi-language-single-model-system-diagram.tex
+++ b/Chapter16/Figures/figure-multi-language-single-model-system-diagram.tex
@@ -23,7 +23,7 @@
 \node[anchor=north,font=\footnotesize] (pair2) at ([yshift=-4.5em,xshift=1em]train.south) {双语句对2：};
 \node[anchor=west,lan](train3) at ([yshift=.7em,xshift=0.4em]pair2.east) {法语：{\color{red}<german>} \ Bonjour};
 \node[anchor=west,lan](train4) at ([yshift=-.7em,xshift=0.4em]pair2.east) {德语：Hallo};
-\node[anchor=north,font=\footnotesize] (decode) at ([yshift=-8em]train.south) {\small\bfnew{解码阶段：}};
+\node[anchor=north,font=\footnotesize] (decode) at ([yshift=-8em]train.south) {\small\bfnew{推断阶段：}};
 \node[anchor=north,font=\footnotesize] (input) at ([xshift=2.13em,yshift=-0.6em]decode.south) {输入：};
 \node[anchor=west,lan](decode2) at ([xshift=0.4em]input.east) {英语：{\color{red}<german>} \ hello};
 \node[anchor=north,font=\footnotesize] (output) at ([xshift=2.13em,yshift=-2.6em]decode.south) {输出：};

--- a/Chapter16/Figures/figure-optimization-of-the-model-initialization-method.tex
+++ b/Chapter16/Figures/figure-optimization-of-the-model-initialization-method.tex
@@ -14,7 +14,7 @@
 \draw [->,thick] ([yshift=1pt]data.north) .. controls +(90:2em) and +(90:2em) .. ([yshift=1pt]model.north) node[above,midway] {\small{参数优化}};
 \draw [->,thick] ([yshift=1pt]model.south) .. controls +(-90:2em) and +(-90:2em) .. ([yshift=1pt]data.south) node[below,midway] {\small{数据优化}};
-\node[word] at ([xshift=-0.5em,yshift=-4em]data.south){\small{(a) 思路1}};
+\node[word] at ([xshift=-0.5em,yshift=-4em]data.south){\small{(a) 基于数据的初始化方法}};
 \end{scope}
 \end{tikzpicture}
@@ -33,7 +33,7 @@
 \draw [->,thick] ([yshift=1pt]data.north) .. controls +(90:2em) and +(90:2em) .. ([yshift=1pt]model.north) node[above,midway] {\small{参数优化}};
 \draw [->,thick] ([yshift=1pt]model.south) .. controls +(-90:2em) and +(-90:2em) .. ([yshift=1pt]data.south) node[below,midway] {\small{数据优化}};
-\node[word] at ([xshift=-0.5em,yshift=-4em]model.south){\small{(b) 思路2}};
+\node[word] at ([xshift=-0.5em,yshift=-4em]model.south){\small{(b) 基于模型的初始化方法}};
 \end{scope}
 \end{tikzpicture}

--- a/Chapter16/Figures/figure-shared-space-inductive-bilingual-dictionary.tex
+++ b/Chapter16/Figures/figure-shared-space-inductive-bilingual-dictionary.tex
@@ -41,44 +41,44 @@
 \node[](circle2) at ([xshift=3.0em]circle1.east) {\input{Chapter16/Figures/figure-shared-space-inductive-bilingual-dictionary-b}};
 \node[](circle3) at ([xshift=5.5em]circle2.east) {\input{Chapter16/Figures/figure-shared-space-inductive-bilingual-dictionary-c}};
 \node[](circle4) at ([xshift=5.5em]circle3.east) {\input{Chapter16/Figures/figure-shared-space-inductive-bilingual-dictionary-d}};
-\draw[->,very thick] ([xshift=-0.5em]circle2.east)--([xshift=0.5em]circle3.west)node [pos=0.5,above] (pos1) {\scriptsize{Y空间}};
+\draw[->,very thick] ([xshift=-0.5em]circle2.east)--([xshift=0.5em]circle3.west)node [pos=0.5,above] (pos1) {\scriptsize{$\mathbi{Y}$空间}};
-\node [anchor=south](pos1-2) at ([yshift=-0.5em]pos1.north){\scriptsize{X映射到}};
+\node [anchor=south](pos1-2) at ([yshift=-0.5em]pos1.north){\scriptsize{\mathbi{X}映射到}};
 \draw[->,very thick] ([xshift=-0.5em]circle3.east)--([xshift=0.5em]circle4.west)node [pos=0.5,above] (pos2) {\scriptsize{推断}};
 \node [anchor=south](pos2-2) at ([yshift=-0.5em]pos2.north){\scriptsize{词典}};
 %circle1
-\node[rec,anchor=center,rotate=60,fill=green!40](c1x1) at ([xshift=-7em,yshift=-1.4em]circle1.east){\tiny{1}};
+\node[rec,anchor=center,rotate=60,fill=red!20](c1x1) at ([xshift=-7em,yshift=-1.4em]circle1.east){\tiny{1}};
-\node[rec,anchor=center,rotate=60,fill=green!40](c1x2) at ([xshift=-4.5em,yshift=1.8em]circle1.east){\tiny{2}};
+\node[rec,anchor=center,rotate=60,fill=red!20](c1x2) at ([xshift=-4.5em,yshift=1.8em]circle1.east){\tiny{2}};
-\node[rec,anchor=center,rotate=60,fill=green!40](c1x3) at ([xshift=-4em,yshift=-0.5em]circle1.east){\tiny{3}};
+\node[rec,anchor=center,rotate=60,fill=red!20](c1x3) at ([xshift=-4em,yshift=-0.5em]circle1.east){\tiny{3}};
-\node[rec,anchor=center,rotate=60,fill=green!40](c1x4) at ([xshift=-3.5em,yshift=-2.5em]circle1.east){\tiny{4}};
+\node[rec,anchor=center,rotate=60,fill=red!20](c1x4) at ([xshift=-3.5em,yshift=-2.5em]circle1.east){\tiny{4}};
-\node[rec,anchor=center,rotate=60,fill=green!40](c1x5) at ([xshift=-2em,yshift=1.0em]circle1.east){\tiny{5}};
+\node[rec,anchor=center,rotate=60,fill=red!20](c1x5) at ([xshift=-2em,yshift=1.0em]circle1.east){\tiny{5}};
 %circle2
-\node[cir,anchor=center,rotate=-30,fill=red!40] (c2a) at ([xshift=-5.3em,yshift=2.15em]circle2.east){\tiny{a}};
+\node[cir,anchor=center,rotate=-30,fill=blue!20] (c2a) at ([xshift=-5.3em,yshift=2.15em]circle2.east){\tiny{a}};
-\node[cir,anchor=east,rotate=-30,fill=red!40] (c2b) at ([xshift=2.0em,yshift=-1.25em]c2a.east){\tiny{b}};
+\node[cir,anchor=east,rotate=-30,fill=blue!20] (c2b) at ([xshift=2.0em,yshift=-1.25em]c2a.east){\tiny{b}};
-\node[cir,anchor=east,rotate=-30,fill=red!40] (c2c) at ([xshift=0.8em,yshift=-3.9em]c2a.south){\tiny{c}};
+\node[cir,anchor=east,rotate=-30,fill=blue!20] (c2c) at ([xshift=0.8em,yshift=-3.9em]c2a.south){\tiny{c}};
-\node[cir,anchor=east,rotate=-30,fill=red!40] (c2x) at ([xshift=-0.3em,yshift=-1.9em]c2a.south){\tiny{x}};
+\node[cir,anchor=east,rotate=-30,fill=blue!20] (c2x) at ([xshift=-0.3em,yshift=-1.9em]c2a.south){\tiny{x}};
-\node[cir,anchor=west,rotate=-30,fill=red!40] (c2y) at ([xshift=1.15em,yshift=-2.85em]c2a.east){\tiny{y}};
+\node[cir,anchor=west,rotate=-30,fill=blue!20] (c2y) at ([xshift=1.15em,yshift=-2.85em]c2a.east){\tiny{y}};
 %circle3
-\node[rec,anchor=center,rotate=-30,fill=green!40] (c3x1) at ([xshift=-6.7em,yshift=1.75em]circle3.east){\tiny{1}};
+\node[rec,anchor=center,rotate=-30,fill=red!20] (c3x1) at ([xshift=-6.7em,yshift=1.75em]circle3.east){\tiny{1}};
-\node[rec,anchor=east,rotate=-30,fill=green!40] (c3x2) at ([xshift=4.7em,yshift=-0.95em]c3x1.east){\tiny{2}};
+\node[rec,anchor=east,rotate=-30,fill=red!20] (c3x2) at ([xshift=4.7em,yshift=-0.95em]c3x1.east){\tiny{2}};
-\node[rec,anchor=east,rotate=-30,fill=green!40] (c3x3) at ([xshift=2.6em,yshift=-2.4em]c3x1.south){\tiny{3}};
+\node[rec,anchor=east,rotate=-30,fill=red!20] (c3x3) at ([xshift=2.6em,yshift=-2.4em]c3x1.south){\tiny{3}};
-\node[rec,anchor=east,rotate=-30,fill=green!40] (c3x4) at ([xshift=0.35em,yshift=-2.7em]c3x1.south){\tiny{4}};
+\node[rec,anchor=east,rotate=-30,fill=red!20] (c3x4) at ([xshift=0.35em,yshift=-2.7em]c3x1.south){\tiny{4}};
-\node[rec,anchor=west,rotate=-30,fill=green!40] (c3x5) at ([xshift=2.35em,yshift=-3.85em]c3x1.east){\tiny{5}};
+\node[rec,anchor=west,rotate=-30,fill=red!20] (c3x5) at ([xshift=2.35em,yshift=-3.85em]c3x1.east){\tiny{5}};
 %circle4
-\node[rec,anchor=center,rotate=-30,fill=green!40] (c4x1) at ([xshift=-6.7em,yshift=1.75em]circle4.east){\tiny{1}};
+\node[rec,anchor=center,rotate=-30,fill=red!20] (c4x1) at ([xshift=-6.7em,yshift=1.75em]circle4.east){\tiny{1}};
-\node[rec,anchor=east,rotate=-30,fill=green!40] (c4x2) at ([xshift=4.7em,yshift=-0.95em]c4x1.east){\tiny{2}};
+\node[rec,anchor=east,rotate=-30,fill=red!20] (c4x2) at ([xshift=4.7em,yshift=-0.95em]c4x1.east){\tiny{2}};
-\node[rec,anchor=east,rotate=-30,fill=green!40] (c4x3) at ([xshift=2.6em,yshift=-2.4em]c4x1.south){\tiny{3}};
+\node[rec,anchor=east,rotate=-30,fill=red!20] (c4x3) at ([xshift=2.6em,yshift=-2.4em]c4x1.south){\tiny{3}};
-\node[rec,anchor=east,rotate=-30,fill=green!40] (c4x4) at ([xshift=0.35em,yshift=-2.7em]c4x1.south){\tiny{4}};
+\node[rec,anchor=east,rotate=-30,fill=red!20] (c4x4) at ([xshift=0.35em,yshift=-2.7em]c4x1.south){\tiny{4}};
-\node[rec,anchor=west,rotate=-30,fill=green!40] (c4x5) at ([xshift=2.35em,yshift=-3.85em]c4x1.east){\tiny{5}};
+\node[rec,anchor=west,rotate=-30,fill=red!20] (c4x5) at ([xshift=2.35em,yshift=-3.85em]c4x1.east){\tiny{5}};
-\node[cir,anchor=center,rotate=-30,fill=red!40] (c4a) at ([xshift=-5.3em,yshift=2.15em]circle4.east){\tiny{a}};
+\node[cir,anchor=center,rotate=-30,fill=blue!20] (c4a) at ([xshift=-5.3em,yshift=2.15em]circle4.east){\tiny{a}};
-\node[cir,anchor=east,rotate=-30,fill=red!40] (c4b) at ([xshift=2.0em,yshift=-1.25em]c4a.east){\tiny{b}};
+\node[cir,anchor=east,rotate=-30,fill=blue!20] (c4b) at ([xshift=2.0em,yshift=-1.25em]c4a.east){\tiny{b}};
-\node[cir,anchor=east,rotate=-30,fill=red!40] (c4c) at ([xshift=0.8em,yshift=-3.9em]c4a.south){\tiny{c}};
+\node[cir,anchor=east,rotate=-30,fill=blue!20] (c4c) at ([xshift=0.8em,yshift=-3.9em]c4a.south){\tiny{c}};
-\node[cir,anchor=east,rotate=-30,fill=red!40] (c4x) at ([xshift=-0.3em,yshift=-1.9em]c4a.south){\tiny{x}};
+\node[cir,anchor=east,rotate=-30,fill=blue!20] (c4x) at ([xshift=-0.3em,yshift=-1.9em]c4a.south){\tiny{x}};
-\node[cir,anchor=west,rotate=-30,fill=red!40] (c4y) at ([xshift=1.15em,yshift=-2.85em]c4a.east){\tiny{y}};
+\node[cir,anchor=west,rotate=-30,fill=blue!20] (c4y) at ([xshift=1.15em,yshift=-2.85em]c4a.east){\tiny{y}};
 \draw [color=red,line width=0.7pt,rotate=18] ([xshift=-5.1em,yshift=3.7em]circle4.east) ellipse (1.6em and 0.9em); 
 \draw [color=red,line width=0.7pt,rotate=-5] ([xshift=-2.8em,yshift=0.6em]circle4.east) ellipse (1.6em and 0.9em);
@@ -88,8 +88,8 @@
 \node [anchor=north](part1) at ([yshift=0.5em]circle1.south){\small{$\mathbi{X}$}};
 \node [anchor=west](part2) at ([xshift=6em]part1.east){\small{$\mathbi{Y}$}};
-\node [anchor=west](part3) at ([xshift=8.2em]part2.east){\small{$\mathbi{X}\cdot \mathbi{W}$}};
+\node [anchor=west](part3) at ([xshift=8.5em]part2.east){\small{$\mathbi{X} \mathbi{W}$}};
-\node [anchor=west](part3) at ([xshift=15.0em]part2.east){\small{$\mathbi{X}\cdot \mathbi{W}$和$\mathbi{Y}$在同一空间}};
+\node [anchor=west](part3) at ([xshift=15.0em]part2.east){\small{$\mathbi{X} \mathbi{W}$和$\mathbi{Y}$在同一空间}};
 \node [anchor=center](c1) at (5.4,-1.0){\small{$\mathbi{W}$}};

--- a/Chapter16/Figures/figure-unsupervised-dual-learning-process.tex
+++ b/Chapter16/Figures/figure-unsupervised-dual-learning-process.tex
@@ -3,9 +3,9 @@
 \tikzstyle{circle} = [draw,black,line width=0.6pt,inner sep=3.5pt,rounded corners=4pt,minimum width=2em]
 \tikzstyle{word} = [inner sep=3.5pt]
-\node [anchor=center] (node1-1) at (0,0) {\small{\seq{x}}};
+\node [anchor=center] (node1-1) at (0,0) {\small{$\seq{x}$}};
-\node [anchor=west] (node1-2) at ([xshift=0.8em]node1-1.east) {\small{\seq{y}}};
+\node [anchor=west] (node1-2) at ([xshift=0.8em]node1-1.east) {\small{$\seq{y}$}};
-\node [anchor=north] (node1-3) at ([xshift=1.0em]node1-1.south) {\small{翻译模型f}};
+\node [anchor=north] (node1-3) at ([xshift=1.0em]node1-1.south) {\small{翻译模型$f$}};
 \draw [->,line width=0.6pt](node1-1.east)--(node1-2.west);
 \begin{pgfonlayer}{background}
@@ -21,9 +21,9 @@
 \draw [->,thick]([xshift=0.2em]remark1.east).. controls (2.9,-0.25) and (2.9,-0.7) ..([yshift=0.2em]node3.north);
-\node [anchor=north] (node4-1) at ([xshift=-1.0em,yshift=-7.0em]remark1.south) {\small{\seq{y}}};
+\node [anchor=north] (node4-1) at ([xshift=-1.0em,yshift=-7.0em]remark1.south) {\small{$\seq{y}$}};
-\node [anchor=west] (node4-2) at ([xshift=0.8em]node4-1.east) {\small{\seq{x}}};
+\node [anchor=west] (node4-2) at ([xshift=0.8em]node4-1.east) {\small{$\seq{x}$}};
-\node [anchor=north] (node4-3) at ([xshift=1.0em]node4-1.south) {\small{翻译模型g}};
+\node [anchor=north] (node4-3) at ([xshift=1.0em]node4-1.south) {\small{翻译模型$g$}};
 \draw [->,line width=0.6pt](node4-1.east)--(node4-2.west);
 \begin{pgfonlayer}{background}

--- a/Chapter16/Figures/lm-fusion.tex
+++ b/Chapter16/Figures/lm-fusion.tex
-\begin{tikzpicture}
-\tikzstyle{cir} = [draw,inner sep=2pt,line width=1pt,align=center,minimum height=2em,minimum width=2em,circle,fill=white]
-\tikzstyle{add} = [draw,inner sep=2pt,line width=1pt,align=center,minimum height=1em,minimum width=1em,fill=white]
-\tikzstyle{minicir} = [draw,inner sep=2pt,line width=1pt,align=center,minimum height=1em,minimum width=1em,fill=white,circle]
-\tikzstyle{rec} = [draw,inner sep=2pt,line width=1pt,align=center,minimum height=1.5em,minimum width=2.5em,fill=white]
-\tikzstyle{dia} = [draw,inner sep=2pt,line width=1pt,align=center,fill=white,diamond,minimum height=2em,minimum width=2em]
-\node [cir,anchor=north,dashed] (a0) at (0,0) {\tiny{$y_{t-1}$}};
-\node [cir,anchor=west] (a1) at ([xshift=4.0em]a0.east) {\tiny{$y_t$}};
-\node [add,anchor=north] (a11) at ([yshift=-1em]a1.south) {\tiny{$+$}};
-\node [minicir,anchor=north] (a12) at ([yshift=-1em]a11.south) {\tiny{$\times$}};
-\node [minicir,anchor=west] (a11-1) at ([xshift=0.8em]a12.east) {\tiny{$\beta$}};
-\node [rec,anchor=north] (a13) at ([yshift=-1.0em]a12.south) {\tiny{${\funp{P}}_{t}^{LM}$}};
-\node [rec,anchor=north] (a14) at ([yshift=-2.0em]a13.south) {\tiny{${\funp{P}}_{t}^{TM}$}};
-\node [dia,anchor=north] (a15) at ([yshift=-1em]a14.south) {\tiny{$\funp{C}_{t}$}};
-\node [anchor=west] (a13-2) at ([xshift=-4em]a13.west) {\tiny{$\cdots$}};
-\node [anchor=west] (a14-2) at ([xshift=-4em]a14.west) {\tiny{$\cdots$}};
-\node [anchor=west] (a15-2) at ([xshift=-4.25em]a15.west) {\tiny{$\cdots$}};
-\node [anchor=east] (a13-3) at ([yshift=0.8em]a13-2.west) {\small{模型语言}};
-\node [anchor=north] (a13-4) at ([xshift=0em]a13-3.south) {\small{隐藏层}};
-\node [anchor=east] (a14-3) at ([yshift=0.8em]a14-2.west) {\small{神经机器翻译}};
-\node [anchor=north] (a14-4) at ([xshift=0.5em]a14-3.south) {\small{模型隐藏层}};
-\node [anchor=east] (a15-3) at ([xshift=0em]a15-2.west) {\small{上下文向量}};
-\draw[->,thick](a11.north) -- (a1.south);
-\draw[->,thick](a12.north) -- (a11.south);
-\draw[->,thick](a13.north) -- (a12.south);
-\draw[->,thick](a11-1.west) -- (a12.east);
-\draw[->,dashed](a0.south) -- (a13.north west);
-\draw[->,dashed](a0.south) -- (a14.north west);
-\draw[->,thick](a15.north) -- (a14.south);
-\draw[->,dashed]([xshift=-2.0em]a13.west) -- (a13.west);
-\draw[->,dashed]([xshift=-2.0em]a14.west) -- (a14.west);
-\draw [->,thick] (a14.east) ..controls + (east:1em) and +(east:4.1em).. (a11.east);
-\draw[->,dashed](a1.south east) -- ([xshift=6.0em,yshift=-4.0em]a1.south);
-\draw[->,dashed](a1.south east) -- ([xshift=6.0em,yshift=-7.5em]a1.south);
-\draw[-]([xshift=5.9em,yshift=1.05em]a1.east) -- ([xshift=5.9em,yshift=-14.7em]a1.east);
-%%%%%%%%%%%%%%%%%%%%%%
-\node [cir,anchor=west] (a2) at ([xshift=10.0em]a1.east) {\tiny{$y_{t}$}};
-\node [add,anchor=north] (a21) at ([yshift=-1em]a2.south) {\tiny{$+$}};
-\node [minicir,anchor=north] (a22) at ([yshift=-1em]a21.south) {\tiny{$\times$}};
-\node [minicir,anchor=west] (a21-1) at ([xshift=0.8em]a22.east) {\tiny{$g_{t}$}};
-\node [cir,anchor=north] (a23) at ([yshift=-0.6125em]a22.south) {\tiny{${\funp{P}}_{t}^{LM}$}};
-\node [cir,anchor=north] (a24) at ([yshift=-1.217em]a23.south) {\tiny{${\funp{P}}_{t}^{TM}$}};
-\node [dia,anchor=north] (a25) at ([yshift=-0.6044em]a24.south) {\tiny{$\funp{C}_{t}$}};
-\node [anchor=west] (a23-2) at ([xshift=-3.5em]a23.west) {\tiny{$\cdots$}};
-\node [anchor=west] (a24-2) at ([xshift=-3.5em]a24.west) {\tiny{$\cdots$}};
-\node [anchor=west] (a25-2) at ([xshift=-3.65em]a25.west) {\tiny{$\cdots$}};
-\draw[->,thick](a21.north) -- (a2.south);
-\draw[->,thick](a22.north) -- (a21.south);
-\draw[->,thick](a23.north) -- (a22.south);
-\draw[->,thick](a21-1.west) -- (a22.east);
-\draw[->,thick](a25.north) -- (a24.south);
-\draw [->,thick] (a24.east) ..controls + (east:1em) and +(east:4.2em).. (a21.east);
-\draw [->,thick] (a25.west) ..controls + (west:1em) and +(west:2em).. (a21.west);
-\draw[->,dashed]([xshift=-1.5em]a23.west) -- (a23.west);
-\draw[->,dashed]([xshift=-1.5em]a24.west) -- (a24.west);
-\node [cir,anchor=west] (a3) at ([xshift=4.0em]a2.east) {\tiny{$y_{t+1}$}};
-\node [add,anchor=north] (a31) at ([yshift=-1em]a3.south) {\tiny{$+$}};
-\node [minicir,anchor=north] (a32) at ([yshift=-1em]a31.south) {\tiny{$\times$}};
-\node [minicir,anchor=west] (a31-1) at ([xshift=0.8em]a32.east) {\tiny{$g_{t}$}};
-\node [cir,anchor=north] (a33) at ([yshift=-0.6125em]a32.south) {\tiny{${\funp{P}}_{t}^{LM}$}};
-\node [cir,anchor=north] (a34) at ([yshift=-1.217em]a33.south) {\tiny{${\funp{P}}_{t}^{TM}$}};
-\node [dia,anchor=north] (a35) at ([yshift=-0.6044em]a34.south) {\tiny{$\funp{C}_{t}$}};
-\draw[->,thick](a31.north) -- (a3.south);
-\draw[->,thick](a32.north) -- (a31.south);
-\draw[->,thick](a33.north) -- (a32.south);
-\draw[->,thick](a31-1.west) -- (a32.east);
-\draw[->,thick](a35.north) -- (a34.south);
-\draw[->,dashed](a23.east) -- (a33.west);
-\draw[->,dashed](a24.east) -- (a34.west);
-\draw [->,thick] (a34.east) ..controls + (east:1em) and +(east:4.2em).. (a31.east);
-\draw [->,thick] (a35.west) ..controls + (west:1em) and +(west:2em).. (a31.west);
-\draw[->,dashed](a33.east) -- ([xshift=2em]a33.east);
-\draw[->,dashed](a34.east) -- ([xshift=2em]a34.east);
-\draw[->,dashed](a3.south east) -- ([xshift=6.0em,yshift=-4.0em]a3.south);
-\draw[->,dashed](a3.south east) -- ([xshift=6.0em,yshift=-7.5em]a3.south);
-\node[anchor=north](pos1) at ([xshift=-1.5em,yshift=-0.5em]a15.south) {(a) 浅融合};
-\node[anchor=north](pos2) at ([xshift=-2.0em,yshift=-0.5em]a35.south) {(b) 深融合};
-\end{tikzpicture}
--- a/Chapter16/chapter16.tex
+++ b/Chapter16/chapter16.tex
--- a/Chapter17/Figures/fig-cover.jpg
+++ b/Chapter17/Figures/fig-cover.jpg
--- a/Chapter17/Figures/figure-audio-processing.tex
+++ b/Chapter17/Figures/figure-audio-processing.tex
+\tikzstyle{process} = [rectangle,very thick,rounded corners,minimum width=4.7cm,minimum height=2.5cm,text centered,draw=black!70,fill=red!20]
+\tikzstyle{cir} = [circle,thick,rounded corners,minimum width=0.7cm,text centered,draw=black,fill=green!25]
+\begin{tikzpicture}[node distance = 0,scale = 0.7]
+\tikzstyle{every node}=[scale=0.7]
+\node(voice)[scale=1.0]{声波};
+\node(microphone)[rectangle,right of = voice,xshift=1.4cm,yshift=-1cm,minimum width=0.32cm,minimum height=0.35cm,fill=black!85,draw=black!85]{};
+\draw[black!85,line width=1.8]([yshift=0.38cm,xshift=-0.4cm]microphone.north)arc(180:360:0.4cm);
+\node(microphone_1)[rectangle,minimum width=0.4cm,minimum height=0.8cm,rounded corners=3pt,above of =microphone,yshift=0.75cm,draw=black!85,line width=2.5]{};
+\draw[-,black!85,very thick]([yshift=0.4cm,xshift=-0.2cm]microphone.north)--([yshift=0.4cm,xshift=-0cm]microphone.north);
+\draw[-,black!85,very thick]([yshift=0.5cm,xshift=-0.2cm]microphone.north)--([yshift=0.5cm,xshift=-0cm]microphone.north);
+\draw[-,black!85,very thick]([yshift=0.6cm,xshift=-0.2cm]microphone.north)--([yshift=0.6cm,xshift=-0cm]microphone.north);
+\draw[-,black!85,line width=1.8]([yshift=0.6cm,xshift=-0.4cm]microphone.north)--([yshift=0.37cm,xshift=-0.4cm]microphone.north);
+\draw[-,black!85,line width=1.8]([yshift=0.6cm,xshift=0.4cm]microphone.north)--([yshift=0.37cm,xshift=0.4cm]microphone.north);
+\draw[black!85,line width=1]([yshift=0.8cm,xshift=-0.8cm]microphone.north)arc(-45:45:0.3cm);
+\draw[black!85,line width=1]([yshift=0.75cm,xshift=-0.7cm]microphone.north)arc(-45:45:0.4cm);
+\draw[black!85,line width=1]([yshift=0.7cm,xshift=-0.6cm]microphone.north)arc(-45:45:0.5cm);
+\node(process_1)[process,right of = microphone,xshift=4.7cm,yshift=0.5cm]{};
+\node(text_1)[below of = process_1,yshift=-2cm,scale=1.3]{采样};
+\draw [very thick,rounded corners=10pt]([xshift=-2.2cm,yshift=-1cm]process_1.center)--([xshift=-1.8cm,yshift=1cm]process_1.center)--([xshift=-1.4cm,yshift=0cm]process_1.center)--([xshift=-1.1cm,yshift=0.8cm]process_1.center)--([xshift=-0.8cm,yshift=-0.4cm]process_1.center)--([xshift=-0.5cm,yshift=0.4cm]process_1.center);
+\draw [->,very thick]([xshift=-0.3cm]process_1.center)to([xshift=0.3cm]process_1.center);
+\draw [very thick,rounded corners=10pt,densely dotted]([xshift=0.5cm,yshift=-1cm]process_1.center)--([xshift=0.9cm,yshift=1cm]process_1.center)--([xshift=1.3cm,yshift=0cm]process_1.center)--([xshift=1.6cm,yshift=0.8cm]process_1.center)--([xshift=1.9cm,yshift=-0.4cm]process_1.center)--([xshift=2.2cm,yshift=0.4cm]process_1.center);
+\node(process_2)[process,right of = process_1,xshift=6.6cm]{};
+\node(text_2)[below of = process_2,yshift=-2cm,scale=1.3]{量化};
+\draw [very thick,rounded corners=10pt,densely dotted]([xshift=-2.2cm,yshift=-1cm]process_2.center)--([xshift=-1.8cm,yshift=1cm]process_2.center)--([xshift=-1.4cm,yshift=0cm]process_2.center)--([xshift=-1.1cm,yshift=0.8cm]process_2.center)--([xshift=-0.8cm,yshift=-0.4cm]process_2.center)--([xshift=-0.5cm,yshift=0.4cm]process_2.center);
+\draw [->,very thick]([xshift=-0.3cm]process_2.center)to([xshift=0.3cm]process_2.center);
+\draw [very thick,]([xshift=0.5cm,yshift=-0.8cm]process_2.center)--([xshift=0.5cm,yshift=0.3cm]process_2.center)--([xshift=0.7cm,yshift=0.3cm]process_2.center)--([xshift=0.7cm,yshift=0.8cm]process_2.center)--([xshift=1cm,yshift=0.8cm]process_2.center)--([xshift=1cm,yshift=0.2cm]process_2.center)--([xshift=1.3cm,yshift=0.2cm]process_2.center)--([xshift=1.3cm,yshift=0.6cm]process_2.center)--([xshift=1.6cm,yshift=0.6cm]process_2.center)--([xshift=1.6cm,yshift=-0.3cm]process_2.center)--([xshift=1.8cm,yshift=-0.3cm]process_2.center)--([xshift=1.8cm,yshift=0.3cm]process_2.center)--([xshift=2cm,yshift=0.3cm]process_2.center);
+\node(text1)[left of = process_1,xshift=-3.2cm,yshift=-0.5cm,align=center]{模拟\\语音信号};
+\node(text2)[right of = process_1,xshift=3.3cm,yshift=-0.5cm,align=center]{离散\\时间信号};
+\node(text3)[right of = process_2,xshift=3.2cm,yshift=-0.5cm,align=center]{数字离散\\时间信号};
+\draw[->,very thick](process_1.east)to(process_2.west);
+\draw[->,very thick]([xshift=-1.8cm]process_1.west)to(process_1.west);
+\draw[->,very thick](process_2.east)to([xshift=1.8cm]process_2.east);
+%%%%音频
+\node(signal)[right of = process_2,xshift=5.5cm]{};
+\draw[-,thick,]([xshift=-1.2cm]signal.center)--([xshift=1.2cm]signal.center);
+\draw[-,thick]([xshift=-1cm,yshift=-0.8cm]signal.center)--([xshift=-0.9cm,yshift=0.4cm]signal.center)--([xshift=-0.8cm,yshift=-0.3cm]signal.center)--([xshift=-0.7cm,yshift=0.7cm]signal.center)--([xshift=-0.6cm,yshift=-0.1cm]signal.center)--([xshift=-0.5cm,yshift=0.3cm]signal.center)--([xshift=-0.4cm,yshift=-0.5cm]signal.center)--([xshift=-0.3cm,yshift=0.7cm]signal.center)--([xshift=-0.2cm,yshift=-0.2cm]signal.center)--([xshift=-0.1cm,yshift=0.4cm]signal.center)--([xshift=0cm,yshift=-0.9cm]signal.center)--([xshift=0.1cm,yshift=0.5cm]signal.center)--([xshift=0.2cm,yshift=-0.4cm]signal.center)--([xshift=0.3cm,yshift=0.3cm]signal.center)--([xshift=0.4cm,yshift=-0.2cm]signal.center)--([xshift=0.5cm,yshift=0.1cm]signal.center)--([xshift=0.6cm,yshift=-0.8cm]signal.center)--([xshift=0.7cm,yshift=0.4cm]signal.center)--([xshift=0.8cm,yshift=-0.6cm]signal.center)--([xshift=0.9cm,yshift=0.7cm]signal.center)--([xshift=1cm,yshift=-0.2cm]signal.center);
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-cache.tex
+++ b/Chapter17/Figures/figure-cache.tex
+\begin{tikzpicture}
+	\tikzstyle{prob}=[minimum width=0.4em, fill=blue!15,inner sep=0pt]
+\node[draw,fill=red!20,inner sep=0pt,minimum width=3em,minimum height=5em](key) at (0,0){};
+\draw[] ([yshift=0.5em]key.180) -- ([yshift=0.5em]key.0);
+\draw[] ([yshift=1.5em]key.180) -- ([yshift=1.5em]key.0);
+\draw[] ([yshift=-0.5em]key.180) -- ([yshift=-0.5em]key.0);
+\draw[] ([yshift=-1.5em]key.180) -- ([yshift=-1.5em]key.0);
+\node[draw,fill=blue!20,inner sep=0pt,minimum width=3em,minimum height=5em](value) at (3em,0){};
+\draw[] ([yshift=0.5em]value.180) -- ([yshift=0.5em]value.0);
+\draw[] ([yshift=1.5em]value.180) -- ([yshift=1.5em]value.0);
+\draw[] ([yshift=-0.5em]value.180) -- ([yshift=-0.5em]value.0);
+\draw[] ([yshift=-1.5em]value.180) -- ([yshift=-1.5em]value.0);
+\node[anchor=south,font=\footnotesize,inner sep=0pt] at ([yshift=0.1em]key.north){key};
+\node[anchor=south,font=\footnotesize,inner sep=0pt] at ([yshift=0.2em]value.north){value};
+\node[anchor=south,font=\footnotesize,inner sep=0pt] (cache)at ([yshift=2em,xshift=1.5em]key.north){\small\bfnew{Cache}};
+\node[draw,anchor=east,minimum size=2.4em] (dt) at ([yshift=1.4em,xshift=-4em]key.west){$\mathbi{d}_\mathbi{t}$};
+\node[draw,anchor=east,minimum size=2.4em] (st) at ([xshift=-4em]dt.west){$\mathbi{s}_\mathbi{t}$};
+\node[draw,anchor=east,minimum size=2.4em] (st2) at ([xshift=-0.8em,yshift=4em]dt.west){$ \widetilde{\mathbi{s}}_\mathbi{t}$};
+\node[draw,anchor=north,circle,inner sep=0pt, minimum size=1.2em,fill=yellow] (add) at ([yshift=-1em]st2.south){+};
+\node[anchor=north,inner sep=0pt,font=\footnotesize,text=red] at ([yshift=-1em]add.south){combining};
+\node[draw,anchor=east,minimum size=2.4em] (ct) at ([xshift=-3em,yshift=-3.5em]st.west){$ \widetilde{\mathbi{C}}_\mathbi{t}$};
+\node[anchor=east] (y) at ([xshift=-6em,yshift=1em]st.west){$\mathbi{y}_{\mathbi{t}-\mathbi{1}}$};
+\node[draw,anchor=east,minimum width=8em,minimum height=1.6em,fill=blue!20] (output) at ([xshift=-2.6em,yshift=2.6em]st2.west){};
+\node[anchor=south] (yt) at ([yshift=5em]output.north){$\mathbi{y}_{\mathbi{t}}$};
+\draw[] ([xshift=-0.8em]output.90) -- ([xshift=-0.8em]output.-90);
+\draw[] ([xshift=-2.4em]output.90) -- ([xshift=-2.4em]output.-90);
+\draw[] ([xshift=0.8em]output.90) -- ([xshift=0.8em]output.-90);
+\draw[] ([xshift=2.4em]output.90) -- ([xshift=2.4em]output.-90);
+\foreach \x/\y in {1/2,2/1,3/5,4/1,5/1,6/1,7/3,8/4,9/2,10/3,11/5,12/5,13/2,14/5,15/5,16/5,17/13,18/2,19/4,20/1}
+	\node[draw=blue!20,anchor=south,prob,minimum height=0.2em*\y] at ([yshift=1em,xshift=-4.2em+0.4em*\x]output.north){};
+\begin{pgfonlayer}{background}
+\node[inner sep=3pt,draw,dotted,rounded corners=2pt,very thick][fit=(key)(value)(cache)](box){};
+\end{pgfonlayer}
+\draw[-latex,dashed,very thick,out=-145,in=10] ([yshift=1.6em]box.180) to node[above,font=\footnotesize,text=red,rotate=25]{reading}(dt.0);
+\draw[-latex,dashed,very thick,out=-5,in=-170] (ct.0) to node[above,font=\footnotesize,text=red,pos=0.7,rotate=8]{matching}([yshift=-2.5em]box.180);
+\draw[-,very thick,out=0,in=-135](st.0) to (add.-135);
+\draw[-,very thick,out=180,in=-45](dt.180) to (add.-45);
+\draw[-latex,very thick] (add.90) -- (st2.-90);
+\draw[-latex,very thick,out=100,in=-100] (ct.90) to (output.-90);
+\draw[-latex,very thick,out=180,in=-100] (st2.180) to (output.-90);
+\draw[-latex,very thick,out=80,in=-100] (y.90) to (output.-90);
+\draw[-latex,very thick] (output.90) -- ([yshift=1em]output.90);
+\draw[-latex,very thick] ([yshift=-1em]yt.-90) -- (yt.-90);
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-cascading-speech-translation.tex
+++ b/Chapter17/Figures/figure-cascading-speech-translation.tex
+\tikzstyle{process} = [rectangle,very thick,rounded corners,minimum width=3.2cm,minimum height=3cm,text centered,draw=black!70,fill=red!20]
+\tikzstyle{cir} = [circle,thick,rounded corners,minimum width=0.7cm,text centered,draw=black,fill=green!25]
+\begin{tikzpicture}[node distance = 0,scale = 0.5]
+\tikzstyle{every node}=[scale=0.5]
+\node(process_1)[process]{};
+\draw[-,thick]([xshift=-1.2cm]process_1.center)--([xshift=1.2cm]process_1.center);
+\draw[-,thick]([xshift=-1cm,yshift=-0.8cm]process_1.center)--([xshift=-0.9cm,yshift=0.4cm]process_1.center)--([xshift=-0.8cm,yshift=-0.3cm]process_1.center)--([xshift=-0.7cm,yshift=0.7cm]process_1.center)--([xshift=-0.6cm,yshift=-0.1cm]process_1.center)--([xshift=-0.5cm,yshift=0.3cm]process_1.center)--([xshift=-0.4cm,yshift=-0.5cm]process_1.center)--([xshift=-0.3cm,yshift=0.7cm]process_1.center)--([xshift=-0.2cm,yshift=-0.2cm]process_1.center)--([xshift=-0.1cm,yshift=0.4cm]process_1.center)--([xshift=0cm,yshift=-0.9cm]process_1.center)--([xshift=0.1cm,yshift=0.5cm]process_1.center)--([xshift=0.2cm,yshift=-0.4cm]process_1.center)--([xshift=0.3cm,yshift=0.3cm]process_1.center)--([xshift=0.4cm,yshift=-0.2cm]process_1.center)--([xshift=0.5cm,yshift=0.1cm]process_1.center)--([xshift=0.6cm,yshift=-0.8cm]process_1.center)--([xshift=0.7cm,yshift=0.4cm]process_1.center)--([xshift=0.8cm,yshift=-0.6cm]process_1.center)--([xshift=0.9cm,yshift=0.7cm]process_1.center)--([xshift=1cm,yshift=-0.2cm]process_1.center);
+\node(text_1)[below of = process_1,yshift=-2cm,scale=1.5]{语音信号};
+\node(process_2)[process,right of = process_1,xshift=7.0cm,text width=4cm,align=center]{\baselineskip=4pt\LARGE{[[0.2,...,0.3], \qquad ..., \qquad  0.3,...,0.5]]}\par};
+\node(text_2)[below of = process_2,yshift=-2cm,scale=1.5]{语音特征};
+\node(process_3)[process,,minimum width=6cm,minimum height=5cm,right of = process_2,xshift=8.2cm,text width=4cm,align=center]{};
+\node(text_3)[below of = process_3,yshift=-3cm,scale=1.5]{源语文本及其词格};
+\node(cir_s)[cir,very thick, below of = process_3,xshift=-2.2cm,yshift=1.1cm]{\LARGE S};
+\node(cir_a)[cir,right of = cir_s,xshift=1cm,yshift=0.8cm]{\LARGE a};
+\node(cir_c)[cir,right of = cir_a,xshift=1.2cm,yshift=0cm]{\LARGE c};
+\node(cir_f)[cir,right of = cir_c,xshift=1.2cm,yshift=0cm]{\LARGE f};
+\node(cir_E)[cir,very thick,right of = cir_f,xshift=1cm,yshift=-0.8cm]{\LARGE E};
+\node(cir_b)[cir,right of = cir_s,xshift=1cm,yshift=-0.8cm]{\Large b};
+\node(cir_d)[cir,right of = cir_b,xshift=1cm,yshift=0.6cm]{\Large d};
+\node(cir_e)[cir, right of = cir_b,xshift=1cm,yshift=-0.8cm]{\LARGE e};
+\node(cir_g)[cir,right of = cir_e,xshift=1cm,yshift=0.8cm]{\LARGE g};
+\draw[-latex](cir_s)node[above,xshift=0.3cm,yshift=0.4cm]{0.4}to(cir_a);
+\draw[-latex](cir_a)node[above,xshift=0.6cm,yshift=0cm]{1}to(cir_c);
+\draw[-latex](cir_c)node[above,xshift=0.6cm,yshift=0cm]{1}to(cir_f);
+\draw[-latex](cir_f)node[above,xshift=0.6cm,yshift=-0.3cm]{1}to(cir_E);
+\draw[-latex](cir_s)node[above,xshift=0.7cm,yshift=-0.4cm]{0.6}to(cir_b);
+\draw[-latex](cir_b)node[above,xshift=0.3cm,yshift=0.3cm]{0.8}to(cir_d);
+\draw[-latex](cir_b)node[above,xshift=0.7cm,yshift=-0.4cm]{0.2}to(cir_e);
+\draw[-latex](cir_e)node[above,xshift=0.3cm,yshift=0.3cm]{1}to(cir_g);
+\draw[-latex](cir_d)node[above,xshift=0.7cm,yshift=0cm]{1}to(cir_f);
+\draw[-latex](cir_g)node[above,xshift=0.6cm,yshift=0.3cm]{1}--(cir_E);
+\node(text)[below of = process_3,yshift=-1.8cm,scale=1.8]{你是谁};
+\node(process_4)[process,right of = process_3,xshift=8.2cm,text width=4cm,align=center]{\Large\textbf{Who are you?}};
+\node(text_4)[below of = process_4,yshift=-2cm,scale=1.5]{翻译译文};
+\draw[->,very thick](process_1.east)to(process_2.west);
+\draw[->,very thick](process_2.east)to(process_3.west);
+\draw[->,very thick](process_3.east)to(process_4.west);
+\node(arrow_text1)[right of = process_1,xshift=3.2cm,yshift=0.7cm,scale=1.4,align=center]{音频\\特征提取};
+\node(arrow_text2)[right of = process_2,xshift=3.6cm,yshift=0.7cm,scale=1.4,align=center]{语音\\识别系统};
+\node(arrow_text3)[right of = process_3,xshift=4.5cm,yshift=0.4cm,scale=1.4]{翻译系统};
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-framing-schematic.tex
+++ b/Chapter17/Figures/figure-framing-schematic.tex
+\tikzstyle{process} = [rectangle,very thick,rounded corners,minimum width=5cm,minimum height=2.5cm,text centered,draw=black!70,fill=red!25]
+\tikzstyle{cir} = [circle,thick,rounded corners,minimum width=0.7cm,text centered,draw=black,fill=green!25]
+\begin{tikzpicture}[node distance = 0,scale = 1]
+\tikzstyle{every node}=[scale=1]
+\node [anchor=center](ori) at (-0.2,-0.2) {$O$};
+\draw[->,thick](-0.5,0)--(5,0)node[below]{$t$};
+\draw[->,thick](0,-2)--(0,2)node[left,scale=0.8]{量化值};
+\draw[-,thick](0,0)sin(0.7,1.5)cos(1.4,0)sin(2.1,-1.5)cos(2.8,0)sin(3.5,1.5)cos(4.2,0);
+\draw[-,thick,dashed](0.5,-1.8)--(0.5,1.8);
+\draw[-](1.2,-1.8)--(1.2,1.8);
+\draw[-,thick,dashed](1.9,-1.8)--(1.9,1.8);
+\draw[<->,thick](0,-1.1)--(1.2,-1.1)node[left,xshift=-0.05cm,yshift=0.15cm,scale=0.6]{帧长};
+\draw[<->,thick](0,-1.4)--(0.5,-1.4)node[left,xshift=0.05cm,yshift=-0.25cm,scale=0.6]{帧移};
+\draw[<->,thick](0.5,-1.4)--(1.9,-1.4);
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-joint-encoder-mode.tex
+++ b/Chapter17/Figures/figure-joint-encoder-mode.tex
+\tikzstyle{coder} = [rectangle,thick,rounded corners,minimum width=2.3cm,minimum height=1cm,text centered,draw=black!70,fill=red!25]
+\begin{tikzpicture}[node distance = 0,scale = 0.75]
+\tikzstyle{every node}=[scale=0.75]
+\node(encoder)[coder]{\large{编码器}};
+\node(decoder_1)[coder,above of =encoder,xshift=-1.6cm,yshift=2.4cm,fill=blue!20]{\large{解码器}};
+\node(decoder_2)[coder,above of =encoder, xshift=1.6cm,yshift=2.4cm,fill=yellow!25]{\large{解码器}};
+\node(s)[below of = encoder,yshift=-1.8cm,scale=1.6]{$s$};
+\node(y)[above of = decoder_2,yshift=1.8cm,scale=1.6]{$y$};
+\draw[->,thick](s.north)to(encoder.south);
+\draw[->,thick](decoder_1.east)to(decoder_2.west);
+\draw[->,thick](decoder_2.north)to(y.south);
+\draw[->,thick](encoder.north)--([yshift=0.6725cm]encoder.north)--([yshift=-0.7cm]decoder_1.south)--(decoder_1.south);
+\draw[->,thick](encoder.north)--([yshift=0.6725cm]encoder.north)--([yshift=-0.7cm]decoder_2.south)--(decoder_2.south);
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-layer.tex
+++ b/Chapter17/Figures/figure-layer.tex
+\begin{tikzpicture}
+\foreach \x in {1,2,3,4}
+	\node[inner sep=0pt,minimum size=1em,fill=ublue,circle] (c1_\x) at (0em+1.6em*\x, 0em){};
+\foreach \x in {1,2,3,4,5,6}
+	\node[inner sep=0pt,minimum size=1em,fill=ublue,circle] (c2_\x) at (8.4em+1.6em*\x, 0em){};
+\foreach \x in {1,2,3,4,5}
+	\node[inner sep=0pt,minimum size=1em,fill=ublue,circle] (c3_\x) at (20em+1.6em*\x, 0em){};
+\foreach \x in {1,2,3,4,5}
+	\node[inner sep=0pt,minimum size=1em,fill=orange,circle] (c4_\x) at (20em+1.6em*\x, 9.4em){};
+\node[inner sep=0pt,minimum size=1em,fill=ugreen,circle] (c5) at (9em, 7em){};
+\node[inner sep=0pt,minimum size=1.2em,fill=ugreen,circle] (qs) at (18.6em, 5em){};
+\node[inner sep=0pt,minimum size=1.2em,fill=ugreen,circle] (qw) at (18.6em, 3em){};
+\node[fill=orange,inner sep=0pt, minimum size=1.2em, circle, text=white] (sigma) at (24.8em, 7em){\small\bfnew{$\sigma$}};
+\node[fill=ugreen,inner sep=0pt, minimum size=1.2em, circle, text=white] (add1) at (4em, 3em){\small\bfnew{+}};
+\node[fill=ugreen,inner sep=0pt, minimum size=1.2em, circle, text=white] (add2) at (14em, 3em){\small\bfnew{+}};
+\node[fill=ugreen,inner sep=0pt, minimum size=1.2em, circle, text=white] (add3) at (9em, 5em){\small\bfnew{+}};
+\begin{pgfonlayer}{background}
+\node[draw,rounded corners=2pt,drop shadow,fill=white][fit=(c1_1)(c1_4)](box1){};
+\node[draw,rounded corners=2pt,drop shadow,fill=white][fit=(c2_1)(c2_6)](box2){};
+\node[draw,rounded corners=2pt,drop shadow,fill=white][fit=(c3_1)(c3_5)](box3){};
+\node[draw,rounded corners=2pt,drop shadow,fill=white][fit=(c4_1)(c4_5)](box4){};
+\node[draw,rounded corners=2pt,inner xsep=6pt,drop shadow,fill=white][fit=(c5)](box5){};
+\end{pgfonlayer}
+\node[draw,dash pattern=on 3pt off 1pt,minimum width=1.6em, minimum height=2em,very thick] (n1) at (24.8em,0em){};
+\node[draw,dash pattern=on 3pt off 1pt,minimum width=1.6em, minimum height=2em,very thick] (n2) at (24.8em,9.4em){};
+\node[] at (24.8em, -1.5em){$\mathbi{x}_\mathbi{t}$};
+\node[text=ublue] at (8.2em, 0em) {\small\bfnew{...}};
+\draw[-latex, out=70, in=-120] (c1_1.90) node[xshift=-0.4em,yshift=1.2em]{$ \mathbi{h}_ \mathbi{i}^ \mathbi{j}$}to (add1.-90);
+\draw[-latex, out=80, in=-100] (c1_2.90) to (add1.-90);
+\draw[-latex, out=100, in=-80] (c1_3.90) to (add1.-90);
+\draw[-latex, out=110, in=-60] (c1_4.90) to (add1.-90);
+\draw[-latex, out=60, in=-140] (c2_1.90) to (add2.-90);
+\draw[-latex, out=70, in=-120] (c2_2.90) to (add2.-90);
+\draw[-latex, out=80, in=-100] (c2_3.90) to (add2.-90);
+\draw[-latex, out=100, in=-80] (c2_4.90) to (add2.-90);
+\draw[-latex, out=110, in=-60] (c2_5.90) to (add2.-90);
+\draw[-latex, out=120, in=-40] (c2_6.90) to (add2.-90);
+\draw[-latex, out=20, in=-150] (add1.90) node[xshift=-0.4em,yshift=1.2em]{$ \mathbi{s}^ \mathbi{j}$} to (add3.-90);
+\draw[-latex, out=160, in=-30] (add2.90) to (add3.-90);
+\draw[-latex] (add3.90) -- (box5.-90);
+\draw[-latex] (box5.0) -- node[xshift=-3em,above]{$ \mathbi{d}_\mathbi{t}$}(sigma.180);
+\draw[-latex, ugreen!60] (qs.180) node[xshift=-1em,above,text=black]{$ \mathbi{q}_\mathbi{s}$}-- (add3.0);
+\draw[-, ugreen!60] (qw.180) node[xshift=-1em,above,text=black]{$ \mathbi{q}_\mathbi{w}$}-- (add2.0);
+\draw[-latex, ugreen!60] (add2.180) -- (add1.0);
+\draw[-latex] (n1.130) -- (qw.0);
+\draw[-latex] (n1.120) -- (qs.0);
+\draw[-latex] (n1.90) node[yshift=1em,right]{$ \mathbi{h}_\mathbi{t}$}-- (sigma.-90);
+\draw[-latex] (sigma.90) -- (n2.-90);
+\draw[-latex] (n2.90) -- node[right]{$ \widetilde{\mathbi{h}}_\mathbi{t}$}([yshift=2em]n2.90);
+\draw[decorate,decoration={brace, mirror},gray, thick] ([yshift=-2em]box1.-180) -- node[font=\scriptsize,text=black,below]{前几句}([yshift=-2em]box2.0);
+\draw[decorate,decoration={brace, mirror},gray, thick] ([yshift=-2em]box3.-180) -- node[font=\scriptsize,text=black,below]{当前句}([yshift=-2em]box3.0);
+%annotation
+\node[fill=ublue,rounded corners=1pt,inner sep=0pt,minimum size=1em] (a1) at (2em,-4.5em) {};
+\node[anchor=west,font=\footnotesize] (w1) at ([xshift=0.4em]a1.east) {编码表示};
+\node[anchor=west,fill=ugreen,rounded corners=1pt,inner sep=0pt,minimum size=1em] (a2) at ([xshift=2em]w1.east) {};
+\node[anchor=west,font=\footnotesize] (w2)at ([xshift=0.4em]a2.east) {层次注意力};
+\node[anchor=west,fill=orange,rounded corners=1pt,inner sep=0pt,minimum size=1em] (a3) at ([xshift=2em]w2.east) {};
+\node[anchor=west,font=\footnotesize] at ([xshift=0.4em]a3.east) {融合上下文信息的编码表示};
+\end{tikzpicture}
--- a/Chapter17/Figures/figure-multiencoder.tex
+++ b/Chapter17/Figures/figure-multiencoder.tex
+\definecolor{color1}{rgb}{1,0.725,0.058}
+\tikzstyle{coder} = [rectangle,thick,rounded corners,minimum width=2.8cm,minimum height=1.1cm,text centered,draw=black!70,fill=blue!10,drop shadow]
+\tikzstyle{attention} = [rectangle,thick,rounded corners,minimum width=2.6cm,minimum height=0.9cm,text centered,draw=black!70,fill=green!25,drop shadow]
+\begin{tikzpicture}[node distance = 0,scale = 0.7]
+\tikzstyle{every node}=[scale=0.7]
+\node(encoder_c)[coder]{\large{编码器}};
+\node(encoder_s)[coder, right of = encoder_c, xshift=3.5cm, fill=red!25]{\large{编码器}};
+\node(h_pre)[above of = encoder_c, yshift=1.3cm,scale=1.3]{${\mathbi{h}}_{pre}$};
+\node(h)[above of = encoder_s, yshift=1.3cm,scale=1.3]{$\mathbi{h}$};
+\node(cir)[circle,very thick, right of = h, draw=black!90,minimum width=0.5cm,xshift=1.1cm]{};
+\draw[-,very thick,draw=black!90]([xshift=0.04cm]cir.west)--([xshift=-0.04cm]cir.east);
+\draw[-,very thick,draw=black!90]([yshift=-0.04cm]cir.north)--([yshift=0.04cm]cir.south);
+\node(last)[below of = encoder_c, yshift=-1.3cm]{\large{前一句}};
+\node(current)[below of = encoder_s, yshift=-1.3cm]{\large{当前句}};
+\node(attention_left)[attention, above of = encoder_c, xshift=2.4cm,yshift=3.1cm]{\large{注意力机制}};
+\node(d)[above of = attention_left, yshift=1.1cm,scale=1.3]{$\mathbi{d}$};
+\node(ground)[rectangle, thick, rounded corners, minimum width=5cm, minimum height=5.5cm, right of = encoder_s, xshift=4.4cm,yshift=2.2cm, draw=black!70, fill=gray!10]{};
+\node(decoder)[above of = encoder_s, xshift=3.1cm]{\large{解码器}};
+\node(attention_right)[attention, right of = attention_left, xshift=5.4cm,yshift=-0.4cm]{\large{注意力机制}};
+\node(target)[right of = current, xshift=5.3cm]{\large{目标句}};
+\node(point_below)[right of = encoder_s, xshift=5.3cm]{\Huge{...}};
+\node(point_above)[above of = attention_right, yshift=1.8cm]{\Huge{...}};
+\draw[->, very thick](last)to([yshift=-0.05cm]encoder_c.south);
+\draw[->, very thick](current)to([yshift=-0.05cm]encoder_s.south);
+\draw[->, very thick](target.north)to([yshift=-0.05cm]point_below.south);
+\draw[->, very thick]([yshift=0.05cm]encoder_c.north)to([yshift=0.03cm]h_pre.south);
+\draw[->, very thick]([yshift=0.05cm]encoder_s.north)to(h.south);
+\draw[->, very thick]([yshift=0cm]h.north)to([yshift=0.95cm]h.north);
+\draw[->, very thick,in=270,out=90]([yshift=-0.15cm]h_pre.north)to([xshift=1.25cm,yshift=0.9cm]h_pre.north);
+\draw[->, very thick,in=270,out=80]([yshift=-0.15cm]h_pre.north)to([xshift=2.4cm,yshift=0.9cm]h_pre.north);
+\draw[->, very thick]([yshift=0.03cm]attention_left.north)to([yshift=0.1cm]d.south);
+\draw[->, very thick]([xshift=-0.03cm]h.east)to([xshift=-0.03cm]cir.west);
+\draw[->, very thick](point_below.north)to([yshift=2.03cm]point_below.north);
+\draw[->, very thick](attention_right.north)to([yshift=-0.03cm]point_above.south);
+\draw[->, very thick](point_above.north)to([yshift=0.83cm]point_above.north);
+\draw[->, very thick, in=270,out=0]([xshift=0.2cm]cir.east)to([xshift=3cm,yshift=0.88cm]cir.east);
+\draw[->, very thick, in=270,out=0]([xshift=0.2cm]cir.east)to([xshift=2cm,yshift=0.88cm]cir.east);
+\draw[->,very thick,]([xshift=0.1cm]d.east)to([xshift=1.92cm]d.east)to([yshift=0.03cm]cir.north);
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-three-ways-of-dual-decoder-speech-translation.tex
+++ b/Chapter17/Figures/figure-three-ways-of-dual-decoder-speech-translation.tex
+\tikzstyle{coder} = [rectangle,thick,rounded corners,minimum width=2.3cm,minimum height=1cm,text centered,draw=black!70,fill=red!20]
+\begin{tikzpicture}[node distance = 0,scale = 0.75]
+\tikzstyle{every node}=[scale=0.75]
+\node(encoder)[coder]at (0,0){\large{编码器}};
+\node(decoder_1)[coder,above of =encoder,xshift=-1.6cm,yshift=2.8cm,fill=blue!20]{\large{解码器}};
+\node(decoder_2)[coder,above of =encoder, xshift=1.6cm,yshift=2.8cm,fill=yellow!20]{\large{解码器}};
+\node(s)[below of = encoder,yshift=-1.8cm,scale=1.6]{$s$};
+\node(x)[above of = decoder_1,yshift=1.8cm,scale=1.6]{$x$};
+\node(y)[above of = decoder_2,yshift=1.8cm,scale=1.6]{$y$};
+\draw[->,thick](s.north)to(encoder.south);
+\draw[->,thick](decoder_1.north)to(x.south);
+\draw[->,thick](decoder_2.north)to(y.south);
+\draw[->,thick](encoder.north)--([yshift=0.7cm]encoder.north)--([xshift=-4.16em,yshift=0.7cm]encoder.north)--(decoder_1.south);
+\draw[->,thick](encoder.north)--([yshift=0.7cm]encoder.north)--([xshift=4.16em,yshift=0.7cm]encoder.north)--(decoder_2.south);
+\node [anchor=north](pos1) at (s.south) {(a) 单编码器-双解码器方式};
+%%%%%%%%%%%%%%%%%%%%%%%%级联
+\node(encoder-2)[coder]at ([xshift=10.0em]encoder.east){\large{编码器}};
+\node(decoder_1-2)[coder,above of =encoder-2,yshift=1.4cm,fill=blue!20]{\large{解码器}};
+\node(decoder_2-2)[coder,above of =decoder_1-2, yshift=1.4cm,fill=yellow!20]{\large{解码器}};
+\node(s-2)[below of = encoder-2,yshift=-1.8cm,scale=1.6]{$s$};
+\node(y-2)[above of = decoder_2-2,yshift=1.8cm,scale=1.6]{$y$};
+\draw[->,thick](s-2.north)to(encoder-2.south);
+\draw[->,thick](encoder-2.north)to(decoder_1-2.south);
+\draw[->,thick](decoder_1-2.north)to(decoder_2-2.south);
+\draw[->,thick](decoder_2-2.north)to(y-2.south);
+\node [anchor=north](pos2) at (s-2.south) {(b) 级联编码器方式};
+%%%%%%%%%%%%%%%%%%%%%%%%联合
+\node(encoder-3)[coder]at([xshift=10.0em]encoder-2.east){\large{编码器}};
+\node(decoder_1-3)[coder,above of =encoder-3,xshift=-1.6cm,yshift=2.8cm,fill=blue!20]{\large{解码器}};
+\node(decoder_2-3)[coder,above of =encoder-3, xshift=1.6cm,yshift=2.8cm,fill=yellow!20]{\large{解码器}};
+\node(s-3)[below of = encoder-3,yshift=-1.8cm,scale=1.6]{$s$};
+\node(y-3)[above of = decoder_2-3,yshift=1.8cm,scale=1.6]{$y$};
+\draw[->,thick](s-3.north)to(encoder-3.south);
+\draw[->,thick](decoder_1-3.east)to(decoder_2-3.west);
+\draw[->,thick](decoder_2-3.north)to(y-3.south);
+\draw[->,thick](encoder-3.north)--([yshift=0.7cm]encoder-3.north)--([xshift=-4.16em,yshift=0.7cm]encoder-3.north)--(decoder_1-3.south);
+\draw[->,thick](encoder-3.north)--([yshift=0.7cm]encoder-3.north)--([xshift=4.16em,yshift=0.7cm]encoder-3.north)--(decoder_2-3.south);
+\node [anchor=north](pos3) at (s-3.south) {(c) 联合编码器方式};
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-twodecoding.tex
+++ b/Chapter17/Figures/figure-twodecoding.tex
+\tikzstyle{encoder} = [rectangle,thick,rounded corners,minimum width=2.3cm,minimum height=1.4cm,text centered,draw=black!70,fill=red!25]
+\tikzstyle{decoder} = [rectangle,thick,rounded corners,minimum width=2.3cm,minimum height=1.4cm,text centered,draw=black!70,fill=blue!15]
+\tikzstyle{attention} = [rectangle,thick,rounded corners,minimum width=2.6cm,minimum height=0.9cm,text centered,draw=black!70,fill=green!25]
+\begin{tikzpicture}[node distance = 0,scale = 1]
+\tikzstyle{every node}=[scale=1]
+\node(encoder_left)[encoder]{\large{编码器}};
+\node(encoder_right)[encoder, right of = encoder_left, xshift=3cm]{\large{编码器}};
+\node(decoder_left)[decoder, above of = encoder_left, yshift=2.7cm]{\large{解码器}};
+\node(decoder_right)[decoder, above of = encoder_right, yshift=2.7cm]{\large{解码器}};
+\node(text_left)[below of = encoder_left, yshift=-2.2cm]{\large{前文}};
+\node(text_right)[below of = encoder_right, yshift=-2.2cm]{\large{源语}};
+\node(text_top)[above of = decoder_right, yshift=2cm]{\large{句子级翻译结果}};
+\node(title_1)[above of = text_top, xshift=-1.5cm, yshift=1.3cm]{\large\bfnew{一阶段解码}};
+\node(ground2)[rectangle,very thick,rounded corners,minimum width=5cm,minimum height=5.3cm,right of = encoder_right,xshift=6cm,yshift=1.4cm,draw=black,dashed]{};
+\node(ground1)[rectangle,thick,rounded corners,minimum width=3.3cm,minimum height=4.5cm,right of = encoder_right,xshift=5.5cm,yshift=1.4cm,draw=black,fill=yellow!15]{};
+\node(attention_below)[attention, right of = encoder_right, xshift=5.5cm]{\large{注意力机制}};
+\node(attention_above)[attention, above of = attention_below, yshift=1.4cm]{\large{注意力机制}};
+\node(ffn)[attention, above of = attention_above, yshift=1.4cm, fill=blue!8]{\large{前馈神经网络}};
+\node(n)[right of = attention_above, xshift=2.4cm,scale=1.5]{$\times N$};
+\node(text_2)[above of = ffn, yshift=1.9cm]{\large{上下文修正结果}};
+\node(title_2)[above of = text_2, xshift=0.5cm,yshift=1.3cm]{\large\bfnew{二阶段解码}};
+\node(text_rright)[right of = text_right, xshift=5.5cm]{\large{句子级翻译结果}};
+\draw[->,very thick]([yshift=0.2cm]text_left.north)to(encoder_left.south);
+\draw[->,very thick]([yshift=0.2cm]text_right.north)to(encoder_right.south);
+\draw[->,very thick](encoder_left.north)to(decoder_left.south);
+\draw[->,very thick](encoder_right.north)to(decoder_right.south);
+\draw[->,very thick](decoder_right.north)to([yshift=-0.1cm]text_top.south);
+\draw[->,very thick]([yshift=0.2cm]text_rright.north)to(attention_below.south);
+\draw[->,very thick](attention_below.north)to(attention_above.south);
+\draw[->,very thick](attention_above.north)to([yshift=-0.05cm]ffn.south);
+\draw[->,very thick](ffn.north)to([yshift=-0.05cm]text_2.south);
+\draw[-,very thick,dashed]([xshift=2cm,yshift=-0.2cm]text_right.east)to([xshift=2cm,yshift=9cm]text_right.east);
+\draw[-,very thick]([yshift=0.5cm]encoder_left.north)--([yshift=0.5cm,xshift=4.5cm]encoder_left.north)--([xshift=-2.68cm]attention_below.west)--(attention_below.west);
+\draw[-,very thick](decoder_left.north)--([yshift=0.5cm]decoder_left.north)--([yshift=0.5cm,xshift=4.7cm]decoder_left.north)--([xshift=-2.48cm]attention_above.west)--(attention_above.west);
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/chapter17.tex
+++ b/Chapter17/chapter17.tex
@@ -15,16 +15,660 @@
 \renewcommand\figurename{图}%将figure改为图
 \renewcommand\tablename{表}%将figure改为图
-\chapterimage{fig-NEU-8.jpg} % Chapter heading image
+\chapterimage{fig-NEU-5.jpg} % Chapter heading image
 %----------------------------------------------------------------------------------------
 %	CHAPTER 17
 %----------------------------------------------------------------------------------------
-\chapter{神经机器翻译实践}
+\chapter{多模态、多层次机器翻译}
+\parinterval 基于上下文的翻译是机器翻译的一个重要分支。传统方法中，机器翻译通常被定义为对一个句子进行翻译的问题。但是，现实中每句话往往不是独立出现的。比如，人们会使用语音进行表达，或者通过图片来传递信息，这些语音和图片内容都可以伴随着文字一起出现在翻译场景中。此外，句子往往存在于段落或者篇章之中，如果要理解这个句子，也需要整个段落或者篇章的信息。而这些上下文都是机器翻译可以利用的。
+\parinterval 本章在句子级翻译的基础上将问题扩展为更大上下文中的翻译，具体包括：图像翻译、语音翻译、篇章翻译三个主题。这些问题均为机器翻译应用中的真实需求。同时，使用多模态等信息也是当下自然语言处理的热点方向之一。
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+\section{机器翻译需要更多的上下文}
+\parinterval 长期以来，机器翻译的任务都是指句子级翻译。主要原因在于，句子级的翻译建模可以大大简化问题，使得机器翻译方法更容易进行实践和验证。但是人类使用语言的过程并不是孤立在一个个句子上进行的。这个问题可以类比于我们学习语言的过程：小孩成长过程中会接受视觉、听觉、触觉等多种信号，这些信号的共同作用使得他们产生对客观世界的“认识”，同时促使其使用“语言”进行表达。从这个角度说，语言能力并不是由单一因素形成的，它往往伴随着其他信息的相互作用，比如，当我们翻译一句话的时候，会用到看到的画面、听到的语调、甚至前面说过句子中的信息。
+\parinterval 从广义上讲，当前句子以外的信息都可以被看作是一种上下文。比如，图XXX中，需要把英语句子“XXX”翻译为汉语。但是，其中的“bank”有多个含义，因此仅仅使用英语句子本身的信息可能会将其翻译为“银行”，而非正确的译文“河床”。但是，图XXX中也提供了这个英语句子所对应的图片，显然图片中直接展示了河床，这时“bank”是没有歧义的。通常也会把这种使用图片和文字一起进行机器翻译的任务称作多模态机器翻译（参考文献）。
+\parinterval 图图
+\parinterval 所谓模态（Modality）是指某一种信息来源。例如，视觉、听觉、嗅觉、味觉都可以被看作是不同的模态。因此视频、语音、文字等都可以被看作是承载这些模态的媒介。在机器翻译中使用多模态这个概念，更多是为了区分某些不同于文字的信息。除了图像等视觉模态信息，机器翻译也可以利用语音模态信息。比如，直接对语音进行翻译，甚至直接用语音表达出翻译结果。
+\parinterval 此外，除了不同信息源所引入的上下文，机器翻译也可以利用文字本身的上下文。比如，翻译一篇文章中的某个句子时，可以根据整个篇章的内容进行翻译。显然这种篇章的语境是有助于机器翻译的。在本章后面的内容中，会就机器翻译中使用不同上下文（多模态和篇章信息）的方法展开讨论。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
 %----------------------------------------------------------------------------------------
+\section{语音翻译}
+\parinterval 语音，是人类日常生活与交流中最常用的一种信息载体。从日常聊天、国际旅游，到国际会议、跨国合作，对于语言进行翻译的需求不断增加。甚至在有些场景下，用语音进行交互要比用文本进行交互频繁的多。因此，{\small\bfnew{语音翻译}}\index{语音翻译}（Speech Translation）\index{Speech Translation}也成为了语音处理和机器翻译相结合的重要产物。根据目标语言的载体类型，可以将语音翻译分为{\small\bfnew{语音到文本翻译}}\index{语音到文本翻译}（Speech-to-Text Translation）\index{Speech-to-Text Translation}和{\small\bfnew{语音到语音翻译}}（Speech-to-Speech Translation）\index{Speech-to-Speech Translation}；基于翻译的实时性，还可以分为{\small\bfnew{实时语音翻译}}\index{实时语音翻译}（即同声传译，Simultaneous Translation）\index{Simultaneous Translation}和{\small\bfnew{离线语音翻译}}（Offline speech translation）\index{Offline speech translation}。本节主要关注离线语音到文本翻译方法（简称为语音翻译），分别从音频处理、级联语音翻译和端到端语音翻译进行介绍。
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{音频处理}
+\parinterval 不同于文本，音频本质上是经过若干信号处理之后的{\small\bfnew{波形}}（Waveform）\index{Waveform}。具体来说，声音是一种空气的震动，因此可以被转换为模拟信号。模拟信号是一段连续的信号，经过采样变为离散数字信号。采样是每隔固定的时间记录一下声音的振幅，采样率表示每秒的采样点数，单位是赫兹（Hz）。采样率越高，结果的损失则越小。通常来说，采样的标准是能够通过离散化的数字信号重现原始语音。我们日常生活中使用的手机和电脑设备的采样率一般为16kHz，表示每秒16000个采样点；而音频CD的采样率可以达到44.1kHz。经过进一步的量化，将采样点的值转换为整型数值保存，从而减少占用的存储空间，通常采用的是16位量化。将采样率和量化位数相乘，就可以得到{\small\bfnew{比特率}}\index{比特率}（Bits Per Second，BPS）\index{Bits Per Second}，表示音频每秒占用的位数。16kHz采样率和16位量化的音频，比特率为256kb/s。整体流程如图\ref{fig:17-1}所示\upcite{洪青阳2020语音识别原理与应用,陈果果2020语音识别实战}。
+%----------------------------------------------------------------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter17/Figures/figure-audio-processing}
+\caption{音频处理过程}
+\label{fig:17-1}
+\end{figure}
+%----------------------------------------------------------------------------------------------------
+\parinterval 经过上面的描述，音频的表示实际上是一个非常长的采样点序列，这导致了直接使用现有的深度学习技术处理音频序列较为困难。并且，原始的音频信号中可能包含着较多的噪声、环境声或冗余信息也会对模型产生干扰。因此，一般会对音频序列进行处理来提取声学特征，具体为将长序列的采样点序列转换为短序列的特征向量序列，再用于下游系统模块。虽然已有一些工作不依赖特征提取，直接在原始的采样点序列上进行声学建模和模型训练\upcite{DBLP:conf/interspeech/SainathWSWV15}，但目前的主流方法仍然是基于声学特征进行建模\upcite{DBLP:conf/icassp/MohamedHP12}。
+\parinterval 声学特征提取的第一步是预处理。其流程主要是对音频进行预加重、分帧和加窗。预加重用来提升音频信号中的高频部分，目的是使频谱更加平滑。分帧是基于短时平稳假设，即根据生物学特征，语音信号是一个缓慢变化的过程，10ms~30ms的信号片段是相对平稳的。基于这个假设，一般将每25ms作为一帧来提取特征，这个时间称为{\small\bfnew{帧长}}\index{帧长}（Frame Length）\index{Frame Length}。同时，为了保证不同帧之间的信号平滑性，使每两个相邻帧之间存在一定的重合部分。一般每隔10ms取一帧，这个时长称为{\small\bfnew{帧移}}\index{帧移}（Frame Shift）\index{Frame Shift}。为了缓解分帧带来的频谱泄漏，对每帧的信号进行加窗处理使其幅度在两段渐变到0，一般采用的是{\small\bfnew{汉明窗}}\index{汉明窗}（Hamming）\index{Hamming}。
+%----------------------------------------------------------------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter17/Figures/figure-framing-schematic}
+\caption{分帧原理图}
+\label{fig:17-2}
+\end{figure}
+%----------------------------------------------------------------------------------------------------
+\parinterval 经过了上述的预处理操作，可以得到音频对应的帧序列，之后通过不同的操作来提取不同类型的声学特征。常用的声学特征包括{\small\bfnew{Mel频率倒谱系数}}\index{Mel频率倒谱系数}（Mel-Frequency Cepstral Coefficient, MFCC）\index{Mel-Frequency Cepstral Coefficient}、{\small\bfnew{感知线性预测系数}}\index{感知线性预测系数}（Perceptual Lienar Predictive, PLP）\index{Perceptual Lienar Predictive}、{\small\bfnew{滤波器组}}\index{滤波器组}（Filter-bank, Fbank）\index{Filter-bank}等。MFCC、PLP和Fbank特征都需要对预处理后的音频做{\small\bfnew{短时傅里叶变换}}\index{短时傅里叶变换}（Short-time Fourier Tranform, STFT）\index{Short-time Fourier Tranform}，得到具有规律的线性分辨率。之后再经过特定的操作，得到各种声学特征。不同声学特征的特点是不同的，MFCC去相关性较好，PLP抗噪性强，FBank可以保留更多的语音原始特征。在语音翻译中，比较常用的声学特征为FBank或MFCC\upcite{洪青阳2020语音识别原理与应用}。
+\parinterval 某种程度上讲，提取到的声学特征可以理解计算机视觉中的像素特征，或者自然语言处理中的词嵌入表示。不同之处在于，声学特征更加复杂多变，可能存在着较多的噪声和冗余信息。此外，相比对应的文字序列，音频提取到的特征序列长度要大十倍以上。比如，人类正常交流中每秒钟一般可以说2-3个字，而每秒钟的语音可以提取得到100帧的特征序列。巨大的长度比差异也为语音翻译中对声学特征建模带来了困难。
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{级联式语音翻译}
+\parinterval 实现语音翻译最简单的思路是基于级联的方式，即：先通过{\small\bfnew{自动语音识别}}\index{自动语音识别}（Automatic Speech Recognition，ASR）\index{Automatic Speech Recognition}系统将语音识别为源语言文本，然后利用机器翻译系统将源语言文本翻译为目标语言文本。这种做法的好处在于语音识别和机器翻译模型可以分别进行训练，有很多数据资源以及成熟技术可以分别运用到两个系统中。因此，级联语音翻译是很长时间以来的主流方法，深受工业界的青睐。级联语音翻译主要的流程如图\ref{fig:17-3}所示。
+%----------------------------------------------------------------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter17/Figures/figure-cascading-speech-translation}
+\caption{级联语音翻译}
+\label{fig:17-3}
+\end{figure}
+%----------------------------------------------------------------------------------------------------
+\parinterval 由于声学特征提取在上一节中已经进行了描述，而且文本翻译可以直接使用本书介绍的统计机器翻译或者神经机器翻译。因此下面简要介绍一下语音识别模型，以便读者对级联式语音翻译系统有一个完整的认识。
+\parinterval 传统的语音识别模型和统计机器翻译相似，需要利用声学模型、语言模型和发音词典联合进行识别，系统较为复杂\upcite{DBLP:journals/ftsig/GalesY07,DBLP:journals/taslp/MohamedDH12,DBLP:journals/spm/X12a}。而近些年来，随着神经网络的发展，基于神经网络的端到端语音识别模型逐渐成为主流，大大简化了训练流程\upcite{DBLP:conf/nips/ChorowskiBSCB15,DBLP:conf/icassp/ChanJLV16}。目前的端到端语音识别模型主要基于序列到序列结构，编码器根据输入的声学特征进一步提取高级特征，解码器根据编码器提取的特征识别对应的文本。在后文中即将介绍的端到端语音翻译模型也是使用十分相似的结构。因此，从某种意义上说，语音识别和翻译的端到端方法与神经机器翻译是一致的。
+\parinterval 语音识别目前广泛使用基于Transformer的模型结构（见{\chaptertwelve}），如图\ref{fig:17-2-3}所示。可以看出，相比文本翻译，模型结构上唯一的区别在于编码器的输入为声学特征，以及编码器底层会使用额外的卷积层来减小输入序列的长度，从而降低长序列带来的显存占用以及建模困难。通过大量的语音-标注平行数据对模型进行训练，可以得到高质量的语音识别模型。
+%----------------------------------------------------------------------------------------------------
+\begin{figure}[htp]
+\centering
+\caption{基于Transformer的语音识别模型}
+\label{fig:17-2-3}
+\end{figure}
+%----------------------------------------------------------------------------------------------------
+\parinterval 级联语音翻译模型利用翻译模型将语音识别结果翻译为目标语言文本，但存在的一个问题是语音识别模型只输出One-best，其中可能存在一些识别错误，这些错误在翻译过程中会被放大，导致最终翻译结果偏离原本意思，也就是错误传播问题。传统级联语音模型的一个主要方向是丰富语音识别模型的预测结果，为翻译模型提供更多的信息，具体做法是在语音识别模型中，声学模型解码得到{\small\bfnew{词格}}\index{词格}（Word Lattice）\index{Word Lattice}来取代One-best识别结果。词格是一种有向无环图，包含单个起点和终点，图中的每条边记录了每个词和对应的转移概率信息，如图\ref{fig:17-2-4}所示。
+%----------------------------------------------------------------------------------------------------
+\begin{figure}[htp]
+\centering
+\caption{词格示例}
+\label{fig:17-2-4}
+\end{figure}
+%----------------------------------------------------------------------------------------------------
+\parinterval 可以看出，词格可以保存多条搜索路径，路径中保存了输入序列的时间信息以及解码过程，翻译模型基于更丰富的词格信息进行翻译，可以降低语音识别模型带来的误差\upcite{DBLP:conf/acl/ZhangGCF19,DBLP:conf/acl/SperberNPW19}。但在端到端语音识别模型中，一般基于束搜索方法进行解码，且解码序列的长度与输入序列并不匹配，相比传统声学模型解码丢失了语音的时间信息，因此这种基于词格的方法主要集中在传统语音识别模型上和端到端文本翻译模型上。
+\parinterval 为了错误传播问题带来的影响，一种思路是通过一个后处理模型修正识别结果中的错误，再送给文本翻译模型进行翻译。这一做法在工业界得到了广泛应用，但由于每个模型只能串行地计算，也会带来额外的计算代价以及运算时间。另外一种思路是训练鲁棒的文本翻译模型，使其可以处理输入中存在的噪声或误差\upcite{DBLP:conf/acl/LiuTMCZ18}。随着技术的不断发展，如何利用单个模型实现语音翻译成为了人们关注的热点，也就是端到端语音翻译，我们在下一节中进行介绍。
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{端到端语音翻译}
+\parinterval 尽管级联语音翻译模型可以利用语音识别和文本翻译模型来得到语音对应的翻译结果，但不可避免地存在一些缺陷：
+%----------------------------------------------------------------------------------------------------
+\begin{itemize}
+    \vspace{0.5em}
+    \item 错误传播问题。级联模型导致的一个很严重的问题在于，语音识别模型得到的文本如果存在错误，这些错误很可能在翻译过程中被放大，从而使最后翻译结果出现比较大的误差。比如识别时在句尾少生成了个“吗”，会导致翻译模型将疑问句翻译为陈述句。
+    \vspace{0.5em}
+    \item 翻译效率问题。由于需要语音识别模型和文本标注模型只能串行地计算，翻译效率相对较低，而实际很多场景中都需要达到低延时的翻译。
+    \vspace{0.5em}
+    \item 语音中的副语言信息丢失。将语音识别为文本的过程中，语音中包含的语气、情感、音调等信息会丢失，而同一句话在不同的语气中表达的意思很可能是不同的，导致翻译出现偏差。
+    \vspace{0.5em}
+\end{itemize}
+%----------------------------------------------------------------------------------------------------
+\parinterval 针对级联语音翻译模型存在的缺陷，研究者们提出了{\small\bfnew{端到端的语音翻译模型}}\index{端到端的语音翻译模型}（End-to-End Speech Translation, E2E-ST）\index{End-to-End Speech Translation}\upcite{DBLP:conf/naacl/DuongACBC16,DBLP:conf/interspeech/WeissCJWC17,DBLP:journals/corr/BerardPSB16}，也就是模型的输入是一条语音，输出是对应的目标语言文本。相比级联模型，端到端模型有如下优点：
+%----------------------------------------------------------------------------------------------------
+\begin{itemize}
+    \vspace{0.5em}
+    \item 端到端模型不需要多阶段的生成，因此避免了错误传播问题。
+    \vspace{0.5em}
+    \item 同样地，端到端模型相比级联模型可以减少将近一半的模型参数，翻译效率可以得到明显提升。
+    \vspace{0.5em}
+    \item 由于端到端模型语音信号可以直接作用于翻译过程，因此可以使得副语言信息得以体现。
+    \vspace{0.5em}
+\end{itemize}
+%----------------------------------------------------------------------------------------------------
+\parinterval 因此，端到端模型收到了研究人员的关注。目前比较火热的，基于Transformer的语音翻译模型架构如图\ref{fig:17-2-5}所示（下文中语音翻译模型均指端到端的模型）。该模型采用的也是序列到序列架构，编码器的输入是从语音中提取的特征（比如FBank特征）。由于语音对应的特征序列过长，在计算Attention的时候，会占用大量的内存/显存，从而降低计算效率，过长的序列也会增加模型训练的难度。因此，通常会先对语音特征做一个下采样，缩小语音的序列长度。目前一个常用的做法，是在输入的语音特征上进行两层步长为2的卷积操作，从而将输入序列的长度缩小为之前的1/4。之后的流程和标准的机器翻译是完全一致的，编码器对语音特征进行编码，解码器根据编码表示生成目标语言的翻译结果。
+%----------------------------------------------------------------------------------------------------
+\begin{figure}[htp]
+\centering
+\caption{基于Transformer的端到端语音翻译模型}
+\label{fig:17-2-5}
+\end{figure}
+%----------------------------------------------------------------------------------------------------
+\parinterval 虽然端到端语音翻译模型解决了级联模型存在的问题，但同时也面临着两个严峻的问题：
+%----------------------------------------------------------------------------------------------------
+\begin{itemize}
+    \vspace{0.5em}
+    \item 训练数据稀缺。虽然语音识别和文本翻译的训练数据都很多，但是直接由语音到翻译的数据十分有限，因此端到端语音翻译天然地就是一种低资源翻译任务。
+    \vspace{0.5em}
+    \item 建模复杂度更高。在语音识别中，模型是学习如何生成语音对应的文字序列，输入和输出的对齐比较简单，不涉及到调序的问题。在文本翻译中，学习如何生成源语言序列对应的目标语言序列，仅需要学习不同语言之间的映射，不涉及到模态的转换。而语音翻译模型需要学习从语音到目标语言文本的生成，任务更加复杂。
+    \vspace{0.5em}
+\end{itemize}
+%----------------------------------------------------------------------------------------------------
+\parinterval 针对这两个问题，研究人员们也提出了很多方法进行缓解，包括多任务学习、迁移学习等，主要思想都是利用语音识别或文本翻译数据来指导语音模型学习。并且，文本翻译中的很多方法和思想都对语音翻译技术的发展提供了思路。如何将其他领域现有的工作在语音翻译任务上验证，并针对语音这一信息载体进行特定的建模适应，是语音翻译任务当前的研究重点\upcite{DBLP:conf/mtsummit/GangiNCDT19}。
+%----------------------------------------------------------------------------------------------------
+\noindent{\small\bfnew{1）多任务学习}}
+\parinterval 针对语音翻译模型建模复杂度较高问题，常用的一个方法是进行多任务学习，使模型在训练过程中有更多的监督信息，从而使模型收敛地更加充分。语音语言中多任务学习主要借助语音对应的标注信息，也就是源语言文本。{\small\bfnew{连接时序分类}}\index{连接时序分类}（Connectionist Temporal Classification，CTC）\index{Connectionist Temporal Classification}\upcite{DBLP:conf/icml/GravesFGS06}是语音处理中最简单有效的一种多任务学习方法\upcite{DBLP:journals/jstsp/WatanabeHKHH17,DBLP:conf/icassp/KimHW17}，也被广泛应用于文本识别任务中\upcite{DBLP:journals/pami/ShiBY17}。CTC可以将输入序列的每一位置都对应到标注文本中，学习语音和文字之间的软对齐关系。比如，对于下面的音频序列，CTC可以将每个位置分别对应到同一个词。需要注意的是，CTC会额外新增一个词$\epsilon$，类似于一个空白词，表示这个位置没有声音或者没有任何对应的预测结果。然后，将相同且连续的词合并，去除$\epsilon$，就可以得到预测结果，如图\ref{fig:17-2-6}所示。
+%----------------------------------------------------------------------------------------------------
+\begin{figure}[htp]
+\centering
+\caption{CTC预测单词序列示例}
+\label{fig:17-2-6}
+\end{figure}
+%----------------------------------------------------------------------------------------------------
+\parinterval CTC具备的以下特性使其可以很好的完成输入输出之间的对齐。
+%----------------------------------------------------------------------------------------------------
+\begin{itemize}
+    \vspace{0.5em}
+    \item 输入输出之间的对齐是单调的。也就是后面的输入只会预测与前面的序列相同或后面的输出内容。比如对于上面的例子，如果输入的位置t已经预测了字符r，那么t之后的位置不会再预测前面的字符w和o。
+    \vspace{0.5em}
+    \item 输入和输出之间是多对一的关系。也就是多个输入会对应到同一个输出上。这对于语音序列来说是非常自然的一件事情，由于输入的每个位置只包含非常短的语音特征，因此多个输入才可以对应到一个输出字符。
+    \vspace{0.5em}
+\end{itemize}
+%----------------------------------------------------------------------------------------------------
+\parinterval 将CTC应用到语音翻译中的方法非常简单，只需要在编码器的顶层加上一个额外的输出层即可。通过这种方式，不需要增加过多的额外参数，就可以给模型加入一个较强的监督信息，提高模型的收敛性。
+%----------------------------------------------------------------------------------------------------
+\begin{figure}[htp]
+\centering
+\caption{基于CTC的语音翻译模型}
+\label{fig:17-2-7}
+\end{figure}
+%----------------------------------------------------------------------------------------------------
+\parinterval 另外一种多任务学习的思想是通过两个解码器，分别预测语音对应的源语言句子和目标语言句子，具体有图\ref{fig:17-9}展示的三种方式\upcite{DBLP:conf/naacl/AnastasopoulosC18,DBLP:conf/asru/BaharBN19}。图\ref{fig:17-9}(a)中采用单编码器-双解码器的方式，两个解码器根据编码器的表示，分别预测源语言句子和目标语言句子，从而使编码器训练地更加充分。这种做法的好处在于仅仅增加了训练代价，解码时只需要生成目标语言句子即可。图\ref{fig:17-9}(b)则通过使用两个级联的解码器，先利用第一个解码器生成源语言句子，然后再利用第一个解码器的表示，通过第二个解码器生成目标语言句子。这种方法通过增加一个中间输出，降低了模型的训练难度，但同时也会带来额外的解码耗时，因为两个解码器需要串行地进行生成。图\ref{fig:17-9}(c)中模型更进一步，第二个编码器联合编码器和第一个解码器的表示进行生成，更充分地利用了已有信息。
+%----------------------------------------------------------------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter17/Figures/figure-three-ways-of-dual-decoder-speech-translation}
+\caption{双解码器语音翻译的三种方式}
+\label{fig:17-9}
+\end{figure}
+%----------------------------------------------------------------------------------------------------
+\noindent{\small\bfnew{2）迁移学习}}
+\parinterval 相比语音识别和文本翻译，端到端语音翻译的训练数据量要小很多，因此，如何利用其它数据来增加可用的数据量是语音翻译的一个重要方向。和文本翻译中的方法相似，一种思路是利用迁移学习或预训练，利用其他语言的双语数据预训练模型参数，然后迁移到目标语言任务上\upcite{DBLP:conf/naacl/BansalKLLG19}，或者是利用语音识别数据或文本翻译数据，分别预训练编码器和解码器参数，用于初始化语音翻译模型参数\upcite{DBLP:conf/icassp/BerardBKP18}。预训练的编码器对语音翻译模型的学习尤为重要\upcite{DBLP:conf/naacl/BansalKLLG19}，相比文本数据，语音数据的复杂性更高，如果仅从小规模语音翻译数据上学习很难学习充分。此外，模型对声学特征的学习与语言并不是强相关的，在其他语种预训练的编码器对模型学习也是有帮助的。
+\noindent{\small\bfnew{3）数据增强}}
+\parinterval 数据增强是增加训练数据最简单直观的一种方法。但是相比文本翻译中，可以利用回译的方法生成伪数据（见{\chaptersixteen}）。语音翻译正向翻译模型通过源语言语音生成目标语言文本，如果直接利用回译的思想，需要通过一个模型，将目标语文本翻译为目标语语音，但实际上这种模型是不能简单得到。因此，一个简单的思路是通过一个反向翻译模型和语音合成模型级联来生成伪数据\upcite{DBLP:conf/icassp/JiaJMWCCALW19}。另外，正向翻译模型生成的伪数据在文本翻译中也被验证了对模型训练有一定的帮助，因此同样可以利用语音识别和文本翻译模型，将源语言语音生成目标语言翻译，得到伪平行语料。
+%----------------------------------------------------------------------------------------------------
+\parinterval 此外，研究人员们还探索了很多其他方法来提高语音翻译模型的性能。利用在海量的无标注语音数据上预训练的{\small\bfnew{自监督}}\index{自监督}（Self-supervised）\index{Self-supervised}模型作为一个特征提取器，将从语音中提取的特征作为语音翻译模型的输入，可以有效提高模型的性能\upcite{DBLP:conf/interspeech/WuWPG20}。相比语音翻译模型，文本翻译模型任务更加简单，因此一种思想是利用文本翻译模型来指导语音翻译模型，比如通过知识蒸馏\upcite{DBLP:conf/interspeech/LiuXZHWWZ19}、正则化\upcite{DBLP:conf/emnlp/AlinejadS20}等方法。为了简化语音翻译模型的学习，可以通过课程学习的策略，使模型从语音识别任务，逐渐过渡到语音翻译任务，这种由易到难的训练策略可以使模型训练更加充分\upcite{DBLP:journals/corr/abs-1802-06003,DBLP:conf/acl/WangWLZY20}。
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+\section{图像翻译}
+\parinterval 人类所接受的信息中视觉信息的比重往往不亚于语言信息，甚至更多。视觉信息通常以图像的形式存在，近几年，结合图像的多模态机器翻译任务受到了广泛的研究。多模态机器翻译（图a）简单来说就是结合源语言和其他模态（例如图像等）的信息生成目标语言的过程。这种结合图像的机器翻译还是一种狭义上的“翻译”，它本质上还是从源语言到目标语言或者说从文本到文本的翻译。那么从图像到文本上（图b）的转换，例如，{\small\bfnew{图片描述生成}}\index{图片描述生成}（Image Captioning）\index{Image Captioning}，即给定图像生成与图像内容相关的描述，也可以被称为广义上的“翻译”，当然，这种广义上的翻译形式不仅仅包括图像到文本，还应该包括从图像到图像（图c），甚至是从文本到图像（图d）等等。这里将这些与图像相关的翻译任务统称为图像翻译。
+%----------------------------------------------------------------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{图像翻译任务}
+\label{tab:17-2-1-c}
+\end{table}
+%----------------------------------------------------------------------------------------------------
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{基于图像增强的文本翻译}
+\parinterval 在文本翻译中引入图像信息是最典型的多模态机器翻译任务。虽然多模态机器翻译还是一种从源语言文字到目标语言文字的转换，但是在转换的过程中，融入了其他模态的信息减少了歧义的产生。例如前文提到的通过与源语言相关的图像信息，将“A medium sized  child jumps off of a dusty bank”中“bank”译为“河岸”而不是“银行”，通过给定一张相关的图片，机器翻译模型就可以利用视觉信息更好的理解歧义词，避免产生歧义。换句话说，对于同一图像或者视觉场景的描述，源语言和目标语言描述的本质意义是一致的，只不过，体现在语言上会有表达方法上的差异。那么，图像就会存在一些源语言和目标语言的隐含对齐“约束”，将这种“约束”融入到机器翻译系统，会让模型加深对某些歧义词语上下文的理解，从而进一步提高机器翻译质量。
+\parinterval WMT机器翻译评测在2016年首次将融合图像和文本的多模态机器翻译作为机器翻译和跨语言图像描述的共享任务\upcite{DBLP:conf/wmt/SpeciaFSE16}，这项任务也受到了广泛的研究\upcite{DBLP:conf/wmt/CaglayanABGBBMH17,DBLP:conf/wmt/LibovickyHTBP16}。如何融入视觉信息，更好的理解多模态上下文语义是多模态机器翻译研究的热点，大体的研究方向包括基于特征融合的方法\upcite{DBLP:conf/emnlp/CalixtoL17,DBLP:journals/corr/abs-1712-03449,DBLP:conf/wmt/HelclLV18}、基于多任务学习的方法\upcite{DBLP:conf/ijcnlp/ElliottK17,DBLP:conf/acl/YinMSZYZL20}。接下来将从这两个方向，对多模态机器翻译的研究展开介绍。
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsubsection{1. 基于特征融合的方法}
+\parinterval 较为早期的研究工作通常将图像信息作为输入句子的一部分\upcite{DBLP:conf/emnlp/CalixtoL17,DBLP:conf/wmt/HuangLSOD16}，或者用其对编码器、解码器的状态进行初始化\upcite{DBLP:conf/emnlp/CalixtoL17,Elliott2015MultilingualID,DBLP:conf/wmt/MadhyasthaWS17}。如图2所示，对图像特征的提取通常是基于卷积神经网络，有关卷积神经网络的内容，请参考{\chaptereleven}内容。通过卷积神经网络得到全局视觉特征，在进行维度变换后，将其作为源语言输入的一部分或者初始化状态引入到模型当中。但是，这种图像信息的引入方式有以下两个缺点：
+\begin{itemize}
+    \vspace{0.5em}
+    \item 图像信息不全都是有用的，往往存在一些与源语言或目标语言无关的信息，作为全局特征会引入噪音
+    \vspace{0.5em}
+    \item 图像信息作为源语言的一部分或者初始化状态，间接参与目标语言单词的生成，在循环神经网络信息传递的过程中，图像信息会有一定的损失。
+    \vspace{0.5em}
+\end{itemize}
+%----------------------------------------------------------------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{建模全局的视觉特征方法}
+\label{tab:17-2-2-c}
+\end{table}
+%----------------------------------------------------------------------------------------------------
+\parinterval 说到噪音问题就不得不提到注意力机制的引入，前面章节中提到过这样的一个例子：
+\vspace{0.8em}
+\centerline{中午\ 没\ 吃饭\ ，\ 又\ 刚\ 打\ 了\ 一\ 下午\ 篮球\ ，\ 我\ 现在\ 很\ 饿\ ，\ 我\ 想\underline{\quad \quad \quad} 。}
+\vspace{0.8em}
+\parinterval 想在横线处填写“吃饭”，“吃东西”的原因是我们在读句子的过程中，关注到了“没/吃饭”，“很/饿”等关键息。这是在自然语言处理中注意力机制解决的问题，即对于要生成的目标语言单词时，相关性更高的源语言片段应该在源语言句子的表示中体现出来，而不是将所有的源语言单词一视同仁。同样的，注意力机制也用在多模态机器翻译中，即在生成目标单词时，对于图像而言，更应该关注与目标单词相关的图像部分，而弱化对其他部分的关注，这样就达到了降噪的目的，另外，注意力机制的引入，也使图像信息直接参与目标语言的生成，解决了在编码器中，图像信息传递损失的问题。
+%----------------------------------------------------------------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{目标词“bank”注意力机制前后对比}
+\label{tab:17-2-3-c}
+\end{table}
+%----------------------------------------------------------------------------------------------------
+\parinterval 那么，多模态机器翻译是如何计算上下文向量的呢？这里仿照第十章的内容给出具体解释(参考图10.19)：
+\parinterval 编码器输出的状态序列${\mathbi{h}_1,\mathbi{h}_2,...\mathbi{h}_m}$，m为状态序列的长度，需要注意的是，这里的状态序列不是源语言的状态序列，而是通过基于卷积循环网络提取到的图像的状态序列。假设图像的特征维度16×16×512，其中前两个维度分别表示图像的高和宽，这里会将图像的维度映射为256×512的状态序列，512为每个状态的维度，对于目标语位置$j$，上下文向量$\mathbi{C}_{j}$被定义为对序列的编码器输出进行加权求和，如下：
+\begin{eqnarray}
+\mathbi{C}_{j}&=& \sum_{i}{{\alpha}_{i,j}{\mathbi{h}}_{i}}
+\label{eq:17-2-1}
+\end{eqnarray}
+\noindent 其中，${\alpha}_{i,j}$是注意力权重，它表示目标语言第j个位置与图片编码状态序列第i个位置的相关性大小，计算方式与{\chapterten}描述的注意力函数一致。
+\parinterval 这里，将每个时间步编码器的输出$\mathbi{h}_{i}$看作源图像序列位置$i$的表示结果。图3说明了模型在生成目标词“man”时，图像经过注意力机制对图像区域关注度的可视化效果，可以看到，经过注意力机制后，模型更注重的是与目标词相关的图像部分。当然，多模态机器翻译的输入还包括源语言文字序列。通常，源语言文字对于翻译的作用比图像更大\upcite{DBLP:conf/acl/YaoW20}。从这个角度说，图像信息更多的是作为文字信息的补充，而不是替代。除此之外，注意力机制在多模态机器翻译中也有很多研究，不仅仅在解码器端将经过注意力机制的文本特征和视觉特征作为解码输入的一部分，还有的工作在编码端将源语言与图像信息进行注意力建模\upcite{DBLP:journals/corr/abs-1712-03449,DBLP:conf/acl/YaoW20}，得到更好的源语言特征表示。
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsubsection{2. 基于多任务学习的方法}
+\parinterval 基于多任务学习的方法通常是把翻译任务与其他视觉任务结合，进行联合训练。在{\chapterfifteen}和{\chaptersixteen}已经提到过多任务学习。一种常见的多任务学习框架是针对多个相关的任务，共享模型的部分参数来学习不同任务之间相似的部分，并通过特定的模块来学习每个任务特有的部分。在多模态机器翻译中，应用多任务学习的主要策略就是将翻译作为主任务，同时设置一些与其他模态相关的子任务，通过这些子任务来辅助源语言理解自身的语言知识。
+\parinterval 如图4所示，可以将多模态机器翻译任务分解为两个子任务：机器翻译和图片生成\upcite{DBLP:conf/ijcnlp/ElliottK17}。其中机器翻译作为主任务，图片生成作为子任务，图片生成这里指的是从一个图片描述生成对应图片，对于图片生成任务在后面叙述。通过单个编码器对源语言数据进行建模，然后通过两个解码器（翻译解码器和图像解码器）来学习翻译任务和图像生成任务。顶层任务学习每个任务的独立特征，底层共享参数层能够学习到更丰富的文本特征表示。另外在视觉问答领域有研究表明\upcite{DBLP:conf/nips/LuYBP16}，在多模态任务中，不宜引入多层的注意力，因为多层注意力会导致模型严重的过拟合，从另一角度来说，利用多任务学习的方式，提高模型的泛化能力，也是一种有效防止过拟合现象的方式。类似的思想，也大量使用在多模态自然语言处理中，例如图像描述生成、视觉问答\upcite{DBLP:conf/iccv/AntolALMBZP15}等。
+%----------------------------------------------------------------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{多模态机器翻译多任务学习的应用}
+\label{tab:17-2-4-c}
+\end{table}
+%----------------------------------------------------------------------------------------------------
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{图像到文本的翻译}
+\parinterval 图像到文本的转换也可以看作是广义上的翻译，简单来说，就是把源语言的形式替换成了图像。其中，图像描述生成是最典型的任务。虽然，这部分内容并不是本书的重点，不过为了保证多模态翻译内容的完整性，这里对相关技术进行简要介绍。图像描述生成是指给定图像生成文字描述，有时也被称作图说话、图像字幕生成。如何理解图像信息、在理解图像信息基础上生成描述是图像描述任务要解决的问题，可以发现，该任务涉及到自然语言处理和计算机视觉两个领域，是一项很有挑战的任务。同时，图像描述在图像检索、智能导盲、人机交互等领域有着广泛的应用场景，有很大的研究价值。
+%----------------------------------------------------------------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{图像描述传统方法}
+\label{tab:17-2-5-c}
+\end{table}
+%----------------------------------------------------------------------------------------------------
+\parinterval 传统图像描述生成有两种范式：基于检索的方法和基于模板的方法。其中基于检索的方法（图5左）是指在指定的图像描述候选句子中选择其中的句子作为图像的描述，这种方法的弊端是所选择的句子可能会和图像很大程度上不相符。而基于模板的方法（图5右）是指在图像上检测视觉特征，然后把内容填在实现设计好的模板当中，这种方法的缺点是生成的图像描述过于呆板，‘像是在一个模子中刻出来的’说的就是这个意思。近几年来 ，由于卷积神经网络在计算机视觉领域效果显著，而循环神经网络在自然语言处理领域卓有成效，受到机器翻译领域编码器-解码器框架的启发，逐渐的，这种基于卷积神经网络作为编码器编码图像，循环神经网络作为解码器解码描述的编码器-解码器框架成了图像描述任务的基础范式。本章节，从基础的图像描述范式编码器-解码器框架展开\upcite{DBLP:conf/cvpr/VinyalsTBE15,DBLP:conf/icml/XuBKCCSZB15}，从编码器的改进、解码器的改进展开介绍。  
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsubsection{1. 基础框架}
+\parinterval 受到神经机器翻译的启发，编码器-解码器框架也应用到图像描述任务当中。其中，编码器将输入的图像转换为一种新的“表示”形式，这种表示包含了输入图像的所有信息。之后解码器把这种“表示”重新转换为输出的描述。图XX中（上）是编码器-解码器框架在图像描述生成的应用\upcite{DBLP:conf/cvpr/VinyalsTBE15}。首先，通过卷积神经网络提取图像特征到一个合适的长度向量表示。然后，利用长短时记忆网络（LSTM）解码生成文字描述，这个过程中与机器翻译解码过程类似。这种建模方式存在一定的短板：生成的描述单词不一定需要所有的图像信息，将全局的图像信息送入模型中，可能会引入噪音，使这种“表示”形式不准确。针对这个问题，图XX（下）\upcite{DBLP:conf/icml/XuBKCCSZB15}为了弥补这种建模的局限性，引入了注意力机制。利用注意力机制在生成不同单词时，使模型不再只关注图像的全局特征，而是关注“应该”关注的图像特征。
+%----------------------------------------------------------------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{图像描述的编码器-解码器框架}
+\label{tab:17-2-6-c}
+\end{table}
+%----------------------------------------------------------------------------------------------------
+\parinterval 图像描述生成基本上沿用了编码器-解码器框架。接下来，分别从编码器端的改进和解码器端的改进展开介绍。这些改进总体来说是在解决以下两个问题：
+\begin{itemize}
+    \vspace{0.5em}
+    \item 在编码器端，如何更丰富、更全面的编码图像信息？
+    \vspace{0.5em}
+    \item 在解码器端，如何更好的利用编码器端的特征表示？
+    \vspace{0.5em}
+\end{itemize}
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsubsection{2. 编码器的改进}
+\parinterval 要想使编码器-解码器框架在图像描述中充分发挥作用，编码器也要更好的表示图像信息。对于编码器的改进，大多也是从这个方向出发。通常，体现在向编码器中添加图像的语义信息\upcite{DBLP:conf/cvpr/YouJWFL16,DBLP:conf/cvpr/ChenZXNSLC17,DBLP:journals/pami/FuJCSZ17}和位置信息\upcite{DBLP:conf/cvpr/ChenZXNSLC17,DBLP:conf/ijcai/LiuSWWY17}。
+\parinterval 图像的语义信息一般是指图像中存在的实体、属性、场景等等。如图XX所示，从图像中利用属性或实体检测器提取出“child”、“river”、“bank”等等的属性词和实体词作为图像的语义信息，提取全局的图像特征初始化循环神经网络，再利用注意力机制计算目标词与属性词或实体词之间的注意力权重，根据该权重计算上下文向量，从而将编码语义信息送入解码端\upcite{DBLP:conf/cvpr/YouJWFL16}，在解码‘bank’单词时，会更关注图像语义信息中的‘bank’。当然，除了图像中的实体和属性作为语义信息外，也可以将图片的场景信息也加入到编码器当中\upcite{DBLP:journals/pami/FuJCSZ17}。有关如何做属性、实体和场景的检测，涉及到目标检测任务的工作，例如Faster-RCNN\upcite{DBLP:journals/pami/RenHG017}、YOLO\upcite{DBLP:journals/corr/abs-1804-02767,DBLP:journals/corr/abs-2004-10934}等等,这里不过多赘述。
+%----------------------------------------------------------------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{编码器“显式”融入语义信息}
+\label{tab:17-2-6-c}
+\end{table}
+%----------------------------------------------------------------------------------------------------
+\parinterval 以上的方法大都是将图像中的实体、属性、场景等映射到文字上，并把这些信息显式地添加到编码器端。令一种方式，把图像中的语义特征隐式地作用到编码器端\upcite{DBLP:conf/cvpr/ChenZXNSLC17}。例如，可以图像数据可以分解为三个通道（红、绿、蓝），简单来说，就是将图像的每一个像素点按照红色、绿色、蓝色分成三个部分，这样就将图像分成了三个通道。在很多图像中，不同通道随伴随的特征是不一样的，可以将其作用于编码器端。另一种方法是基于位置信息的编码器增强。位置信息指的是图像中对象（物体）的位置。利用目标检测技术检测系统获得图中的对象和对应的特征，这样就确定了图中的对象位置。显然，这些信息也可以加入到编码端，以加强编码器的表示能力\upcite{DBLP:conf/eccv/YaoPLM18}。
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsubsection{3. 解码器的改进}
+\parinterval 由于解码器输出的是语言文字序列，因此需要考虑语言的特点对其进行改进。 例如，解码过程中， “the”,“on”，“at”这种介词或者冠词与图像的相关性较低，这时图像信息的引入就会产生负面影响\upcite{DBLP:conf/cvpr/LuXPS17}。因此，可以通过门等结构，控制视觉信号作用于文字生成的程度。另外,在解码过程中，生成的每个单词对应着图像的区域可能是不同的。因此也可以设计更为有效的注意力机制来捕捉解码端对不同图像局部信息的关注程度\upcite{DBLP:conf/cvpr/00010BT0GZ18}。 
+\parinterval 除了在解码端更好的使生成文本与图像特征相互作用以外，还有一些其他的解码器端改进的方向。例如：用其它结构（如卷积神经网络或者Transformer）代替解码器端循环神经网络\upcite{DBLP:conf/cvpr/AnejaDS18}。或者使用更深层的神经网络学习动词或者名词等视觉中不易表现出来的单词\upcite{DBLP:journals/mta/FangWCT18}，其思想与深层神经机器翻译模型有相通之处（{\chapterfifteen}）。
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{图像、文本到图像的翻译}
+\parinterval 当生成的目标对象是图像时，问题就变为了图像生成问题。虽然，这个领域本身并不属于机器翻译，但是其使用的基本方法与机器翻译有类似之处。二者也可以相互借鉴。因此，这里对图像生成问题也进行简要描述。
+\parinterval 计算机视觉领域，图像风格转移、图像语义分割、图像超分辨率等任务，都可以被视为{\small\bfnew{图像到图像的翻译}}\index{图像到图像的翻译}（Image-to-Image Translation）\index{Image-to-Image Translation}问题。与机器翻译类似，这些问题的共同目标是学习从一个对象到另一个对象的映射，只不过这里的对象是指图像，而非机器翻译中的文字。例如，给定物体的轮廓生成真实物体照片或者给定白天照片生成夜晚的照片等。图像到图像的翻译有广阔的应用场景，如图片补全、风格迁移等。
+\parinterval 对抗神经网络被广泛地应用再图像到图像的翻译任务当中\upcite{DBLP:conf/nips/GoodfellowPMXWOCB14,DBLP:conf/nips/ZhuZPDEWS17,DBLP:journals/corr/abs-1908-06616}。实际上，这类方法非常适合图像生成类的任务。简单来说，对抗生成网络包括两个部分分别是：生成器和判别器。基于输入生成器生成一个结果，而判别器要判别生成的结果和真实结果是否是相同的，对抗的思想是，通过强化生成器的生成能力和判别器的判别能力，当生成器生成的结果可以“骗”过判别器时，即判别器无法分清真实结果和生成结果，认为模型学到了这种映射关系。在图像到图像的翻译中，根据输入图像，生成器生成预测图像，判别器判别是否为目标图像，多次迭代后，生成图像被判别为目标图像时，则模型学习到了“翻译能力”。以上的工作都是有监督的，即基于对齐的图像对数据集，但是，这种数据的标注是极为费时费力的，所以有很多的工作也基于无监督的方法展开\upcite{DBLP:conf/iccv/ZhuPIE17,DBLP:conf/iccv/YiZTG17,DBLP:conf/nips/LiuBK17}，这里不过多赘述。
+\parinterval {\small\bfnew{文本到图像的翻译}}\index{文本到图像的翻译}（Text-to-Image Translation）\index{Text-to-Image Translation}是指给定描述物体颜色和形状等细节的一自然语言文字，生成对应的图像。该任务也可以看作是图像描述任务的逆任务。目前方法上大部分基于对抗神经网络\upcite{DBLP:conf/icml/ReedAYLSL16,DBLP:journals/corr/DashGALA17,DBLP:conf/nips/ReedAMTSL16}。基本流程为：首先利用自然语言处理技术提取出文本信息，然后再用文本特征作为后面生成图像的约束，在对抗神经网络中生成器（Generator）中根据文本特征生成图像的约束，从而别鉴别器（Discriminator）鉴定其生成效果。
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+\section{篇章级翻译}
+\parinterval 目前大多数机器翻译系统都是句子级的，这种系统的输入和输出都是以句子为单位，并基于句子之间相互独立的假设。然而句子级模型却缺少了对句子间上下文信息的建模，篇章级翻译则是用于解决该问题，进而改善机器翻译在整个篇章上的翻译质量。篇章级翻译的目的就是对句子间的上下文信息进行建模，改善机器翻译在整个篇章上的质量。篇章级翻译的概念在很早就已经被提出\upcite{DBLP:journals/ac/Bar-Hillel60}。随着近几年神经机器翻译取得了巨大进展，如何使用篇章级上下文信息成为进一步改善机器翻译质量的重要方向\upcite{DBLP:journals/corr/abs-1912-08494,DBLP:journals/corr/abs-1901-09115}。本节我们将主要从篇章级机器翻译的评价、建模方法等角度展开介绍。
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{什么是篇章级翻译}
+\parinterval “篇章”在这里指一系列连续的段落或者句子所构成的整体，其中各个句子间从形式和内容上都具有一定的连贯性和一致性\upcite{jurafsky2000speech}。这些联系主要体现在{\small\sffamily\bfseries{衔接}}\index{衔接}（Cohesion \index{Cohesion}）以及{\small\sffamily\bfseries{连贯}}\index{连贯}（Coherence \index{Coherence}）两个方面。其中衔接体现在显性的语言成分和结构上，包括篇章中句子间语法和词汇上的联系，而连贯体现在各个句子之间逻辑和语义上的联系。因此，篇章级翻译的目的就是要考虑到这些上下文之间的联系，从而生成相比句子级翻译更连贯和准确的翻译结果（如表\ref{tab:17-3-1}）。但是由于不同语言的特性多种多样，上下文信息在篇章级翻译中的作用也不尽相同。比如在德语中名词是分词性的，因此在代词翻译的过程中需要根据其先行词的词性进行区分，而这种现象在其它不区分词性的语言中是不存在的。这导致篇章级翻译在不同的语种中可能对应多种不同的上下文现象。
+%----------------------------------------------------------------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{篇章级翻译中时态一致性的问题}
+\label{tab:17-3-1}
+\end{table}
+%----------------------------------------------------------------------------------------------------
+\parinterval 正是由于这种上下文现象的多样性，使得篇章级翻译模型的性能评价相对困难。目前篇章级机器翻译主要针对一些常见上下文的现象，比如代词翻译、省略、连接和词汇衔接等，而{\chapterfour}介绍的BLEU等通用自动评价指标通常对这些上下文现象不敏感，篇章级翻译需要采用一些专用方法来对这些具体的现象进行评价。之前已经有一些研究工作针对具体的上下文现象提出了相应的评价标准并且在篇章级翻译中得到应用\upcite{DBLP:conf/naacl/BawdenSBH18,DBLP:conf/acl/VoitaST19}，但是目前并没有达成共识，这也在一定程度上阻碍了篇章级机器翻译的进一步发展。我们将在ref{sec:17-3-2}节中对这些评价标准进行介绍。
+\parinterval 从建模的角度来看，篇章级翻译需要引入额外的上下文信息，来解决上述上下文现象。在统计机器翻译时代就已经有一些相关工作，这些工作都是针对某一具体的上下文现象进行建模，比如篇章结构\upcite{DBLP:conf/anlp/MarcuCW00,foster2010translating,DBLP:conf/eacl/LouisW14}、代词回指\upcite{DBLP:conf/iwslt/HardmeierF10,DBLP:conf/wmt/NagardK10,DBLP:conf/eamt/LuongP16,}、词汇衔接\upcite{tiedemann2010context,DBLP:conf/emnlp/GongZZ11,DBLP:conf/ijcai/XiongBZLL13,xiao2011document}和篇章连接词\upcite{DBLP:conf/sigdial/MeyerPZC11,DBLP:conf/hytra/MeyerP12,}等。但是由于统计机器翻译本身流程复杂，依赖于许多组件和针对上下文现象所精心构造的特征，其建模方法相对比较困难。到了神经机器翻译时代，翻译质量相比统计机器翻译取得了大幅提升\upcite{DBLP:conf/nips/SutskeverVL14,bahdanau2014neural,vaswani2017attention}，这也鼓励研究人员进一步探索利用篇章上下文的信息\upcite{DBLP:conf/emnlp/LaubliS018}。近几年，相关工作不断涌现并且取得了一些阶段性进展\upcite{DBLP:journals/corr/abs-1912-08494}。
+\parinterval 
+区别于篇章级统计机器翻译，篇章级神经机器翻译通常采用直接对上下文句子进行建模的端到端的方式。这种方法不再需要针对某一具体的上下文现象构造相应的特征，而是通过翻译模型本身从上下文句子中抽取和融合相应的上下文信息。通常情况下，待翻译句子的上下文信息一般来自于近距离的上下文，篇章级机器翻译可以采用局部建模的手段将前一句或者周围几句作为上下文送入模型。针对长距离的上下文现象，也可以使用全局建模的手段直接从篇章所有其他句子中提取上下文信息。近几年多数研究工作都在探索更有效的局部建模或者全局建模的方法，主要包括改进输入\upcite{DBLP:conf/discomt/TiedemannS17,DBLP:conf/naacl/BawdenSBH18,DBLP:conf/wmt/GonzalesMS17,DBLP:journals/corr/abs-1910-07481}、多编码器结构\upcite{DBLP:journals/corr/JeanLFC17,DBLP:conf/acl/TitovSSV18,DBLP:conf/emnlp/ZhangLSZXZL18}、层次结构\upcite{DBLP:conf/emnlp/WangTWL17,DBLP:conf/emnlp/TanZXZ19,Werlen2018DocumentLevelNM,DBLP:conf/naacl/MarufMH19,DBLP:conf/acl/HaffariM18,DBLP:conf/emnlp/YangZMGFZ19,DBLP:conf/ijcai/ZhengYHCB20}以及基于缓存的方法\upcite{DBLP:conf/coling/KuangXLZ18,DBLP:journals/tacl/TuLSZ18}四类。
+\parinterval 此外，篇章级机器翻译面临的另外一个挑战是数据稀缺。篇章级机器翻译所需要的双语数据需要保留篇章边界，数量相比于句子级双语数据要少很多。除了在之前提到的端到端做法中采用预训练或者参数共享的手段（见{\chaptersixteen}），也可以采用另外的建模手段来缓解数据稀缺问题。比如在句子级翻译模型推断过程中，通过目标端篇章级语言模型\upcite{DBLP:conf/discomt/GarciaCE19,DBLP:journals/tacl/YuSSLKBD20,DBLP:journals/corr/abs-2010-12827}来引入上下文信息，或者对句子级的解码结果进行修正\upcite{DBLP:conf/aaai/XiongH0W19,DBLP:conf/acl/VoitaST19,DBLP:conf/emnlp/VoitaST19}。这种方法能够充分利用句子级的双语数据，并且在一定程度上缓解篇章级双语数据稀缺问题。
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{篇章级翻译的评价}\label{sec:17-3-2}
+\parinterval BLEU等自动评价指标能够在一定程度上反映译文的整体质量，但是并不能有效地评估篇章级翻译模型的性能。这是由于传统测试数据中出现篇章级上下文现象的比例相对较少，并且$n$-gram的匹配很难检测到一些具体的语言现象，这使得研究人员很难通过BLEU的涨幅来判断篇章级翻译模型的效果。
+\parinterval 为此，研究人员总结了篇章级机器翻译任务中存在的上下文现象，并基于此设计了相应的自动评价指标。比如针对代词翻译现象，首先使用词对齐寻找源语言中代词在译文和参考译文中的对应位置，然后通过计数计算最终的准确率和召回率等指标\upcite{DBLP:conf/iwslt/HardmeierF10,DBLP:conf/discomt/WerlenP17}。针对词汇衔接现象，使用{\small\sffamily\bfseries{词汇链}}\index{词汇链}（Lexical Chain\index{Lexical Chain}）\footnote{词汇链指篇章中语义相关的词所构成的序列}等来获取相应分数，然后通过加权平均的方式对BLEU和METEOR等指标进行扩展\upcite{DBLP:conf/emnlp/WongK12,DBLP:conf/discomt/GongZZ15}。{\red{针对篇章连接词}}，使用候选词典和词对齐对源语中连接词的正确翻译结果进行计数，计算其准确率\upcite{DBLP:conf/cicling/HajlaouiP13}。
+\parinterval 除了自动评价指标，也有一些研究人员针对特有的上下文现象手工构造了相应的测试套件。例如，可以采用对比测试的方式。测试集中每一个测试样例都包含一个正确翻译的结果，以及多个错误结果，一个理想的模型应该对正确的翻译评价最高，排名在所有错误答案之上,此时就可以通过模型是否能挑选出正确答案来评估其性能。这种方法可以很好地衡量模型在某一特定上下文现象上的处理能力，比如词义消歧\upcite{DBLP:conf/wmt/RiosMS18}、代词翻译\upcite{DBLP:conf/naacl/BawdenSBH18,DBLP:conf/wmt/MullerRVS18}和一些衔接问题\upcite{DBLP:conf/acl/VoitaST19}等。但是该方法缺点在于其使用范围受限于测试集的语种和规模，扩展性较差。
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{篇章级翻译的建模}
+\parinterval 篇章级神经机器翻译不再针对具体的上下文现象构造特征，而是对篇章上下文直接进行建模。在理想情况下，这种方法将以整个篇章为单位作为模型的输入和输出。然而由于现实中篇章对应的序列长度过长，因此直接对整个篇章对应序列建模难度很大，使得主流的序列到序列模型很难达到良好的效果甚至难以训练。一种思路是采用能够处理超长序列的模型对篇章信息建模，比如，使用{\chapterfifteen}中提到的处理长序列的Transformer模型就是针对该问题的一个有效的解决方法\upcite{DBLP:conf/iclr/KitaevKL20}。不过，这类模型并不针对篇章级翻译的具体翻译问题，因此并不是篇章级翻译中的主流方法。
+\parinterval 现在常见的端到端做法还是从句子级翻译出发，通过额外的模块来对篇章中的上下文句子进行抽象表示，然后提取相应的上下文信息并融入到当前句子的翻译过程中。形式上，篇章级翻译的建模方式如下：
+\begin{eqnarray}
+\funp{P}(\seq{Y}|\seq{X})&=&\prod_{i=1}^{T}{\funp{P}(Y_i|X_i,D_i)}
+\label{eq:17-3-1}\\
+D_i&\subseteq&\{X_{-i},Y_{-i}\} \label{eq:17-3-2}
+\end{eqnarray}
+其中$\seq{X}$和$\seq{Y}$分别为源语言篇章和目标语言篇章，$X_i$和$Y_i$分别为源语言篇章和目标语言篇章中的某个句子，$X_{-i}$和$Y_{-i}$分别为去掉第$i$个句子的源语言篇章和目标语言，$T$表示篇章中句子的数目\footnote{为了简化问题，我们假设源语言端和目标语言段具有相同的句子数目$T$}。$D_i$表示翻译第个句子时所对应的上下文句子集合，代表源语言篇章和目标语言篇章中其它的句子。考虑到不同的任务场景需求与模型的应用效率，篇章级神经机器翻译在建模的时候通常仅使用一部分作为上下文句子输入。对应的，篇章级神经机器翻译主要需要考虑两个问题：1）上下文范围的选取，比如上下文句子的多少\upcite{agrawal2018contextual,DBLP:conf/emnlp/WerlenRPH18,DBLP:conf/naacl/MarufMH19}，是否考虑目标端上下文句子\upcite{DBLP:conf/discomt/TiedemannS17,agrawal2018contextual}等；2）不同的上下文范围也对应着不同的建模方式，即如何从上下文句子中提取上下文信息，并且融入到翻译模型中。接下来将对一些典型的建模方法进行介绍，包括改进输入\upcite{DBLP:conf/discomt/TiedemannS17,DBLP:conf/naacl/BawdenSBH18,DBLP:conf/wmt/GonzalesMS17,DBLP:journals/corr/abs-1910-07481}、多编码器结构\upcite{DBLP:journals/corr/JeanLFC17,DBLP:conf/acl/TitovSSV18,DBLP:conf/emnlp/ZhangLSZXZL18}、层次结构\upcite{DBLP:conf/emnlp/WangTWL17,DBLP:conf/emnlp/TanZXZ19,Werlen2018DocumentLevelNM,DBLP:conf/naacl/MarufMH19,DBLP:conf/acl/HaffariM18,DBLP:conf/emnlp/YangZMGFZ19,DBLP:conf/ijcai/ZhengYHCB20}以及基于缓存的方法\upcite{DBLP:conf/coling/KuangXLZ18,DBLP:journals/tacl/TuLSZ1}。
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsubsection{1. 改进输入形式}
+\parinterval 一种简单的方法是直接复用传统的序列到序列模型，将上下文句子与当前句子拼接作为模型输入。如实例\ref{eg:17-3-1}所示，这种做法不需要改动模型结构，操作简单，适用于包括基于循环神经网络\upcite{DBLP:conf/discomt/TiedemannS17}和Transformer\upcite{agrawal2018contextual,DBLP:conf/discomt/ScherrerTL19}在内的神经机器翻译系统。但是由于过长的序列会导致模型难以训练，通常只会采用局部的上下文句子进行拼接，比如源语言端前一句或者周围几句\upcite{DBLP:conf/discomt/TiedemannS17}。同时，引入目标语言端的上下文\upcite{DBLP:conf/naacl/BawdenSBH18,agrawal2018contextual,DBLP:conf/discomt/ScherrerTL19}，比如在解码时拼接目标语言端上下文和当前句同样会带来一定的性能提升。但是过大的窗口在推断时会导致错误累计的问题\upcite{agrawal2018contextual}，因此通常只考虑目标语端的前一句。
+\begin{example}
+传统模型训练输入：
+\hspace{10em}源语言：你看到了吗？
+\hspace{10em}目标语言：Do you see them?
+\vspace{0.5em}
+\qquad\ 改进后模型训练输入：
+\hspace{10em}源语言：{\red{他们在哪？ <sep> }}你看到了吗？
+\hspace{10em}目标语言：Do you see them?
+\label{eg:17-3-1}
+\end{example}
+\parinterval 其他改进输入的做法相比于拼接的方法要复杂一些，首先需要对篇章进行处理，得到词汇链（Lexical Chain）\footnote{词汇链指篇章中语义相关的词所构成的序列}\upcite{DBLP:conf/wmt/GonzalesMS17}或者篇章嵌入\upcite{DBLP:journals/corr/abs-1910-07481}等信息，然后融入到当前句子的序列表示中，送入模型进行翻译。这种方式中上下文信息来自于预先提取的篇章表示，但是这种表示是否适合机器翻译还有待论证。
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsubsection{2. 多编码器结构}
+%----------------------------------------------
+\begin{figure}[htp]
+    \centering
+	\input{./Chapter17/Figures/figure-multiencoder}
+    \caption{多编码器结构\upcite{DBLP:conf/acl/LiLWJXZLL20}}
+    \label{fig:17-3-1}
+\end{figure}
+%----------------------------------------------
+\parinterval 区别于在输入上进行改进，另一种思路是对传统的编码器-解码器框架进行更改，采用额外的编码器来编码上下文句子，称之为多编码器结构\upcite{DBLP:conf/acl/LiLWJXZLL20,DBLP:conf/acl/LiLWJXZLL20,DBLP:conf/discomt/SugiyamaY19}。这种结构最早被应用在基于循环神经网络的篇章级翻译模型\upcite{DBLP:journals/corr/JeanLFC17,DBLP:conf/coling/KuangX18,DBLP:conf/naacl/BawdenSBH18,DBLP:conf/pacling/YamagishiK19}，并且在Transformer模型上同样适用\upcite{DBLP:journals/corr/abs-1805-10163,DBLP:conf/emnlp/ZhangLSZXZL18}。图\ref{fig:17-3-1}展示了一个基于Transformer模型的多编码器结构，基于源语言当前待翻译句子的编码表示$\mathbi{h}$和上下文句子的编码表示$\mathbi{h}_{pre}$，模型首先通过注意力机制提取句子间上下文信息$\mathbi{d}$：
+\begin{eqnarray}
+\mathbi{d}&=&Attention(\mathbi{h},\mathbi{h}_{pre},\mathbi{h}_{pre})
+\label{eq:17-3-3}
+\end{eqnarray}
+在注意力机制中，$\mathbi{h}$作为query（查询），$\mathbi{h}_{pre}$作为key（键）和value（值）。然后通过门控机制将每个位置的编码表示和上下文信息进行融合，具体方式如下：
+\begin{eqnarray}
+\widetilde{\mathbi{h}_{t}}&=&\lambda_{t}\mathbi{h}_{t}+(1-\lambda_{t})\mathbi{d}_{t}
+\label{eq:17-3-4}\\
+\lambda_{t}&=&\sigma(\mathbi{W}_{\lambda}[\mathbi{h}_{t};\mathbi{d}_{t}]+\mathbi{b}_{\lambda})
+\label{eq:17-3-5}
+\end{eqnarray}
+其中$\widetilde{\mathbi{h}}$为融合了上下文信息的最终序列表示结果，$\widetilde{\mathbi{h}_{t}}$为其中第$t$个位置的表示。$\mathbi{W}_{\lambda}$和$\mathbi{b}_{\lambda}$为模型可学习的参数，$\sigma$为Sigmoid函数，用来获取门控权值$\lambda$。
+\parinterval 除了在解码端外部进行融合，也可以将$\mathbi{h}_{pre}$送入解码器，在解码器中采用类似的机制进行融合\upcite{DBLP:conf/emnlp/ZhangLSZXZL18}。此外，多编码器结构由于引入了额外的模块，模型整体参数量大大增加，会导致其难以训练。为此一些研究人员提出使用句子级模型预训练的方式来初始化模型参数\upcite{DBLP:journals/corr/JeanLFC17,DBLP:conf/emnlp/ZhangLSZXZL18}，或者使用编码器参数共享的手段来减小模型复杂度\upcite{DBLP:conf/pacling/YamagishiK19,DBLP:conf/coling/KuangX18,DBLP:journals/corr/abs-1805-10163}。
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsubsection{3. 层次结构}
+%----------------------------------------------
+\begin{figure}[htp]
+    \centering
+	\input{./Chapter17/Figures/figure-layer}
+    \caption{层次注意力结构\upcite{Werlen2018DocumentLevelNM}}
+    \label{fig:17-3-2}
+\end{figure}
+%----------------------------------------------
+\parinterval 多编码器结构通过额外的编码器对前一句进行编码，但是无法处理多个上下文句子的情况。为了能够捕捉到更充分的上下文信息，可以采用层次结构来对更多的上下文句子进行建模。层次结构可以有效的处理更长的上下文序列，以及序列内不同单元之间的相互作用。类似的思想也成功的应用在基于树的翻译模型中（{\chaptereight}和{\chapterfifteen}）。
+\parinterval 图\ref{fig:17-3-2}描述了一个基于层次注意力的模型结构\upcite{DBLP:conf/emnlp/WerlenRPH18}。首先通过翻译模型的编码器获取前文$k$个句子的序列编码表示$(\mathbi{h}^k,\dots,\mathbi{h}^2,\mathbi{h}^1)$，然后使用层次注意力机制从这些编码表示中提取上下文信息$\mathbi{d}$，进而可以和当前句子的编码表示$\mathbi{h}$融合，得到一个上下文相关的当前句子表示$\widetilde{\mathbi{h}}$。其中层次注意力的计算过程也是分为两步，第一步针对前文每个句子的词序列表示$\mathbi{h}^{j}$，使用词级注意力提取从各个句子的上下文信息$\mathbi{s}^{j}$，然后在这$k$个句子级上下文表示$\mathbi{s}=(\mathbi{s}^k,\dots,\mathbi{s}^2,\mathbi{s}^1)$基础上，使用句子级注意力提取最终的上下文信息。具体计算过程如下所示：
+\begin{eqnarray}
+\mathbi{q}_{w}&=&f_w(\mathbi{h}_t)
+\label{eq:17-3-6}\\
+\mathbi{s}^j&=&WordAttention(\mathbi{q}_{w},\mathbi{h}^{j},\mathbi{h}^{j})
+\label{eq:17-3-7}\\
+\mathbi{q}_{s}&=&f_s(\mathbi{h}_t)
+\label{eq:17-3-8}\\
+\mathbi{d}_t&=&FFN(SentAttention(\mathbi{q}_{s},\mathbi{s},\mathbi{s}))
+\label{eq:17-3-9}
+\end{eqnarray}
+其中$\mathbi{h}_{t}$表示当前句子第$t$个位置的编码表示。为了增强模型表示能力，首先通过$f_w$和$f_s$两个线性变换分别获取词级注意力和句子级注意力的查询$\mathbi{q}_{w}$和$\mathbi{q}_{s}$，另外在句子级注意力之后添加了一个前馈全连接网络子层FFN。在获得上下文表示$\mathbi{d}_{t}$后，模型同样采用门控机制（如公式\eqref{eq:17-3-4}和公式\eqref{eq:17-3-5}）与$\mathbi{h}_{t}$进行融合来得到最终的编码表示$\widetilde{\mathbi{h}_{t}}$。
+\parinterval 通过层次注意力，模型可以在词级和句子级两个维度从多个句子中提取更充分的上下文信息，并且可以同时用于解码端来获取目标端的上下文信息。基于层次注意力，为了进一步编码整个篇章的上下文信息，研究人员提出选择性注意力\upcite{DBLP:conf/naacl/MarufMH19}来对篇章中整体上下文进行有选择的信息提取。此外，也有研究人员使用循环神经网络\upcite{DBLP:conf/emnlp/WangTWL17}、记忆网络\upcite{DBLP:conf/acl/HaffariM18}、胶囊网络\upcite{DBLP:conf/emnlp/YangZMGFZ19}和片段级相对注意力\upcite{DBLP:conf/ijcai/ZhengYHCB20}等结构来对多个上下文句子进行上下文信息提取。
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsubsection{4. 基于缓存的方法}
+\parinterval 除了以上提到的建模方法，还有一类基于缓存的方法\upcite{DBLP:journals/tacl/TuLSZ18,DBLP:conf/coling/KuangXLZ18}。这种方法最大的特点在于将篇章翻译看作一个连续的过程，然后在这个过程中通过一个额外的缓存来记录一些相关信息，最后在每个句子解码的过程中使用这个缓存来提供上下文信息。图\ref{fig:17-3-3}描述了一种基于缓存的篇章级翻译模型结构\upcite{DBLP:journals/tacl/TuLSZ18}。在这里，翻译模型基于循环神经网络（参考{\chapterten}），但是这种方法同样适用于包括Transformer在内的其他神经机器翻译模型。模型中篇章上下文的建模依赖于缓存的读和写操作。其中读操作以及与目标端表示的融合方法和层次结构中提到的方法类似，同样使用注意力机制以及门控机制来获取最终的目标端表示$\widetilde{\mathbi{s}_{t}}$。而缓存的写操作则是在每个句子翻译结束后，将句子中每个词${y}_{t}$对应的表示对$<\mathbi{c}_{t},\mathbi{s}_{t}>$作为注意力的键和值按照一定规则写入缓存。其中，$\mathbi{c}_{t}$和$\mathbi{s}_{t}$分别表示第$t$个目标词所对应的源语表示和解码器隐层状态。如果${y}_{t}$不存在于缓存，则写入其中的空槽或者替换最久未使用的键值对；如果${y}_{t}$存在于缓存，则将对应的键值对进行更新:
+\begin{eqnarray}
+\mathbi{k}_{i}&=&(\mathbi{k}_{i}+\mathbi{c}_{t})/2
+\label{eq:17-3-10}\\
+\mathbi{v}_{i}&=&(\mathbi{v}_{i}+\mathbi{s}_{t})/2
+\label{eq:17-3-11}
+\end{eqnarray}
+其中$i$表示$y_t$在缓存中的位置，$\mathbi{k}_{i}$和$\mathbi{v}_{i}$分别为缓存中对应的键和值。这种方法缓存的都是目标端历史的词级表示，因此能够解决一些词汇衔接的问题，比如词汇一致性和一些搭配问题，产生更连贯的翻译结果。
+%----------------------------------------------
+\begin{figure}[htp]
+    \centering
+	\input{./Chapter17/Figures/figure-cache}
+    \caption{基于Cache的解码器结构\upcite{DBLP:journals/tacl/TuLSZ18}}
+    \label{fig:17-3-3}
+\end{figure}
+%----------------------------------------------
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+\subsection{在推断阶段结合篇章上下文}
+\parinterval 上一节介绍的建模方法主要针对上下文句子进行建模，通过端到端的方式进行上下文信息的提取和融合。由于篇章级双语数据相对稀缺，这种复杂的篇章级翻译模型很难通过直接训练取得很好的效果，通常会采用两阶段训练或参数共享的方式。此外，相比之下句子级双语数据更为丰富，在此基础上训练得到的模型性能通常能够达到预期。因此，一个自然的想法是基于高质量句子级翻译模型，在推断过程中结合上下文信息方法来构造篇章级翻译模型。比如通过结合目标语言端的篇章级语言模型\upcite{DBLP:conf/discomt/GarciaCE19,DBLP:journals/tacl/YuSSLKBD20,DBLP:journals/corr/abs-2010-12827}来引入上下文信息，或者通过两阶段解码\upcite{DBLP:conf/aaai/XiongH0W19,DBLP:conf/acl/VoitaST19}和后编辑\upcite{DBLP:conf/emnlp/VoitaST19}的方法在句子级翻译结果上进行修正。
+\parinterval 相比于篇章级双语数据，篇章级单语数据更容易获取。在双语数据稀缺的情况下，通过引入目标语言端的篇章级语言模型可以更充分的利用这些单语数据。最简单的做法是在翻译模型的分数基础上加上语言模型的分数\upcite{DBLP:conf/discomt/GarciaCE19,DBLP:journals/corr/abs-2010-12827}，既可以在推断的搜索过程中作为模型最终打分，也可以将其作为重排序阶段的一种特征。其次，也可以使用噪声信道模型对篇章级翻译进行建模\upcite{DBLP:journals/tacl/YuSSLKBD20}。使用贝叶斯规则，将篇章翻译问题转换成如下形式（参考5.3节内容）：
+\begin{eqnarray}
+\widehat{Y}&=&\argmax_{Y}\funp{P}(Y|X)\\
+&=&\argmax_{Y}\underbrace{\funp{P}(X|Y)}_{\textrm{信道模型}}\times\underbrace{\funp{P}(Y)}_{\textrm{语言模型}}
+\label{eq:17-3-12}
+\end{eqnarray}
+其中$X$和$Y$分别表示源语言端和目标语言端篇章。进一步，可以得到近似形式：
+\begin{eqnarray}
+\widehat{Y}&\approx&\argmax_{Y}\prod_{i=1}^{T}{\funp{P}(X_i|Y_i)\times\funp{P}(Y_i|Y_{<i})}
+\label{eq:17-3-13}
+\end{eqnarray}
+通过这种生成式模型，只需要使用句子级的翻译模型以及目标端的篇章级翻译模型，避免了对篇章级双语数据的依赖。
+\parinterval 另一种改进方法不影响句子级翻译模型的推断过程，而是在完成翻译后使用额外的模块进行第二阶段的解码，通过两阶段的解码来引入上下文信息\upcite{DBLP:conf/aaai/XiongH0W19,DBLP:conf/acl/VoitaST19}。如图\ref{fig:17-3-4}所示，这种两阶段解码的做法相当于将篇章级翻译的问题进行了分离和简化，适用于篇章级双语数据稀缺的场景。基于类似的思想，有研究人员使用后编辑的做法对翻译结果进行修正\upcite{DBLP:conf/emnlp/VoitaST19}。区别于两阶段解码的方法，后编辑的方法无需参考源语信息，只是基于目标语言端的连续翻译结果来提供上下文信息。通过这种方式，可以降低对篇章级双语数据的需求。
+%----------------------------------------------
+\begin{figure}[htp]
+    \centering
+	\input{./Chapter17/Figures/figure-twodecoding}
+    \caption{两阶段解码}
+    \label{fig:17-3-4}
+\end{figure}
+%----------------------------------------------
+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------
+\section{小结及扩展阅读}
+\parinterval 本章仅对音频处理和语音识别进行了简单的介绍，具体内容可以参考一些经典书籍，比如关于信号处理的基础知识\upcite{Oppenheim2001DiscretetimeSP,Quatieri2001DiscreteTimeSS}，以及语音识别的传统方法\upcite{DBLP:books/daglib/0071550,Huang2001SpokenLP}和基于深度学习的最新方法\upcite{benesty2008automatic}。此外，语音翻译的一个重要应用是机器同声传译。
+\parinterval 在篇章级翻译方面，一些研究工作对这类模型的上下文建模能力进行了探索\upcite{DBLP:conf/discomt/KimTN19,DBLP:conf/acl/LiLWJXZLL20}，发现模型性能在小数据集上的BLEU提升并不完全来自于上下文信息的利用。同时，受限于数据规模，篇章级翻译模型相对难以训练。一些研究人员通过调整训练策略来帮助模型更容易捕获上下文信息\upcite{DBLP:journals/corr/abs-1903-04715,DBLP:conf/acl/SaundersSB20,DBLP:conf/mtsummit/StojanovskiF19}。除了训练策略的调整，也可以使用数据增强\upcite{DBLP:conf/discomt/SugiyamaY19}和预训练\upcite{DBLP:journals/corr/abs-1911-03110,DBLP:journals/tacl/LiuGGLEGLZ20}的手段来缓解数据稀缺的问题。此外，区别于传统的篇章级翻译，一些对话翻译也需要使用长距离上下文信息\upcite{DBLP:conf/wmt/MarufMH18}。
+\parinterval 最近，多模态机器翻译、图像描述、视觉问答\upcite{DBLP:conf/iccv/AntolALMBZP15}（Visual Question Answering）等多模态任务受到人工智能领域的广泛关注。如何将多个模态的信息充分融合，是研究多模态任务的重要问题。在自然语言处理领域transformer\upcite{vaswani2017attention}框架的提出后，被应用到计算机视觉\upcite{DBLP:conf/eccv/CarionMSUKZ20}、多模态任务\upcite{DBLP:conf/acl/YaoW20,DBLP:journals/tcsv/YuLYH20,Huasong2020SelfAdaptiveNM}效果也有显著的提升。另外，数据稀缺是多模态任务受限之处，可以采取数据增强\upcite{DBLP:conf/emnlp/GokhaleBBY20,DBLP:conf/eccv/Tang0ZWY20}的方式缓解。但是，这时仍需要回答在：模型没有充分训练时，图像等模态信息究竟在翻译里发挥了多少作用？类似的问题在篇章级机器翻译中也存在，上下文模型在训练数据量很小的时候对翻译的作用十分微弱（引用李北ACL）。因此，也有必要探究究竟图像等上下文信息如何可以更有效地发挥作用。此外，受到预训练模型的启发，在多模态领域，图像和文本联合预训练\upcite{DBLP:conf/eccv/Li0LZHZWH0WCG20,DBLP:conf/aaai/ZhouPZHCG20,DBLP:conf/iclr/SuZCLLWD20}的工作也相继开展，利用transformer框架，通过自注意力机制捕捉图像和文本的隐藏对齐，提升模型性能，同时缓解数据稀缺问题。
-\section{}
--- a/Chapter2/chapter2.tex
+++ b/Chapter2/chapter2.tex
@@ -273,7 +273,7 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \label{eq:2-15}
 \end{eqnarray}
-\parinterval 相对熵的意义在于：在一个事件空间里，概率分布$\funp{P}(x)$对应的每个事件的可能性。若用概率分布$\funp{Q}(x)$编码$\funp{P}(x)$，平均每个事件的信息量增加了多少。它衡量的是同一个事件空间里两个概率分布的差异。KL距离有两条重要的性质：
+\parinterval 其中，概率分布$\funp{P}(x)$对应的每个事件的可能性。相对熵的意义在于：在一个事件空间里，若用概率分布$\funp{Q}(x)$来编码$\funp{P}(x)$，相比于用概率分布$\funp{P}(x)$来编码$\funp{P}(x)$时信息量增加了多少。它衡量的是同一个事件空间里两个概率分布的差异。KL距离有两条重要的性质：
 \begin{itemize}
 \vspace{0.5em}
@@ -474,10 +474,12 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \label{eq:2-23}
 \end{eqnarray}
-\parinterval 这样，整个序列$w_1 w_2 \ldots w_m$的生成概率可以被重新定义为：
+\parinterval 如表\ref{tab:2-2}所示，整个序列$w_1 w_2 \ldots w_m$的生成概率可以被重新定义为：
 %------------------------------------------------------
+\begin{table}[htp]{
 \begin{center}
+\caption{基于$n$-gram的序列生成概率}
 {\footnotesize
 \begin{tabular}{l|l|l |l|l}
 链式法则 & 1-gram & 2-gram & $ \ldots $ & $n$-gram\\
@@ -491,7 +493,10 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \rule{0pt}{10pt} $\funp{P}(w_m|w_1  \ldots  w_{m-1})$ & $\funp{P}(w_m)$ & $\funp{P}(w_m|w_{m-1})$ & $ \ldots $ & $\funp{P}(w_m|w_{m-n+1}  \ldots  w_{m-1})$
 \end{tabular}
 }
+\label{tab:2-2}
 \end{center}
+}
+\end{table}
 %------------------------------------------------------
 \parinterval 可以看到，1-gram语言模型只是$n$-gram语言模型的一种特殊形式。基于独立性假设，1-gram假定当前单词出现与否与任何历史都无关，这种方法大大化简了求解句子概率的复杂度。比如，上一节中公式\eqref{eq:seq-independ}就是一个1-gram语言模型。但是，句子中的单词并非完全相互独立的，这种独立性假设并不能完美地描述客观世界的问题。如果需要更精确地获取句子的概率，就需要使用更长的“历史”信息，比如，2-gram、3-gram、甚至更高阶的语言模型。
@@ -565,7 +570,7 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \subsubsection{1. 加法平滑方法}
-\parinterval {\small\bfnew{加法平滑}}\index{加法平滑}（Additive Smoothing）\index{Additive Smoothing}是一种简单的平滑技术。本小节首先介绍这一方法，希望通过它了解平滑算法的思想。通常情况下，系统研发者会利用采集到的语料库来模拟真实的全部语料库。当然，没有一个语料库能覆盖所有的语言现象。假设有一个语料库$C$，其中从未出现“确实\ 现在”这样的2-gram，现在要计算一个句子$S$ =“确实/现在/物价/很/高”的概率。当计算“确实/现在”的概率时，$\funp{P}(S) = 0$，导致整个句子的概率为0。
+\parinterval {\small\bfnew{加法平滑}}\index{加法平滑}（Additive Smoothing）\index{Additive Smoothing}是一种简单的平滑技术。通常情况下，系统研发者会利用采集到的语料库来模拟真实的全部语料库。当然，没有一个语料库能覆盖所有的语言现象。假设有一个语料库$C$，其中从未出现“确实\ 现在”这样的2-gram，现在要计算一个句子$S$ =“确实/现在/物价/很/高”的概率。当计算“确实/现在”的概率时，$\funp{P}(S) = 0$，导致整个句子的概率为0。
 \parinterval 加法平滑方法假设每个$n$-gram出现的次数比实际统计次数多$\theta$次，$0 < \theta\le 1$。这样，计算概率的时候分子部分不会为0。重新计算$\funp{P}(\textrm{现在}|\textrm{确实})$，可以得到：
 \begin{eqnarray}
@@ -632,7 +637,7 @@ N & = & \sum_{r=0}^{\infty}{r^{*}n_r} \nonumber \\
 \noindent 其中$n_1/N$就是分配给所有出现为0次事件的概率。古德-图灵方法最终通过出现1次的$n$-gram估计了出现为0次的事件概率，达到了平滑的效果。
-\parinterval 下面通过一个例子来说明这个方法是如何对事件出现的可能性进行平滑的。仍然考虑在加法平滑法中统计单词的例子，根据古德-图灵方法进行修正如表\ref{tab:2-2}所示。
+\parinterval 下面通过一个例子来说明这个方法是如何对事件出现的可能性进行平滑的。仍然考虑在加法平滑法中统计单词的例子，根据古德-图灵方法进行修正如表\ref{tab:2-3}所示。
 %------------------------------------------------------
 \begin{table}[htp]{
@@ -647,7 +652,7 @@ N & = & \sum_{r=0}^{\infty}{r^{*}n_r} \nonumber \\
 \rule{0pt}{10pt} 3 & 1 & 4 & 0.333 \\
 \rule{0pt}{10pt} 4 & 1 & - & - \\
 \end{tabular}
-\label{tab:2-2}
+\label{tab:2-3}
 }
 \end{center}
 }\end{table}
@@ -684,7 +689,7 @@ I cannot see without my reading \underline{\ \ \ \ \ \ \ \ }
 \parinterval 观察语料中的2-gram发现，“Francisco”的前一个词仅可能是“San”，不会出现“reading”。这个分析证实了，考虑前一个词的影响是有帮助的，比如仅在前一个词是“San”时，才给“Francisco”赋予一个较高的概率值。基于这种想法，改进原有的1-gram模型，创造一个新的1-gram模型$\funp{P}_{\textrm{continuation}}$，简写为$\funp{P}_{\textrm{cont}}$。这个模型可以通过考虑前一个词的影响评估当前词作为第二个词出现的可能性。
-\parinterval 为了评估$\funp{P}_{\textrm{cont}}$，统计使用当前词作为第二个词所出现2-gram的种类，2-gram种类越多，这个词作为第二个词出现的可能性越高，呈正比：
+\parinterval 为了评估$\funp{P}_{\textrm{cont}}$，统计使用当前词作为第二个词所出现2-gram的种类，2-gram种类越多，这个词作为第二个词出现的可能性越高：
 \begin{eqnarray}
 \funp{P}_{\textrm{cont}}(w_i) \varpropto |\{w_{i-1}: c(w_{i-1} w_i )>0\}|
 \label{eq:2-34}
@@ -749,7 +754,7 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\
 \label{eq:5-65}
 \end{eqnarray}
-\parinterval  本质上，PPL反映了语言模型对序列可能性预测能力的一种评估。如果$ w_1\dots w_m $\\是真实的自然语言，``完美''的模型会得到$ \funp{P}(w_1\dots w_m)=1 $，它对应了最低的困惑度PPL=1，这说明模型可以完美地对词序列出现的可能性进行预测。当然，真实的语言模型是无法达到PPL=1的，比如，在著名的Penn Treebank（PTB）数据上最好的语言模型的PPL值也只能到达35左右。可见自然语言处理任务的困难程度。
+\parinterval  本质上，PPL反映了语言模型对序列可能性预测能力的一种评估。如果$ w_1\dots w_m $\\是真实的自然语言，“完美”的模型会得到$ \funp{P}(w_1\dots w_m)=1 $，它对应了最低的困惑度PPL=1，这说明模型可以完美地对词序列出现的可能性进行预测。当然，真实的语言模型是无法达到PPL=1的，比如，在著名的Penn Treebank（PTB）数据上最好的语言模型的PPL值也只能到达35左右。可见自然语言处理任务的困难程度。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -814,7 +819,7 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\
 \noindent 这里$\arg$即argument（参数），$\argmax_x f(x)$表示返回使$f(x)$达到最大的$x$。$\argmax_{w \in \chi}$\\$\funp{P}(w)$表示找到使语言模型得分$\funp{P}(w)$达到最大的单词序列$w$。$\chi$ 是搜索问题的解空间，它是所有可能的单词序列$w$的集合。$\hat{w}$可以被看做该搜索问题中的“最优解”，即概率最大的单词序列。
-\parinterval 在序列生成任务中，最简单的策略就是对词表中的词汇进行任意组合，通过这种枚举的方式得到全部可能的序列。但是，很多时候并生成序列的长度是无法预先知道的。比如，机器翻译中目标语序列的长度是任意的。那么怎样判断一个序列何时完成了生成过程呢？这里借用现代人类书写中文和英文的过程：句子的生成首先从一片空白开始，然后从左到右逐词生成，除了第一个单词，所有单词的生成都依赖于前面已经生成的单词。为了方便计算机实现，通常定义单词序列从一个特殊的符号<sos>后开始生成。同样地，一个单词序列的结束也用一个特殊的符号<eos>来表示。
+\parinterval 在序列生成任务中，最简单的策略就是对词表中的词汇进行任意组合，通过这种枚举的方式得到全部可能的序列。但是，很多时候待生成序列的长度是无法预先知道的。比如，机器翻译中目标语序列的长度是任意的。那么怎样判断一个序列何时完成了生成过程呢？这里借用现代人类书写中文和英文的过程：句子的生成首先从一片空白开始，然后从左到右逐词生成，除了第一个单词，所有单词的生成都依赖于前面已经生成的单词。为了方便计算机实现，通常定义单词序列从一个特殊的符号<sos>后开始生成。同样地，一个单词序列的结束也用一个特殊的符号<eos>来表示。
 \parinterval 对于一个序列$<$sos$>$\ I\ agree\ $<$eos$>$，图\ref{fig:2-12}展示语言模型视角下该序列的生成过程。该过程通过在序列的末尾不断附加词表中的单词来逐渐扩展序列，直到这段序列结束。这种生成单词序列的过程被称作{\small\bfnew{自左向右生成}}\index{自左向右生成}（Left-to-Right Generation）\index{Left-to-Right Generation}。注意，这种序列生成策略与$n$-gram的思想天然契合，因为$n$-gram语言模型中，每个词的生成概率依赖前面（左侧）若干词，因此$n$-gram语言模型也是一种自左向右的计算模型。
@@ -857,7 +862,7 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\
 \parinterval 当任务对单词序列长度没有限制时，上述两种方法枚举出的单词序列也是无穷无尽的。因此这两种枚举策略并不具备完备性而且会导致枚举过程无法停止。由于日常生活中通常不会见到特别长的句子，因此可以通过限制单词序列的最大长度来避免这个问题。一旦单词序列的最大长度被确定，以上两种枚举策略就可以在一定时间内枚举出所有可能的单词序列，因而一定可以找到最优的单词序列，即具备最优性。
-\parinterval 此时上述生成策略虽然可以满足完备性和最优性，但其仍然算不上是优秀的生成策略，因为这两种算法在时间复杂度和空间复杂度上的表现很差，如表\ref{tab:2-3}所示。其中$|V|$为词表大小，$m$ 为序列长度。值得注意的是，在之前的遍历过程中，除了在序列开头一定会挑选<sos>之外，其他位置每次可挑选的单词并不只有词表中的单词，还有结束符号<eos>，因此实际上生成过程中每个位置的单词候选数量为$|V|+1$。
+\parinterval 此时上述生成策略虽然可以满足完备性和最优性，但其仍然算不上是优秀的生成策略，因为这两种算法在时间复杂度和空间复杂度上的表现很差，如表\ref{tab:2-4}所示。其中$|V|$为词表大小，$m$ 为序列长度。值得注意的是，在之前的遍历过程中，除了在序列开头一定会挑选<sos>之外，其他位置每次可挑选的单词并不只有词表中的单词，还有结束符号<eos>，因此实际上生成过程中每个位置的单词候选数量为$|V|+1$。
 \vspace{0.5em}
 %------------------------------------------------------
@@ -870,13 +875,13 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\
 \rule{0pt}{10pt} 深度优先 & $O({(|V|+1)}^{m-1})$ & $O(m)$ \\
 \rule{0pt}{10pt} 宽度优先 & $O({(|V|+1)}^{m-1}$) & $O({(|V|+1)}^{m})$ \\
 \end{tabular}
-\label{tab:2-3}
+\label{tab:2-4}
 }
 \end{center}
 }\end{table}
 %------------------------------------------------------
-\parinterval 那么是否有比枚举策略更高效的方法呢？答案是肯定的。一种直观的方法是将搜索的过程表示成树型结构，称为解空间树。它包含了搜索过程中可生成的全部序列。该树的根节点恒为<sos>，代表序列均从<sos> 开始。该树结构中非叶子节点的兄弟节点有$|V|+1$个，由词表和结束符号<eos>构成。从图\ref{fig:2-13}可以看到，对于一个最大长度为4的序列的搜索过程，生成某个单词序列的过程实际上就是访问解空间树中从根节点<sos> 开始一直到叶子节点<eos>结束的某条路径，而这条的路径上节点按顺序组成了一段独特的单词序列。此时对所有可能单词序列的枚举就变成了对解空间树的遍历。并且枚举的过程与语言模型打分的过程也是一致的，每枚举一个词$i$也就是在上图选择$w_i$一列的一个节点，语言模型就可以为当前的树节点$w_i$给出一个分值，即$\funp{P}(w_i | w_1 w_2 \ldots w_{i-1})$。对于$n$-gram语言模型，这个分值可以表示为$\funp{P}(w_i | w_1 w_2 \ldots w_{i-1})=\funp{P}(w_i | w_{i-n+1} \ldots w_{i-1})$
+\parinterval 那么是否有比枚举策略更高效的方法呢？答案是肯定的。一种直观的方法是将搜索的过程表示成树型结构，称为解空间树。它包含了搜索过程中可生成的全部序列。该树的根节点恒为<sos>，代表序列均从<sos> 开始。该树结构中非叶子节点的兄弟节点有$|V|+1$个，由词表和结束符号<eos>构成。从图\ref{fig:2-13}可以看到，对于一个最大长度为4的序列的搜索过程，生成某个单词序列的过程实际上就是访问解空间树中从根节点<sos> 开始一直到叶子节点<eos>结束的某条路径，而这条的路径上节点按顺序组成了一段独特的单词序列。此时对所有可能单词序列的枚举就变成了对解空间树的遍历。并且枚举的过程与语言模型打分的过程也是一致的，每枚举一个词$i$也就是在图\ref{fig:2-13}选择$w_i$一列的一个节点，语言模型就可以为当前的树节点$w_i$给出一个分值，即$\funp{P}(w_i | w_1 w_2 \ldots w_{i-1})$。对于$n$-gram语言模型，这个分值可以表示为$\funp{P}(w_i | w_1 w_2 \ldots w_{i-1})=\funp{P}(w_i | w_{i-n+1} \ldots w_{i-1})$
 %----------------------------------------------
 \begin{figure}[htp]
@@ -912,7 +917,7 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\
 \end{figure}
 %-------------------------------------------
-\parinterval 这样，语言模型的打分与解空间树的遍历就融合在一起了。于是，序列生成的问题可以被重新描述为：寻找所有单词序列组成的解空间树中权重总和最大的一条路径。在这个定义下，前面提到的两种枚举词序列的方法就是经典的{\small\bfnew{深度优先搜索}}\index{深度优先搜索}（Depth-first Search）\index{Depth-first Search}和{\small\bfnew{宽度优先搜索}}\index{宽度优先搜索}（Breadth-first Search）\index{Breadth-first Search}的雏形\upcite{even2011graph,tarjan1972depth}。在后面的内容中，从遍历解空间树的角度出发，可以对原始这些搜索策略的效率进行优化。
+\parinterval 这样，语言模型的打分与解空间树的遍历就融合在一起了。于是，序列生成的问题可以被重新描述为：寻找所有单词序列组成的解空间树中权重总和最大的一条路径。在这个定义下，前面提到的两种枚举词序列的方法就是经典的{\small\bfnew{深度优先搜索}}\index{深度优先搜索}（Depth-first Search）\index{Depth-first Search}和{\small\bfnew{宽度优先搜索}}\index{宽度优先搜索}（Breadth-first Search）\index{Breadth-first Search}的雏形\upcite{even2011graph,tarjan1972depth}。在后面的内容中，从遍历解空间树的角度出发，可以对这些原始的搜索策略的效率进行优化。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -1038,7 +1043,7 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\
 \begin{adjustwidth}{1em}{}
 \begin{itemize}
 \vspace{0.5em}
-\item 在$n$-gram语言模型中，由于语料中往往存在大量的低频词以及未登录词，模型会产生不合理的概率预测结果。因此本章介绍了三种平滑方法，以解决上述问题。实际上，平滑方法是语言建模中的重要研究方向。除了上述三种方法之外，还有Jelinek–Mercer平滑\upcite{jelinek1980interpolated}、Katz 平滑\upcite{katz1987estimation}以及Witten–Bell平滑等等\upcite{bell1990text,witten1991the}。相关工作也对这些平滑方法进行了详细对比\upcite{chen1999empirical,goodman2001a}。
+\item 在$n$-gram语言模型中，由于语料中往往存在大量的低频词以及未登录词，模型会产生不合理的概率预测结果。因此本章介绍了三种平滑方法，以解决上述问题。实际上，平滑方法是语言建模中的重要研究方向。除了上文中介绍的三种平滑方法之外，还有如Jelinek–Mercer平滑\upcite{jelinek1980interpolated}、Katz 平滑\upcite{katz1987estimation}以及Witten–Bell平滑等等\upcite{bell1990text,witten1991the}的平滑方法。相关工作也对这些平滑方法进行了详细对比\upcite{chen1999empirical,goodman2001a}。
 \vspace{0.5em}
 \item 除了平滑方法，也有很多工作对$n$-gram语言模型进行改进。比如，对于形态学丰富的语言，可以考虑对单词的形态学变化进行建模。这类语言模型在一些机器翻译系统中也体现出了很好的潜力\upcite{kirchhoff2005improved,sarikaya2007joint,koehn2007factored}。此外，如何使用超大规模数据进行语言模型训练也是备受关注的研究方向。比如，有研究者探索了对超大语言模型进行压缩和存储的方法\upcite{federico2007efficient,federico2006how,heafield2011kenlm}。另一个有趣的方向是，利用随机存储算法对大规模语言模型进行有效存储\upcite{talbot2007smoothed,talbot2007randomised}，比如，在语言模型中使用Bloom\ Filter等随机存储的数据结构。
 \vspace{0.5em}

--- a/Chapter3/Figures/figure-examples-of-chinese-word-segmentation-based-on-1-gram-model.tex
+++ b/Chapter3/Figures/figure-examples-of-chinese-word-segmentation-based-on-1-gram-model.tex
@@ -39,24 +39,7 @@
 \node[rectangle,draw=ublue,thick,inner sep=0.2em,fill=white,drop shadow] [fit = (sentlabel) (sent)] (segsystem) {};
 \end{pgfonlayer}
-{\footnotesize
-{
-\node [anchor=west] (label1) at (0,6em) {实际上，通过学习我们得到了一个分词模型$\funp{P}(\cdot)$，给定任意的分词结果};
-\node [anchor=north west] (label1part2) at ([yshift=0.5em]label1.south west) {$W=w_1 w_2...w_n$，都能通过$\funp{P}(W)=\funp{P}(w_1) \cdot \funp{P}(w_2) \cdot ... \cdot \funp{P}(w_n)$ 计算这种分\hspace{0.13em} };
-\node [anchor=north west] (label1part3) at ([yshift=0.5em]label1part2.south west) {词的概率值};
-}
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,fill=blue!10,thick,dotted,inner sep=0.2em] [fit = (label1) (label1part2) (label1part3)] (label1content) {};
-}
-\end{pgfonlayer}
-{
-\draw [-,thick,dotted] ([yshift=0.3em]modellabel.north) ..controls +(north:0.5) and +(south:0.5).. ([xshift=-3em]label1content.south);
-}
-}
 {\footnotesize
 {

--- a/Chapter3/chapter3.tex
+++ b/Chapter3/chapter3.tex
@@ -378,7 +378,7 @@ $计算这种切分的概率值。
 \begin{itemize}
 \vspace{0.5em}
-\item 隐含状态序列的概率计算：即给定模型（转移概率和发射概率），根据可见状态序列（抛硬币的结果）计算在该模型下得到这个结果的概率，这个问题的解决需要用到前后向算法\upcite{baum1970maximization}。
+\item 隐含状态序列的概率计算：即给定模型（转移概率和发射概率），根据可见状态序列（抛硬币的结果）计算在该模型下得到这个结果的概率，这个问题的求解需要用到前后向算法\upcite{baum1970maximization}。
 \vspace{0.5em}
 \item 参数学习：即给定硬币种类（隐含状态数量），根据多个可见状态序列（抛硬币的结果）估计模型的参数（转移概率），这个问题的求解需要用到EM算法\upcite{1977Maximum}。
 \vspace{0.5em}
@@ -398,12 +398,12 @@ $计算这种切分的概率值。
 \parinterval 一种简单的办法是使用相对频次估计得到转移概率和发射概率估计值。令$x_i$表示第$i$个位置的可见状态，$y_i$表示第$i$个位置的隐含状态，$\funp{P}(y_i|y_{i-1})$表示第$i-1$个位置到第$i$个位置的状态转移概率，$\funp{P}(x_i|y_{i}) $表示第$i$个位置的发射概率，于是有：
 \begin{eqnarray}
-\funp{P}(y_i|y_{i-1}) = \frac{{c}(y_{i-1},y_i)}{{c}(y_{i-1})}
+\funp{P}(y_i|y_{i-1}) &=& \frac{{c}(y_{i-1},y_i)}{{c}(y_{i-1})}
 \label{eq:3.3-1}
 \end{eqnarray}
 \begin{eqnarray}
-\funp{P}(x_i|y_{i}) = \frac{{c}(x_i,y_i)}{{c}(y_i)}
+\funp{P}(x_i|y_{i}) &=& \frac{{c}(x_i,y_i)}{{c}(y_i)}
 \label{eq:3.3-2}
 \end{eqnarray}
@@ -411,20 +411,20 @@ $计算这种切分的概率值。
 \parinterval 在获得转移概率和发射概率的基础上，对于一个句子进行命名实体识别可以被描述为：在观测序列$\seq{x}$（可见状态，即输入的词序列）的条件下，最大化标签序列$\seq{y}$（隐含状态，即标记序列）的概率，即：
 \begin{eqnarray}
-\hat{\seq{y}} = \arg\max_{\seq{y}}\funp{P}(\seq{y}|\seq{x})
+\hat{\seq{y}} &=& \arg\max_{\seq{y}}\funp{P}(\seq{y}|\seq{x})
 \label{eq:3.3-3}
 \end{eqnarray}
 \parinterval 根据贝叶斯定理，该概率被分解为$\funp{P}(\seq{y}|\seq{x})=\frac{\funp{P}(\seq{x},\seq{y})}{\funp{P}(\seq{x})}$，其中$\funp{P}(\seq{x})$是固定概率，因为$\seq{x}$在这个过程中是确定的不变量。因此只需考虑如何求解分子，即将求条件概率$\funp{P}(\seq{y}|\seq{x})$的问题转化为求联合概率$\funp{P}(\seq{y},\seq{x})$的问题：
 \begin{eqnarray}
-\hat{\seq{y}} = \arg\max_{\seq{y}}\funp{P}(\seq{x},\seq{y}) \label{eq:markov-sequence-argmax}
+\hat{\seq{y}} &=& \arg\max_{\seq{y}}\funp{P}(\seq{x},\seq{y}) \label{eq:markov-sequence-argmax}
 \label{eq:3.3-4}
 \end{eqnarray}
 \parinterval 将式\eqref{eq:joint-prob-xy}带入式\eqref{eq:markov-sequence-argmax}可以得到最终计算公式，如下：
 \begin{eqnarray}
-\hat{\seq{y}} = \arg\max_{\seq{y}}\prod_{i=1}^{m}\funp{P}(x_i|y_i)\funp{P}(y_i|y_{i-1})
+\hat{\seq{y}} &=& \arg\max_{\seq{y}}\prod_{i=1}^{m}\funp{P}(x_i|y_i)\funp{P}(y_i|y_{i-1})
 \label{eq:3.3-5}
 \end{eqnarray}
@@ -483,7 +483,7 @@ F(y_{i-1},y_i,\seq{x},i) & = & t(y_{i-1},y_i,\seq{x},i)+s(y_i,\seq{x},i)
 \parinterval 公式\eqref{eq:3.3-9}中的$Z(x)$即为上面提到的实现全局统计归一化的归一化因子，其计算方式为：
 \begin{eqnarray}
-Z(\seq{x})=\sum_{\seq{y}}\exp(\sum_{i=1}^m\sum_{j=1}^k\lambda_{j}F_{j}(y_{i-1},y_i,\seq{x},i))
+Z(\seq{x})&=&\sum_{\seq{y}}\exp(\sum_{i=1}^m\sum_{j=1}^k\lambda_{j}F_{j}(y_{i-1},y_i,\seq{x},i))
 \label{eq:3.3-10}
 \end{eqnarray}
@@ -533,9 +533,9 @@ Z(\seq{x})=\sum_{\seq{y}}\exp(\sum_{i=1}^m\sum_{j=1}^k\lambda_{j}F_{j}(y_{i-1},y
 \parinterval 具体来说，分类任务目标是训练一个可以根据输入数据预测离散标签的{\small\sffamily\bfseries{分类器}}\index{分类器}（Classifier\index{Classifier}），也可称为分类模型。在有监督的分类任务中\footnote{与之相对应的，还有无监督、半监督分类任务，不过这些内容不是本书讨论的重点。读者可以参看参考文献\upcite{周志华2016机器学习,李航2019统计学习方法}对相关概念进行了解。}，训练数据集合通常由形似$(\boldsymbol{x}_i,y_i)$的带标注数据构成，$\boldsymbol{x}_i=(x_{i1},x_{i2},\ldots,x_{ik})$作为分类器的输入数据（通常被称作一个训练样本），其中$x_{ij}$表示样本$\boldsymbol{x}_i$的第$j$个特征；$y_i$作为输入数据对应的{\small\sffamily\bfseries{标签}}\index{标签}（Label）\index{Label}，反映了输入数据对应的“类别”。若标签集合大小为$n$，则分类任务的本质是通过对训练数据集合的学习，建立一个从$k$ 维样本空间到$n$维标签空间的映射关系。更确切地说，分类任务的最终目标是学习一个条件概率分布$\funp{P}(y|\boldsymbol{x})$，这样对于输入$\boldsymbol{x}$可以找到概率最大的$y$作为分类结果输出。
-\parinterval 与概率图模型一样，分类模型中也依赖特征定义。其定义形式与\ref{sec3:feature}节的描述一致，这里不再赘述。分类任务一般根据类别数量分为二分类任务和多分类任务，二分类任务是最经典的分类任务，只需要对输出进行非零即一的预测。多分类任务则可以有多种处理手段，比如，可以将其“拆解”为多个二分类任务求解，或者直接让模型输出多个类别中的一个。在命名实体识别中，往往会使用多类别分类模型。比如，在BIO标注下，有三个类别（B、I和O）。一般来说，类别数量越大分类的难度也越大。比如，BIOES标注包含5个类别，因此使用同样的分类器，它要比BIO标注下的分类问题难度大。另一方面，更多的类别有助于准确的刻画目标问题。因此在实践中需要在类别数量和分类难度之间找到一种平衡。
+\parinterval 与概率图模型一样，分类模型中也依赖特征定义。其定义形式与\ref{sec3:feature}节的描述一致，这里不再赘述。分类任务一般根据类别数量分为二分类任务和多分类任务，二分类任务是最经典的分类任务，只需要对输出进行非零即一的预测。多分类任务则可以有多种处理手段，比如，可以将其“拆解”为多个二分类任务求解，或者直接让模型输出多个类别中的一个。在命名实体识别中，往往会使用多类别分类模型。比如，在BIO标注下，有三个类别（B、I和O）。一般来说，类别数量越大分类的难度也越大。比如，BIOES标注包含5个类别，因此使用同样的分类器，它要比BIO标注下的分类问题难度大。此外，更多的类别有助于准确的刻画目标问题。因此在实践中需要在类别数量和分类难度之间找到一种平衡。
-\parinterval 在机器翻译和语言建模中也会遇到类似的问题，比如，生成单词的过程可以被看做是一个分类问题，类别数量就是词表的大小。显然，词表越大可以覆盖更多样的单词及形态学变化，但是过大的词表里会包含很多低频词，其计算复杂度会显著增加。然而，过小的词表又无法包含足够多的单词。因此，在设计这类系统的时候对词表大小的选择（类别数量的选择）是十分重要的，往往要通过大量的实验得到最优的设置。
+\parinterval 在机器翻译和语言建模中也会遇到类似的问题，比如，生成单词的过程可以被看做是一个分类问题，类别数量就是词表的大小。显然，词表越大可以覆盖更多的单词和更多种类的单词形态学变化，但是过大的词表里会包含很多低频词，其计算复杂度会显著增加。然而，过小的词表又无法包含足够多的单词。因此，在设计这类系统的时候对词表大小的选择（类别数量的选择）是十分重要的，往往要通过大量的实验得到最优的设置。
 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -563,7 +563,7 @@ Z(\seq{x})=\sum_{\seq{y}}\exp(\sum_{i=1}^m\sum_{j=1}^k\lambda_{j}F_{j}(y_{i-1},y
 %    NEW SECTION
 %----------------------------------------------------------------------------------------
-\section{句法分析（短语结构分析）}
+\section{句法分析}
 \parinterval 前面已经介绍了什么叫做“词”以及如何对分词问题进行统计建模。同时，也介绍了如何对多个单词构成的命名实体进行识别。无论是分词还是命名实体识别都是句子浅层信息的一种表示。对于一个自然语言句子来说，它更深层次的结构信息可以通过更完整的句法结构来描述，而句法信息也是机器翻译和自然语言处理其他任务中常用的知识之一。
@@ -649,19 +649,19 @@ Z(\seq{x})=\sum_{\seq{y}}\exp(\sum_{i=1}^m\sum_{j=1}^k\lambda_{j}F_{j}(y_{i-1},y
 \parinterval 举例说明，假设有上下文无关文法$G=<N,\varSigma,R,S>$，可以用它描述一个简单汉语句法结构。其中非终结符集合为不同的汉语句法标记
 \begin{eqnarray}
-N=\{\textrm{NN},\textrm{VV},\textrm{NP},\textrm{VP},\textrm{IP}\} \nonumber
+N&=&\{\textrm{NN},\textrm{VV},\textrm{NP},\textrm{VP},\textrm{IP}\} \nonumber
 \label{eq:3.4-1}
 \end{eqnarray}
 \noindent 这里，\textrm{NN}代表名词，\textrm{VV}代表动词，\textrm{NP}代表名词短语，\textrm{VP}代表动词短语，\textrm{IP}代表单句。进一步，把终结符集合定义为
 \begin{eqnarray}
-\varSigma = \{\text{猫,喜欢,吃,鱼}\} \nonumber
+\varSigma &=& \{\text{猫,喜欢,吃,鱼}\} \nonumber
 \label{eq:3.4-2}
 \end{eqnarray}
 再定义起始符集合为
 \begin{eqnarray}
-S=\{\textrm{IP}\} \nonumber
+S&=&\{\textrm{IP}\} \nonumber
 \label{eq:3.4-3}
 \end{eqnarray}
@@ -800,7 +800,7 @@ s_0 \overset{r_1}{\Rightarrow} s_1 \overset{r_2}{\Rightarrow} s_2 \overset{r_3}{
 \parinterval 概率上下文无关文法与传统上下文无关文法的区别在于，每条规则都会有一个概率，描述规则生成的可能性。具体来说，规则$\funp{P}(\alpha \to \beta)$的概率可以被定义为：
 \begin{eqnarray}
-\funp{P}(\alpha \to \beta)=\funp{P}(\beta | \alpha)
+\funp{P}(\alpha \to \beta)&=&\funp{P}(\beta | \alpha)
 \label{eq:3.4-4}
 \end{eqnarray}
@@ -831,7 +831,7 @@ r_6: & & \textrm{VP} \to \textrm{VV}\ \textrm{NN} \nonumber
 \parinterval 新的问题又来了，如何得到规则的概率呢？这里仍然可以从数据中学习文法规则的概率。假设有人工标注的数据，它包括很多人工标注句法树的句法，称之为{\small\sffamily\bfseries{树库}}\index{树库}（Treebank）\index{Treebank}。然后，对于规则$\textrm{r}:\alpha \to \beta$可以使用基于频次的方法：
 \begin{eqnarray}
-\funp{P}(r)  = \frac{\text{规则$r$在树库中出现的次数}}{\alpha \text{在树库中出现的次数}}
+\funp{P}(r)  &=& \frac{\text{规则$r$在树库中出现的次数}}{\alpha \text{在树库中出现的次数}}
 \label{eq:3.4-8}
 \end{eqnarray}
@@ -879,7 +879,7 @@ r_6: & & \textrm{VP} \to \textrm{VV}\ \textrm{NN} \nonumber
 \vspace{0.5em}
 \item 在建模方面，本章描述了基于1-gram语言模型的分词、基于上下文无关文法的句法分析等，它们都是基于人工先验知识进行模型设计的思路。也就是，问题所表达的现象被“一步一步”生成出来。这是一种典型的生成式建模思想，它把要解决的问题看作一些观测结果的隐含变量（比如，句子是观测结果，分词结果是隐含在背后的变量），之后通过对隐含变量生成观测结果的过程进行建模，以达到对问题进行数学描述的目的。这类模型一般需要依赖一些独立性假设，假设的合理性对最终的性能有较大影响。相对于生成式模型，另一类方法是{\small\sffamily\bfseries{判别式模型}}\index{判别式模型}（Discriminative Model）\index{Discriminative Model}。本章序列标注内容中提到一些模型就是判别式模型，如条件随机场\upcite{lafferty2001conditional}。它直接描述了从隐含变量生成观测结果的过程，这样对问题的建模更加直接，同时这类模型可以更加灵活的引入不同的特征。判别模型在自然语言处理中也有广泛应用\upcite{ng2002discriminative,manning2008introduction,berger1996maximum,mitchell1996m,DBLP:conf/acl/OchN02}。 在本书的第七章也会使用到判别式模型。
 \vspace{0.5em}
-\item 此外，本章并没有对分词、句法分析中的预测问题进行深入介绍。比如，如何找到概率最大的分词结果？这部分可以直接借鉴第二章中介绍的搜索方法。比如，对于基于$n$-gram语言模型的分词方法，可以 使用动态规划\upcite{huang2008coling}。对于动态规划的使用条件不满足的情况，可以考虑使用更加复杂的搜索策略，并配合一定的剪枝方法。实际上，无论是基于$n$-gram 语言模型的分词还是简单的上下文无关文法都有高效的推断方法。比如，$n$-gram语言模型可以被视为概率有限状态自动机，因此可以直接使用成熟的自动机工具\upcite{mohri2008speech}。对于更复杂的句法分析问题，可以考虑使用移进- 规约方法来解决预测问题\upcite{aho1972theory}。
+\item 事实上，本章并没有对分词、句法分析中的预测问题进行深入介绍。比如，如何找到概率最大的分词结果？这个问题的解决可以直接借鉴{\chaptertwo}中介绍的搜索方法：对于基于$n$-gram 语言模型的分词方法，可以使用动态规划方法\upcite{huang2008coling}进行搜索；在不满足动态规划的使用条件时，可以考虑使用更加复杂的搜索策略，并配合一定的剪枝方法找到最终的分词结果。实际上，无论是基于$n$-gram 语言模型的分词还是简单的上下文无关文法都有高效的推断方法。比如，$n$-gram语言模型可以被视为概率有限状态自动机，因此可以直接使用成熟的自动机工具\upcite{mohri2008speech}。对于更复杂的句法分析问题，可以考虑使用移进- 规约方法来解决预测问题\upcite{aho1972theory}。
 \vspace{0.5em}
 \item 从自然语言处理的角度来看，词法分析和语法分析中的很多问题都是序列标注问题，例如本章介绍的分词和命名实体识别。此外序列标注还可以被扩展到词性标注\upcite{brants-2000-tnt}、组块识别\upcite{tsuruoka-tsujii-2005-chunk}、关键词抽取\upcite{li-etal-2003-news-oriented}、词义角色标注\upcite{chomsky1993lectures}等任务，本章着重介绍了传统的方法，前沿方法大多与深度学习相结合，感兴趣的读者可以自行了解，其中比较有代表性的使用双向长短时记忆网络对序列进行建模，之后于不同模型进行融合得到最终的结果，例如，与条件随机场相结合的模型（BiLSTM-CRF）\upcite{2015Bidirectional}、与卷积神经网络相结合的模型（BiLSTM-CNNs）\upcite{chiu2016named}、与简单的Softmax结构相结合的模型\upcite{vzukov2018named}等。此外，对于序列标注任务，模型性能很大程度上依赖对输入序列的表示能力，因此基于预训练语言模型的方法也非常流行\upcite{Li2020A}，如：BERT\upcite{devlin2019bert}、GPT\upcite{radford2018improving}、XLM\upcite{conneau2019unsupervised}等。
 \vspace{0.5em}

--- a/Chapter4/chapter4.tex
+++ b/Chapter4/chapter4.tex
@@ -109,7 +109,7 @@
 \subsection{评价策略}
-\parinterval 合理的评价指标是人工评价得以顺利进行的基础。机器译文质量的人工评价可以追溯到1966年，自然语言处理咨询委员会提出{\small\sffamily\bfseries{可理解度}}\index{可理解度}（Intelligibility）\index{Intelligibility}和忠诚度作为机器译文质量人工评价指标\upcite{DBLP:journals/mtcl/Carroll66}。1994 年，{\small\sffamily\bfseries{充分性}}\index{充分性}（Adequacy）\index{Adequacy}、流畅度和{\small\sffamily\bfseries{信息性}}\index{信息性}（Informativeness）\index{Informativeness}成为为ARPA MT\footnote{ARPA MT计划是美国高级研究计划局软件和智能系统技术处人类语言技术计划的一部分。}的人工评价标准\upcite{DBLP:conf/amta/WhiteOO94}。此后，有不少研究者提出了更多的机器译文质量人工评估指标，例如将{\small\sffamily\bfseries{清晰度}}\index{清晰度}（Clarity）\index{Clarity}和{\small\sffamily\bfseries{连贯性}}\index{连贯性}（Coherence）\index{Coherence}加入人工评价指标中\upcite{Miller:2005:MTS}。甚至有人将各种人工评价指标集中在一起，组成了尽可能全面的机器翻译评估框架\upcite{king2003femti}。
+\parinterval 合理的评价指标是人工评价得以顺利进行的基础。机器译文质量的人工评价可以追溯到1966年，自然语言处理咨询委员会提出{\small\sffamily\bfseries{可理解度}}\index{可理解度}（Intelligibility）\index{Intelligibility}和忠诚度作为机器译文质量人工评价指标\upcite{DBLP:journals/mtcl/Carroll66}。1994 年，{\small\sffamily\bfseries{充分性}}\index{充分性}（Adequacy）\index{Adequacy}、流畅度和{\small\sffamily\bfseries{信息性}}\index{信息性}（Informativeness）\index{Informativeness}成为ARPA MT\footnote{ARPA MT计划是美国高级研究计划局软件和智能系统技术处人类语言技术计划的一部分。}的人工评价标准\upcite{DBLP:conf/amta/WhiteOO94}。此后，有不少研究者提出了更多的机器译文质量人工评估指标，例如将{\small\sffamily\bfseries{清晰度}}\index{清晰度}（Clarity）\index{Clarity}和{\small\sffamily\bfseries{连贯性}}\index{连贯性}（Coherence）\index{Coherence}加入人工评价指标中\upcite{Miller:2005:MTS}。甚至有人将各种人工评价指标集中在一起，组成了尽可能全面的机器翻译评估框架\upcite{king2003femti}。
 \parinterval 人工评价的策略非常多。考虑不同的因素，往往会使用不同的评价方案，比如：
@@ -119,7 +119,7 @@
 \vspace{0.5em}
 \item {\small\sffamily\bfseries{评价者选择}}。理想情况下，评价者应同时具有源语言和目标语言的语言能力。但是，很多时候具备双语能力的评价者很难招募，因此这时会考虑使用目标语为母语的评价者。配合参考答案，单语评价者也可以准确地评价译文质量。
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{多个系统同时评价}}。如果有多个不同系统的译文需要评价，可以直接使用每个系统单独打分的方法。但是，如果仅仅是想了解不同译文之间的相对好坏，也可以采用竞评的方式，即对于每个句子，对不同系统根据译文质量进行排序，这样做的效率会高于直接打分，而且评价准确性也能够得到保证。
+\item {\small\sffamily\bfseries{多个系统同时评价}}。如果有多个不同系统的译文需要评价，可以直接使用每个系统单独打分的方法。但是，如果仅仅是想了解不同译文之间的相对好坏，也可以采用竞评的方式：对每个待翻译的源语言句子，根据各个机器翻译系统输出的译文质量对所有待评价的机器翻译系统进行排序，这样做的效率会高于直接打分，而且评价准确性也能够得到保证。
 \vspace{0.5em}
 \item {\small\sffamily\bfseries{数据选择}}。评价数据一般需要根据目标任务进行采集，为了避免和系统训练数据重复，往往会搜集最新的数据。而且，评价数据的规模越大，评价结果越科学。常用的做法是搜集一定量的评价数据，之后从中采样出所需的数据。由于不同的采样会得到不同的评价集合，这样的方法可以复用多次，得到不同的测试集。
 \vspace{0.5em}
@@ -137,35 +137,32 @@
 \parinterval 除了对译文进行简单的打分，另一种经典的人工评价方法是{\small\sffamily\bfseries{相对排序}}\index{相对排序}（Relative Ranking，RR）\index{Relative Ranking}\upcite{DBLP:conf/wmt/Callison-BurchF07}。这种方法通过对不同机器翻译的译文质量进行相对排序得到最终的评价结果。举例来说：
-\parinterval （1）在每次评价过程中，若干个等待评价的机器翻译系统被分为5个一组，评价者被提供3个连续的源文片段和1组机器翻译系统的相应译文；
-\parinterval （2）评价者需要对本组的机器译文根据其质量进行排序，不过评价者并不需要一次性将5个译文排序，而是将其两两进行比较，判出胜负或是平局。在评价过程中，由于排序是两两一组进行的，为了评价的公平性，将采用排列组合的方式进行分组和比较，若共有$n$个机器翻译系统，则会为被分为 $\textrm{C}_n^5$组，组内每个系统都将与其他4个系统进行比较，由于需要针对3个源文片段进行评价对比，则意味着每个系统都需要被比较$\textrm{C}_n^5 \times 4 \times 3$次；
-\parinterval （3）最终根据多次比较的结果，对所有参与评价的系统进行总体排名。对于如何获取合理的总体排序，有三种常见的策略：
 \begin{itemize}
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{根据系统胜出的次数进行排序}}\upcite{DBLP:conf/wmt/Callison-BurchK12}。以系统${S}_j$和系统${S}_k$为例，两个系统都被比较了$\textrm{C}_n^5 \times 4 \times 3$ 次，其中系统${S}_j$获胜20次，系统${S}_k$获胜30次，总体排名中系统${S}_k$优于系统${S}_j$。
+\item 在每次评价过程中，若干个等待评价的机器翻译系统被分为5个一组，评价者被提供3个连续的源文片段和1组机器翻译系统的相应译文；
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{根据冲突次数进行排序}}\upcite{DBLP:conf/wmt/Lopez12}。第一种排序策略中存在冲突现象：例如在每次两两比较中，系统${S}_j$胜过系统${S}_k$ 的次数比系统${S}_j$不敌系统${S}_k$的次数多，若待评价系统仅有系统${S}_j$、${S}_k$，显然系统${S}_j$的排名高于系统${S}_k$。但当待评价系统很多时，可能系统${S}_j$在所有比较中获胜的次数低于系统${S}_k$，此时就出现了总体排序与局部排序不一致的冲突。因此，有研究者提出，能够与局部排序冲突最少的总体排序才是最合理的。令$O$表示一个对若干个系统的排序，该排序所对应的冲突定义为：
+\item 评价者需要对本组的机器译文根据其质量进行排序，不过评价者并不需要一次性将5个译文排序，而是将其两两进行比较，判出胜负或是平局。在评价过程中，由于排序是两两一组进行的，为了评价的公平性，将采用排列组合的方式进行分组和比较，若共有$n$个机器翻译系统，则会为被分为 $\textrm{C}_n^5$组，组内每个系统都将与其他4个系统进行比较，由于需要针对3个源文片段进行评价对比，则意味着每个系统都需要被比较$\textrm{C}_n^5 \times 4 \times 3$次；
+\vspace{0.5em}
+\item 最终根据多次比较的结果，对所有参与评价的系统进行总体排名。对于如何获取合理的总体排序，有三种常见的策略：
+    \begin{itemize}
+    \item {\small\sffamily\bfseries{根据系统胜出的次数进行排序}}\upcite{DBLP:conf/wmt/Callison-BurchK12}。以系统${S}_j$和系统${S}_k$为例，两个系统都被比较了$\textrm{C}_n^5 \times 4 \times 3$ 次，其中系统${S}_j$获胜20次，系统${S}_k$获胜30次，总体排名中系统${S}_k$优于系统${S}_j$。
+    \item  {\small\sffamily\bfseries{根据冲突次数进行排序}}\upcite{DBLP:conf/wmt/Lopez12}。第一种排序策略中存在冲突现象：例如在每次两两比较中，系统${S}_j$胜过系统${S}_k$ 的次数比系统${S}_j$不敌系统${S}_k$的次数多，若待评价系统仅有系统${S}_j$、${S}_k$，显然系统${S}_j$的排名高于系统${S}_k$。但当待评价系统很多时，可能系统${S}_j$在所有比较中获胜的次数低于系统${S}_k$，此时就出现了总体排序与局部排序不一致的冲突。因此，有研究者提出，能够与局部排序冲突最少的总体排序才是最合理的。令$O$表示一个对若干个系统的排序，该排序所对应的冲突定义为：
 \begin{eqnarray}
-\textrm{conflict}(O) = \sum\limits_{{{S}_j} \in O,{{S}_k} \in O,j \ne k} {{\textrm{max}}(0,\textrm{count}_{\textrm{win}}({{S}_j},{{S}_k}) - \textrm{count}_{\textrm{loss}}({{S}_j},{{S}_k}))}
+\textrm{conflict}(O) &=& \sum\limits_{{{S}_j} \in O,{{S}_k} \in O,j \ne k} {{\textrm{max}}(0,\textrm{count}_{\textrm{win}}({{S}_j},{{S}_k}) - \textrm{count}_{\textrm{loss}}({{S}_j},{{S}_k}))}
 \label{eq:4-1}
 \end{eqnarray}
-    其中，${S}_j$和${S}_k$是成对比较的两个系统，$\textrm{count}_{\textrm{win}}({S}_j,{S}_k)$和$\textrm{count}_{\textrm{loss}}({S}_j,{S}_k)$分别是${S}_j$、${S}_k$进行成对比较时系统${S}_j$ 胜利和失败的次数。而使得$\textrm{conflict}(O)$最低的$O$就是最终的系统排序结果。
+其中，${S}_j$和${S}_k$是成对比较的两个系统，$\textrm{count}_{\textrm{win}}({S}_j,{S}_k)$和$\textrm{count}_{\textrm{loss}}({S}_j,{S}_k)$分别是${S}_j$、${S}_k$进行成对比较时系统${S}_j$ 胜利和失败的次数。而使得$\textrm{conflict}(O)$最低的$O$就是最终的系统排序结果。
+    \item {\small\sffamily\bfseries{根据某系统最终获胜的期望进行排序}}\upcite{DBLP:conf/iwslt/Koehn12}。以系统${S}_j$为例，若共有$n$个待评价的系统，则进行总体排序时系统 ${S}_j$ 的得分为其最终获胜的期望，即：
-\vspace{0.5em}
-\item {\small\sffamily\bfseries{根据某系统最终获胜的期望进行排序}}\upcite{DBLP:conf/iwslt/Koehn12}。以系统${S}_j$为例，若共有$n$个待评价的系统，则进行总体排序时系统 ${S}_j$ 的得分为其最终获胜的期望，即：
 \begin{eqnarray}
-\textrm{score}({{S}_j}) = \frac{1}{n}\sum\limits_{k,k \ne j} {\frac{\textrm{count}_{\textrm{win}}({{S}_j},{{S}_k})}{{\textrm{count}_{\textrm{win}}({{S}_j},{{S}_k}) + \textrm{count}_{\textrm{loss}}({{S}_j},{{S}_k})}}}
+\textrm{score}({{S}_j}) &=& \frac{1}{n}\sum\limits_{k,k \ne j} {\frac{\textrm{count}_{\textrm{win}}({{S}_j},{{S}_k})}{{\textrm{count}_{\textrm{win}}({{S}_j},{{S}_k}) + \textrm{count}_{\textrm{loss}}({{S}_j},{{S}_k})}}}
 \label{eq:4-2}
 \end{eqnarray}
 根据公式\eqref{eq:4-2}可以看出，该策略去除了平局的影响。
+    \end{itemize}
 \vspace{0.5em}
 \end{itemize}
@@ -201,7 +198,7 @@
 \parinterval TER是一种典型的基于距离的评价方法，通过评定机器译文的译后编辑工作量来衡量机器译文质量。在这里“距离”被定义为将一个序列转换成另一个序列所需要的最少编辑操作次数，操作次数越多，距离越大，序列之间的相似性越低；相反距离越小，表示一个句子越容易改写成另一个句子，序列之间的相似性越高。TER 使用的编辑操作包括：增加、删除、替换和移位。其中增加、删除、替换操作计算得到的距离被称为编辑距离。TER根据错误率的形式给出评分：
 \begin{eqnarray}
-\textrm{score}= \frac{\textrm{edit}(o,g)}{l}
+\textrm{score}&=& \frac{\textrm{edit}(o,g)}{l}
 \label{eq:4-3}
 \end{eqnarray}
@@ -216,7 +213,7 @@
 \parinterval 在这个实例中，将机器译文序列转换为参考答案序列，需要进行两次替换操作，将“A” 替换为“The”，将“in” 替换为“on”。所以$\textrm{edit}(c,r)$ = 2，归一化因子$l$为参考答案的长度8（包括标点符号），所以该机器译文的TER 结果为2/8。
-\parinterval PER与TER的基本思想与WER相同，这三种方法的主要区别在于对“错误” 的定义和考虑的操作类型略有不同。WER使用的编辑操作包括：增加、删除、替换，由于没有移位操作，当机器译文出现词序问题时，会发生多次替代，因而一般会低估译文质量；而PER只考虑增加和删除两个动作，在不考虑词序的情况下，PER计算两个句子中出现相同单词的次数，根据翻译句子比参考答案长或短，其余操作无非是插入词或删除词，这样往往会高估译文质量。
+\parinterval PER与TER的基本思想与WER相同，这三种方法的主要区别在于对“错误” 的定义和考虑的操作类型略有不同。WER使用的编辑操作包括：增加、删除、替换，由于没有移位操作，当机器译文出现词序问题时，会发生多次替代，因而一般会低估译文质量；而PER只考虑增加和删除两个动作，在不考虑词序的情况下，PER计算两个句子中出现相同单词的次数，根据机器译文与参考答案的长度差距，其余操作无非是插入词或删除词，这样往往会高估译文质量。
 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -228,7 +225,7 @@
 \parinterval BLEU 的计算首先考虑待评价机器译文中$n$-gram在参考答案中的匹配率，称为{\small\sffamily\bfseries{$\bm{n}$-gram准确率}}\index{$\{n}$-gram准确率}（$n$-gram Precision）\index{$n$-gram Precision}。其计算方法如下：
 \begin{eqnarray}
-\funp{P}_{n} = \frac{{{\textrm{coun}}{{\textrm{t}}_{{\textrm{hit}}}}}}{{{\textrm{coun}}{{\textrm{t}}_{{\textrm{output}}}}}}
+\funp{P}_{n} &=& \frac{{{\textrm{coun}}{{\textrm{t}}_{{\textrm{hit}}}}}}{{{\textrm{coun}}{{\textrm{t}}_{{\textrm{output}}}}}}
 \label{eq:4-4}
 \end{eqnarray}
@@ -245,13 +242,13 @@
 \parinterval 令$N$表示考虑的最大$n$-gram的大小，则译文整体的准确率等于各$n$-gram的加权平均：
 \begin{eqnarray}
-{\funp{P}_{{\textrm{avg}}}} = \exp (\sum\limits_{n = 1}^N {{w_n} \cdot {{{\mathop{\log\funp{P}}\nolimits} }_n}} )
+{\funp{P}_{{\textrm{avg}}}} &=& \exp (\sum\limits_{n = 1}^N {{w_n} \cdot {{{\mathop{\log\funp{P}}\nolimits} }_n}} )
 \label{eq:4-5}
 \end{eqnarray}
 \parinterval 但是，该方法更倾向于对短句子打出更高的分数。一个极端的例子是译文只有很少的几个词，但是都命中答案，准确率很高可显然不是好的译文。因此，BLEU 引入{\small\sffamily\bfseries{短句惩罚因子}}\index{短句惩罚因子}（Brevity Penalty，BP）\index{Brevity Penalty}的概念，对短句进行惩罚:
 \begin{eqnarray}
-\textrm {BP} = \left\{ \begin{array}{l}
+\textrm {BP} &=& \left\{ \begin{array}{l}
 1\quad \quad \;\;c > r\\
 {\textrm{exp}}(1 - \frac{r}{c})\quad c \le r
 \end{array} \right.
@@ -260,7 +257,7 @@
 \noindent 其中，$c$表示机器译文的句子长度，$r$表示参考答案的句子长度。最终BLEU的计算公式为：
 \begin{eqnarray}
-\textrm {BLEU} = \textrm {BP} \cdot \exp(\sum\limits_{n = 1}^N {{w_n} \cdot {{{\mathop{\textrm {log}}\nolimits} }\funp{P}_n}} )
+\textrm {BLEU} &=& \textrm {BP} \cdot \exp(\sum\limits_{n = 1}^N {{w_n} \cdot {{{\mathop{\textrm {log}}\nolimits} }\funp{P}_n}} )
 \label{eq:4-7}
 \end{eqnarray}
@@ -291,11 +288,11 @@
 \parinterval 在Meteor方法中，首先在机器译文与参考答案之间建立单词之间的对应关系，再根据其对应关系计算准确率和召回率。
-\parinterval （1）单词之间的对应关系在建立过程中主要涉及三个模型，在对齐过程中依次使用这三个模型进行匹配：
 \begin{itemize}
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{“绝对”匹配模型}}\index{“绝对”匹配模型}（Exact Model）\index{Exact Model}。绝对匹配模型在建立单词对应关系时，要求机器译文端的单词与参考答案端的单词完全一致，并且在参考答案端至多有1个单词与机器译文端的单词对应，否则会将其视为多种对应情况。对于实例\ref{eg:4-2}，使用“绝对”匹配模型，共有两种匹配结果，如图\ref{fig:4-3}所示。
+\item 在机器译文与参考答案之间建立单词之间的对应关系。单词之间的对应关系在建立过程中主要涉及三个模型，在对齐过程中依次使用这三个模型进行匹配：
+    \begin{itemize}
+    \item {\small\sffamily\bfseries{“绝对”匹配模型}}\index{“绝对”匹配模型}（Exact Model）\index{Exact Model}。绝对匹配模型在建立单词对应关系时，要求机器译文端的单词与参考答案端的单词完全一致，并且在参考答案端至多有1个单词与机器译文端的单词对应，否则会将其视为多种对应情况。对于实例\ref{eg:4-2}，使用“绝对”匹配模型，共有两种匹配结果，如图\ref{fig:4-3}所示。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -306,9 +303,7 @@
   \label{fig:4-3}
 \end{figure}
 %----------------------------------------------
+    \item  {\small\sffamily\bfseries{“波特词干”匹配模型}}\index{“波特词干”匹配模型}（Porter Stem Model）\index{Porter Stem Model}。该模型在“绝对”匹配结果的基础上，对尚未对齐的单词进行基于词干的匹配，只需机器译文端单词与参考答案端单词的词干相同即可，如上文中的“do”和“did”。对于图\ref{fig:4-3}中显示的词对齐结果，再使用“波特词干” 匹配模型，得到如图\ref{fig:4-4}所示的结果。
-\vspace{0.5em}
-\item {\small\sffamily\bfseries{“波特词干”匹配模型}}\index{“波特词干”匹配模型}（Porter Stem Model）\index{Porter Stem Model}。该模型在“绝对”匹配结果的基础上，对尚未对齐的单词进行基于词干的匹配，只需机器译文端单词与参考答案端单词的词干相同即可，如上文中的“do”和“did”。对于图\ref{fig:4-3}的结果，再使用“波特词干” 匹配模型，得到如图\ref{fig:4-4}所示的结果。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -318,9 +313,7 @@
    \label{fig:4-4}
 \end{figure}
 %----------------------------------------------
+    \item {\small\sffamily\bfseries{“同义词”匹配模型}}\index{“同义词”匹配模型}（WN Synonymy Model）\index{WN Synonymy Model}。该模型在前两个模型匹配结果的基础上，对尚未对齐的单词进行同义词的匹配，即基于WordNet词典匹配机器译文与参考答案中的同义词。如实例\ref{eg:4-2}中的“eat”和“have”。图\ref{fig:4-5}给出了一个真实的例子。
-\vspace{0.5em}
-\item {\small\sffamily\bfseries{“同义词”匹配模型}}\index{“同义词”匹配模型}（WN Synonymy Model）\index{WN Synonymy Model}。该模型在前两个模型匹配结果的基础上，对尚未对齐的单词进行同义词的匹配，即基于WordNet词典匹配机器译文与参考答案中的同义词。如上例中的“eat”和“have”。图\ref{fig:4-5}给出了一个真实的例子。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -330,11 +323,8 @@
    \label{fig:4-5}
 \end{figure}
 %----------------------------------------------
+    \end{itemize}
-\vspace{0.5em}
+经过上面的处理，可以得到机器译文与参考答案之间的单词对齐关系。下一步需要从中确定一个拥有最大的子集的对齐关系，即机器译文中被对齐的单词个数最多的对齐关系。但是在上例中的两种对齐关系子集基数相同，这种情况下，需要选择一个对齐关系中交叉现象出现最少的对齐关系。于是，最终的对齐关系如图\ref{fig:4-6}所示。
-\end{itemize}
-\parinterval 经过上面的处理，可以得到机器译文与参考答案之间的单词对齐关系。下一步需要从中确定一个拥有最大的子集的对齐关系，即机器译文中被对齐的单词个数最多的对齐关系。但是在上例中的两种对齐关系子集基数相同，这种情况下，需要选择一个对齐关系中交叉现象出现最少的对齐关系。于是，最终的对齐关系如图\ref{fig:4-6}所示。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -344,36 +334,39 @@
  	 \label{fig:4-6}
 \end{figure}
 %----------------------------------------------
+\vspace{0.5em}
+\item 在得到机器译文与参考答案的对齐关系后，需要基于对齐关系计算准确率和召回率。
-\parinterval （2）在得到机器译文与参考答案的对齐关系后，需要基于对齐关系计算准确率和召回率。
+准确率：机器译文中命中单词数与机器译文单词总数的比值。即：
-\parinterval 准确率：机器译文中命中单词数与机器译文单词总数的比值。即：
 \begin{eqnarray}
-\funp{P} = \frac {\textrm{count}_{\textrm{hit}}}{\textrm{count}_{\textrm{candidate}}}
+\funp{P} &=& \frac {\textrm{count}_{\textrm{hit}}}{\textrm{count}_{\textrm{candidate}}}
 \label{eq:4-8}
 \end{eqnarray}
-\parinterval 召回率：机器译文中命中单词数与参考答案单词总数的比值。即：
+召回率：机器译文中命中单词数与参考答案单词总数的比值。即：
 \begin{eqnarray}
-\funp{R} = \frac {\textrm{count}_{\textrm{hit}}}{\textrm{count}_{\textrm{reference}}}
+\funp{R} &=& \frac {\textrm{count}_{\textrm{hit}}}{\textrm{count}_{\textrm{reference}}}
 \label{eq:4-9}
 \end{eqnarray}
-\parinterval 接下来，计算机器译文的得分。利用{\small\sffamily\bfseries{调和均值}}\index{调和均值}（Harmonic-mean）\index{Harmonic-mean}将准确率和召回率结合起来，并加大召回率的重要性将其权重调大，例如将召回率的权重设置为9：
+\vspace{0.5em}
+\item 最后，计算机器译文的得分。利用{\small\sffamily\bfseries{调和均值}}\index{调和均值}（Harmonic-mean）\index{Harmonic-mean}将准确率和召回率结合起来，并加大召回率的重要性将其权重调大，例如将召回率的权重设置为9：
 \begin{eqnarray}
-{F_{\textrm mean}} = \frac {10\funp{PR}}{\funp{R+9P}}
+{F_{\textrm mean}} &=& \frac {10\funp{PR}}{\funp{R+9P}}
 \label{eq:4-10}
 \end{eqnarray}
+\vspace{0.5em}
+\end{itemize}
-\parinterval 在上文提到的评价指标中，无论是准确率、召回率还是$\textrm F_{mean}$，都是基于单个词汇信息衡量译文质量，而忽略了语序问题。为了将语序问题考虑进来，Meteor会考虑更长的匹配：将机器译文按照最长匹配长度分块，并对“块数”较多的机器译文给予惩罚。例如上例中，机器译文被分为了三个“块”——“Can I have it”、“like he”、“？”在这种情况下，看起来上例中的准确率、召回率都还不错，但最终会受到很严重的惩罚。这种罚分机制能够识别出机器译文中的词序问题，因为当待测译文词序与参考答案相差较大时，机器译文将会被分割得比较零散，这种惩罚机制的计算公式如式\eqref{eq:4-11}，其中$\textrm {count}_{\textrm{chunks}}$表示匹配的块数。
+\parinterval 在上文提到的评价指标中，无论是准确率、召回率还是$\textrm F_{mean}$，都是基于单个词汇信息衡量译文质量，而忽略了语序问题。为了将语序问题考虑进来，Meteor会考虑更长的匹配：将机器译文按照最长匹配长度分块，并对“块数”较多的机器译文给予惩罚。例如图\ref{fig:4-6}显示的最终词对齐结果中，机器译文被分为了三个“块”——“Can I have it”、“like he”、“？”在这种情况下，看起来上例中的准确率、召回率都还不错，但最终会受到很严重的惩罚。这种罚分机制能够识别出机器译文中的词序问题，因为当待测译文词序与参考答案相差较大时，机器译文将会被分割得比较零散，这种惩罚机制的计算公式如式\eqref{eq:4-11}，其中$\textrm {count}_{\textrm{chunks}}$表示匹配的块数。
 \begin{eqnarray}
-\textrm {Penalty} = 0.5 \cdot {\left({\frac{{\textrm {count}}_{\textrm {chunks}}}{\textrm {count}_{\textrm{hit}}}} \right)^3}
+\textrm {Penalty} &=& 0.5 \cdot {\left({\frac{{\textrm {count}}_{\textrm {chunks}}}{\textrm {count}_{\textrm{hit}}}} \right)^3}
 \label{eq:4-11}
 \end{eqnarray}
 \parinterval Meteor评价方法的最终评分为：
 \begin{eqnarray}
-\textrm {score} = { F_{\textrm mean}} \cdot {(1 - \textrm {Penalty})}
+\textrm {score} &=& { F_{\textrm mean}} \cdot {(1 - \textrm {Penalty})}
 \label{eq:4-12}
 \end{eqnarray}
@@ -456,7 +449,7 @@ His house is on the south bank of the river.
 \subsubsection{1.增大参考答案集}
-\parinterval BLEU、Meteor、TER等自动评价方法的结果往往与人工评价结果存在差距，一个主要原因是这些自动评价方法通过直接比对机器译文与有限的参考答案之间的“外在差异”，由于参考答案集可覆盖的人类译文数量过少，当机器译文本来十分合理但却未被包含在参考答案集中时，就会将其质量过分低估。
+\parinterval BLEU、Meteor、TER等自动评价方法的结果往往与人工评价结果存在差距。这些自动评价方法直接比对机器译文与有限数量的参考答案之间的“外在差异”，由于参考答案集可覆盖的人类译文数量过少，当机器译文本来十分合理但却未被包含在参考答案集中时，其质量就会被过分低估。
 \parinterval 针对这个问题，HyTER自动评价方法致力于得到所有可能译文的紧凑编码，从而实现自动评价过程中访问所有合理的译文\upcite{DBLP:conf/naacl/DreyerM12}。这种评价方法的原理非常简单直观：
@@ -474,7 +467,7 @@ His house is on the south bank of the river.
 \vspace{0.5em}
 \item 通过已有同义单元和附加单词的组合用于覆盖更大的语言片段。在生成参考答案时就是采用这种方式不断覆盖更大的语言片段，直到将所有可能的参考答案覆盖进去。例如可以将短语[THE-SUPPORT-RATE]与“the proposal”组合为“[THE-SUPPORT-RATE] for the proposal”。
 \vspace{0.5em}
-\item 利用同义单元的组合将所有所有合理的人类译文都编码出来。中文句子“对提案的支持率接近于0”翻译为英文，其可能的参考答案被编码成：
+\item 利用同义单元的组合将所有所有合理的人类译文都编码出来。将中文句子“对提案的支持率接近于 0”翻译为英文，图\ref{fig:4-7}展示了其参考答案的编码结果。
 \vspace{0.5em}
 \end{itemize}
@@ -487,7 +480,7 @@ His house is on the south bank of the river.
 \end{figure}
 %----------------------------------------------
-\parinterval 从上面的例子中可以看出，HyTER方法通过构造同义单元的方式，可以列举出译文中每个片段的所有可能的表达方式，从而增大参考答案的数量，上例中的每一条路径都代表一个参考答案。但是这种对参考答案集的编码方式存在问题，同义单元之间的组合往往存在一定的限制关系\upcite{DBLP:conf/tsd/BojarMTZ13}，使用HyTER方法会导致参考答案集中包含有错误的参考答案。
+\parinterval 从图\ref{fig:4-7}中可以看出，HyTER方法通过构造同义单元的方式，可以列举出译文中每个片段的所有可能的表达方式，从而增大参考答案的数量，图\ref{fig:4-7}中的每一条路径都代表一个参考答案。但是这种对参考答案集的编码方式存在问题，同义单元之间的组合往往存在一定的限制关系\upcite{DBLP:conf/tsd/BojarMTZ13}，使用HyTER方法会导致参考答案集中包含有错误的参考答案。
 \begin{example}
 将中文“市政府批准了一项新规定”分别翻译为英语和捷克语，使用HyTER构造的参考答案集分别如图\ref{fig:4-8}(a)和(b)所示\upcite{DBLP:conf/tsd/BojarMTZ13}：
@@ -551,13 +544,13 @@ His house is on the south bank of the river.
 \parinterval DREEM方法中选取了能够反映句子中使用的特定词汇的One-hot向量、能够反映词汇信息的词嵌入向量\upcite{bengio2003a}、能够反映句子的合成语义信息的{\small\sffamily\bfseries{递归自动编码}}\index{递归自动编码}（Recursive Auto-encoder Embedding，RAE）\index{Recursive Auto-encoder Embedding}，这三种表示级联在一起，最终形成句子的向量表示。在得到机器译文和参考答案的上述分布式表示后，利用余弦相似度和长度惩罚对机器译文质量进行评价。机器译文$o$和参考答案$g$之间的相似度如公式\eqref{eq:4-16}所示，其中${v_i}(o)$和${v_i}(g)$分别是机器译文和参考答案的向量表示中的第$i$ 个元素，$N$是向量表示的维度大小。
 \begin{eqnarray}
-\textrm {cos}(t,r) = \frac{{\sum\limits_{i = 1}^N {{v_i}(o) \cdot {v_i}(g)} }}{{\sqrt {\sum\limits_{i = 1}^N {v_i^2(o)} } \sqrt {\sum\limits_{i = 1}^N {v_i^2(g)} } }}
+\textrm {cos}(t,r) &=& \frac{{\sum\limits_{i = 1}^N {{v_i}(o) \cdot {v_i}(g)} }}{{\sqrt {\sum\limits_{i = 1}^N {v_i^2(o)} } \sqrt {\sum\limits_{i = 1}^N {v_i^2(g)} } }}
 \label{eq:4-16}
 \end{eqnarray}
 \parinterval 在此基础上，DREEM方法还引入了长度惩罚项，对与参考答案长度相差太多的机器译文进行惩罚，长度惩罚项如公式\eqref{eq:4-17}所示，其中${l_o}$和${l_g}$分别是机器译文和参考答案长度：
 \begin{eqnarray}
-\textrm{BP} = \left\{ \begin{array}{l}
+\textrm{BP} &=& \left\{ \begin{array}{l}
 \exp (1 - {{{l_g}} \mathord{\left/
 {\vphantom {{{l_g}} {{l_o}}}} \right.
 \kern-\nulldelimiterspace} {{l_o}}})\quad {l_o} < {l_g}\\
@@ -570,7 +563,7 @@ His house is on the south bank of the river.
 \parinterval 机器译文的最终得分如下，其中$\alpha$是一个需要手动设置的参数：
 \begin{eqnarray}
-\textrm{score}(o,g) = \textrm{cos}{^\alpha }(o,g) \times \textrm{BP}
+\textrm{score}(o,g) &=& \textrm{cos}{^\alpha }(o,g) \times \textrm{BP}
 \label{eq:4-18}
 \end{eqnarray}
@@ -579,7 +572,7 @@ His house is on the south bank of the river.
 \parinterval 在DREEM方法取得成功后，基于词嵌入的词对齐自动评价方法被提出\upcite{DBLP:journals/corr/MatsuoKS17}，该方法中先得到机器译文与参考答案的词对齐关系后，通过对齐关系中两者的词嵌入相似度来计算机器译文与参考答案的相似度，公式如下：
 \begin{eqnarray}
-\textrm{ASS}(o,g) = \frac{1}{{m \cdot l}}\sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{l} {\varphi (o,g,i,j)} }
+\textrm{ASS}(o,g) &=& \frac{1}{{m \cdot l}}\sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{l} {\varphi (o,g,i,j)} }
 \label{eq:4-19}
 \end{eqnarray}
@@ -623,7 +616,7 @@ His house is on the south bank of the river.
 \parinterval 目前在机器译文质量评价的领域中，有很多研究工作尝试比较各种有参考答案的自动评价方法（主要以BLEU、NIST等基于$n$-gram的方法为主）与人工评价方法的相关性。整体来看，这些方法与人工评价具有一定的相关性，自动评价结果能够较好地反映译文质量\upcite{coughlin2003correlating,doddington2002automatic}。
-\parinterval 但是也有相关研究指出，不应该对有参考答案的自动评价方法过于乐观，而应该存谨慎态度，因为目前的自动评价方法对于流利度的评价并不可靠，同时参考答案的体裁和风格往往会对自动评价结果产生很大影响\upcite{culy2003limits}。同时，有研究者提出，在机器翻译研究过程中，忽略实际的示例翻译而仅仅通过BLEU等自动评价方式得分的提高来表明机器翻译质量的提高是不可取的，因为BLEU的提高并不足以反映翻译质量的真正提高，而在另一些情况下，为了实现翻译质量的显著提高，并不需要提高BLEU\upcite{callison2006re}。
+\parinterval 但是也有相关研究指出，不应该对有参考答案的自动评价方法过于乐观，而应该存谨慎态度，因为目前的自动评价方法对于流利度的评价并不可靠，同时参考答案的体裁和风格往往会对自动评价结果产生很大影响\upcite{culy2003limits}。同时，有研究人员提出，机器翻译研究过程中，在忽略实际示例翻译的前提下，BLEU分数的提高并不意味着翻译质量的真正提高，而在一些情况下，为了实现翻译质量的显著提高，并不需要提高BLEU分数\upcite{callison2006re}。
 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -656,7 +649,7 @@ His house is on the south bank of the river.
 \parinterval 回到机器翻译的问题中来。一个更加基础的问题是：一个系统评价结果的变化在多大范围内是不显著的。利用假设检验的原理，这个问题可以被描述为：评价结果落在$[x-d,x+d]$区间的置信度是$1-\alpha$。换句话说，当系统性能落在$[x-d, x+d]$外，就可以说这个结果与原始的结果有显著性差异。这里$x$通常是系统译文的BLEU计算结果，$[x-d,x+d]$是其对应的置信区间。而$d$和$\alpha$有很多计算方法，比如，如果假设评价结果服从正态分布，可以简单的计算$d$。
 \begin{eqnarray}
-d=t \frac{s}{\sqrt{n}}
+d&=&t \frac{s}{\sqrt{n}}
 \label{eq:4-21}
 \end{eqnarray}
@@ -666,7 +659,7 @@ d=t \frac{s}{\sqrt{n}}
 \parinterval 最常用的方法是使用Bootstrap重采样技术\upcite{DBLP:books/sp/EfronT93}从一个固定测试集中采样不同的句子组成不同的测试集，之后在这些测试集上进行假设检验\upcite{DBLP:conf/emnlp/Koehn04}。此后，有工作指出了Bootstrap重采样方法存在隐含假设的不合理之处，并提出了使用近似随机化\upcite{noreen1989computer}方法计算自动评价方法统计显著性\upcite{DBLP:conf/acl/RiezlerM05}。另有研究工作着眼于研究自动评价结果差距大小、测试集规模、系统相似性等因素对统计显著性的影响，以及在不同领域的测试语料中计算的统计显著性是否具有通用性的问题\upcite{DBLP:conf/emnlp/Berg-KirkpatrickBK12}。
-\parinterval 在所有自然语言处理系统的结果对比中，显著性检验是十分必要的。很多时候不同系统性能的差异性很小，因此需要确定一些微小的进步是否是“真”的，还是只是一些随机事件。但是另一方面，从实践的角度看，当某个系统性能的提升达到一个绝对值，往往是显著的。比如，在机器翻译，BLEU提升0.5$\%$一般都是比较明显的进步。也有研究对这种观点进行了论证，也发现其中具有一定的科学性\upcite{DBLP:conf/emnlp/Berg-KirkpatrickBK12}。因此，在机器翻译系统研发中类似的方式也是可以采用的。
+\parinterval 在所有自然语言处理系统的结果对比中，显著性检验是十分必要的。很多时候不同系统性能的差异性很小，因此需要确定一些微小的进步是否是“真”的，还是只是一些随机事件。但是从实践的角度看，当某个系统性能的提升达到一个绝对值，这种性能提升效果往往是显著的。比如，在机器翻译，BLEU提升0.5$\%$一般都是比较明显的进步。也有研究对这种观点进行了论证，也发现其中具有一定的科学性\upcite{DBLP:conf/emnlp/Berg-KirkpatrickBK12}。因此，在机器翻译系统研发中类似的方式也是可以采用的。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -801,7 +794,7 @@ d=t \frac{s}{\sqrt{n}}
 \item 预测译文句子的后编辑工作量。在最近的研究中，句子级的质量评估一直在探索各种类型的离散或连续的后编辑标签。例如，通过测量以秒为单位的后编辑时间对译文句子进行评分；通过测量预测后编辑过程所需的击键数对译文句子进行评分；通过计算{\small\sffamily\bfseries{人工译后编辑距离}}\index{人工译后编辑距离}（Human Translation Error Rate，HTER）\index{Human Translation Error Rate}，即在后编辑过程中编辑（插入/删除/替换）数量与参考翻译长度的占比率对译文句子进行评分。HTER的计算公式为：
 \vspace{0.5em}
 \begin{eqnarray}
-\textrm{HTER}= \frac{\mbox{编辑操作数目}}{\mbox{翻译后编辑结果长度}}
+\textrm{HTER}&=& \frac{\mbox{编辑操作数目}}{\mbox{翻译后编辑结果长度}}
 \label{eq:4-20}
 \end{eqnarray}
@@ -844,7 +837,7 @@ Reference： A few days ago, {\red he} contacted the News Channel and said that 
 \parinterval 在文档级质量评估任务中，需要对译文文档做一些更细粒度的注释，注释内容包括错误位置、错误类型和错误的严重程度，最终在注释的基础上对译文文档质量进行评估。
-\parinterval 与更细粒度的词级和句子级的质量评价相比，文档级质量评估更加复杂。其难点之一在于文档级的质量评估过程中需要根据一些主观的质量标准去对文档进行评分，例如在注释的过程中，对于错误的严重程度并没有严格的界限和规定，只能靠评测人员主观判断，这就意味着随着出现主观偏差的注释的增多，文档级质量评估的参考价值会大打折扣。另一方面，根据所有注释（错误位置、错误类型及其严重程度）对整个文档进行评分本身就具有不合理性，因为译文中有些在抛开上下文环境的情况下可以并判定为“翻译得不错的”单词和句子，一旦被放在文档中的语境后就可能变得不合理，而某些在无语境条件下看起来翻译得“ 糟糕透了”的单词和句子，一旦被放在文档中的语境中可能会变得恰到好处。此外，构建一个质量评测模型势必需要大量的标注数据，而文档级质量评测所需要的带有注释的数据的获取代价相当高。
+\parinterval 与更细粒度的词级和句子级的质量评价相比，文档级质量评估更加复杂。其难点之一在于文档级的质量评估过程中需要根据一些主观的质量标准去对文档进行评分，例如在注释的过程中，对于错误的严重程度并没有严格的界限和规定，只能靠评测人员主观判断，这就意味着随着出现主观偏差的注释的增多，文档级质量评估的参考价值会大打折扣。另一方面，根据所有注释（错误位置、错误类型及其严重程度）对整个文档进行评分本身就具有不合理性，因为译文中有些在抛开上下文语境时可以并判定为“翻译得不错的”单词和句子，一旦被放在上下文语境中就可能变得不合理，而某些在无语境条件下看起来翻译得“ 糟糕透了”的单词和句子，一旦被放在文档中的语境中可能会变得恰到好处。此外，构建一个质量评测模型势必需要大量的标注数据，而文档级质量评测所需要的带有注释的数据的获取代价相当高。
 \parinterval 实际上，文档级质量评估与其它文档级自然语言处理任务面临的问题是一样的。由于数据稀缺，无论是系统研发，还是结果评价都面临很大挑战。这些问题也会在本书的{\chaptersixteen}和{\chapterseventeen}进行讨论。
@@ -889,7 +882,7 @@ Reference： A few days ago, {\red he} contacted the News Channel and said that 
 \vspace{0.5em}
 \item 句子级和文档级质量评估目前大多通过回归算法实现。由于在句子级和文档级的质量评估中，标签是使用连续数字（得分情况）表示的，因此回归算法是最合适的选择。最初的工作中，研究人员们多采用传统的机器学习回归算法\upcite{DBLP:conf/wmt/Bicici13a,DBLP:conf/wmt/SouzaGBTN14,DBLP:conf/wmt/HildebrandV13}，而近年来，研究人员则更青睐于使用神经网络方法进行句子级和文档级质量评估；
 \vspace{0.5em}
-\item 单词级和短语级质量评估多由分类算法实现。对于单词级质量评估任务中标记“OK”或“BAD”，这对应了经典的二分类问题，因此可以使用分类算法对其进行预测。自动分类算法在第三章已经涉及，质量评估中直接使用成熟的分类器即可。此外，使用神经网络方法进行分类也是不错的选择。
+\item 单词级和短语级质量评估多由分类算法实现。在单词级质量评估任务中，需要对每个位置的单词标记“OK”或“BAD”，这对应了经典的二分类问题，因此可以使用分类算法对其进行预测。自动分类算法在第三章已经涉及，质量评估中直接使用成熟的分类器即可。此外，使用神经网络方法进行分类也是不错的选择。
 \vspace{0.5em}
 \end{itemize}
@@ -912,7 +905,7 @@ Reference： A few days ago, {\red he} contacted the News Channel and said that 
 \vspace{0.5em}
 \end{itemize}
-\parinterval 需要注意的是，质量评估的应用模式还没有完全得到验证。这一方面是由于，质量评估的应用非常依赖与人的交互过程。但是，改变人的工作习惯是很困难的，因此质量评估系统在应用时往往需要很长的时间适应到场景中，或者说人也要适应质量评估系统的行为。另一方面，质量评估的很多应用场景还没有完全被发掘出来，需要更长的时间进行探索。
+\parinterval 需要注意的是，质量评估的应用模式还没有完全得到验证。这一方面是由于，质量评估的应用非常依赖与人的交互过程。但是，改变人的工作习惯是很困难的，因此质量评估系统在实际场景中的应用往往需要很长时间，或者说人也要适应质量评估系统的行为。另一方面，质量评估的很多应用场景还没有完全被发掘出来，需要更长的时间进行探索。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -931,7 +924,7 @@ Reference： A few days ago, {\red he} contacted the News Channel and said that 
 \vspace{0.5em}
 \item 译文质量的多角度评价。章节内主要介绍的几种经典方法如BLEU、TER、METEOR等，大都是从某个单一的角度计算机器译文和参考答案的相似性，如何对译文从多个角度进行综合评价是需要进一步思考的问题，\ref{Evaluation method of Multi Strategy fusion}节中介绍的多策略融合评价方法就可以看作是一种多角度评价方法，其思想是将各种评价方法下的译文得分通过某种方式进行组合，从而实现对译文的综合评价。译文质量多角度评价的另一种思路则是直接将BLEU、TER、Meteor等多种指标看做是某种特征，使用分类\upcite{kulesza2004learning,corston2001machine}、回归\upcite{albrecht2008regression}、排序\upcite{duh2008ranking}等机器学习手段形成一种综合度量。此外，也有相关工作专注于多等级的译文质量评价，使用聚类算法将大致译文按其质量分为不同等级，并对不同质量等级的译文按照不同权重组合几种不同的评价方法\upcite{chen2015multi}。
 \vspace{0.5em}
-\item 不同评价方法的应用场景有明显不同：人工评价主要用于需要对机器翻译系统进行准确的评估的场合。例如，在系统对比中利用人工评价方法对不同系统进行人工评价、给出最终排名，或上线机器翻译服务时对翻译品质进行详细的测试；有参考答案的自动评价则可以为机器翻译系统提供快速、相对可靠的评价。在机器翻译系统的快速研发过程中，一般都使用有参考答案的自动评价方法对最终模型的性能进行评估。有相关研究工作专注在机器翻译模型的训练过程中充分利用评价信息进行参数调优（如BLEU分数），其中比较有代表性的工作包括最小错误率训练\upcite{DBLP:conf/acl/Och03}、最小风险训练\upcite{DBLP:conf/acl/ShenCHHWSL16,he2012maximum}等。这部分内容可以参考{\chapterseven}和{\chapterthirteen}进行进一步阅读；无参考答案的质量评估主要用来对译文质量做出预测，经常被应用在是在一些无法提供参考译文的实时翻译场景中，例如人机交互过程、自动纠错、后编辑等\upcite{DBLP:conf/wmt/FreitagCR19}。
+\item 不同评价方法的应用场景有明显不同：人工评价主要用于需要对机器翻译系统进行准确的评估的场合。例如，在系统对比中利用人工评价方法对不同系统进行人工评价、给出最终排名，或上线机器翻译服务时对翻译品质进行详细的测试；有参考答案的自动评价则可以为机器翻译系统提供快速、相对可靠的评价。在机器翻译系统的快速研发过程中，一般都使用有参考答案的自动评价方法对最终模型的性能进行评估。有相关研究工作专注于在机器翻译模型的训练过程中利用评价信息（如BLEU分数）进行参数调优，其中比较有代表性的工作包括最小错误率训练\upcite{DBLP:conf/acl/Och03}、最小风险训练\upcite{DBLP:conf/acl/ShenCHHWSL16,he2012maximum}等。这部分内容可以参考{\chapterseven}和{\chapterthirteen}进行进一步阅读；无参考答案的质量评估主要用来对译文质量做出预测，经常被应用在一些无法提供参考译文的实时翻译场景中，例如人机交互过程、自动纠错、后编辑等\upcite{DBLP:conf/wmt/FreitagCR19}。
 \vspace{0.5em}
 \item 另一个比较值得关注的一个研究问题是如何使模型更加鲁棒，因为通常情况下，一个质量评估模型会受语种、评价策略等问题的约束，设计一个能应用于任何语种，同时从单词、短语、句子等各个等级对译文质量进行评估的模型是很有难度的。Biçici等人最先关注质量评估的鲁棒性问题，并设计开发了一种与语言无关的机器翻译性能预测器\upcite{DBLP:journals/mt/BiciciGG13}，此后又在该工作的基础上研究如何利用外在的、与语言无关的特征对译文进行句子级别的质量评估\upcite{DBLP:conf/wmt/BiciciW14}，该项研究的最终成果是一个与语言无关，可以从各个等级对译文质量进行评估的模型——RTMs（Referential Translation Machines）\upcite{DBLP:conf/wmt/BiciciLW15a}。
 \vspace{0.5em}

--- a/Chapter5/Figures/figure-a-more-detailed-explanation-of-formula-3.40.tex
+++ b/Chapter5/Figures/figure-a-more-detailed-explanation-of-formula-3.40.tex
@@ -9,8 +9,8 @@
 \begin{tikzpicture}
 \node [anchor=west,inner sep=2pt,minimum height=2em] (eq1) at (0,0) {$f(s_u|t_v)$};
-\node [anchor=west,inner sep=2pt] (eq2) at ([xshift=-2pt]eq1.east) {$=$};
+\node [anchor=west,inner sep=2pt] (eq2) at ([xshift=-1pt]eq1.east) {$=$};
-\node [anchor=west,inner sep=2pt,minimum height=2em] (eq3) at ([xshift=-2pt]eq2.east) {$\lambda_{t_v}^{-1}$};
+\node [anchor=west,inner sep=2pt,minimum height=2em] (eq3) at ([xshift=-1pt]eq2.east) {$\lambda_{t_v}^{-1}$};
 \node [anchor=west,inner sep=2pt,minimum height=3.0em] (eq4) at ([xshift=-3pt]eq3.east) {\footnotesize{$\frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)$}};
 \node [anchor=west,inner sep=2pt,minimum height=3.0em] (eq5) at ([xshift=1pt]eq4.east) {\footnotesize{$\sum\limits_{j=1}^{m} \delta(s_j,s_u) \sum\limits_{i=0}^{l} \delta(t_i,t_v)$}};
 \node [anchor=west,inner sep=2pt,minimum height=3.0em] (eq6) at ([xshift=1pt]eq5.east) {$\frac{f(s_u|t_v)}{\sum_{i=0}^{l}f(s_u|t_i)}$};
@@ -27,7 +27,7 @@
 }
 {
-\node [anchor=south west,inner sep=2pt] (label1) at (eq4.north west) {{\scriptsize{翻译概率$\textrm{P}(\mathbf{s}|\mathbf{t})$}}};
+\node [anchor=south west,inner sep=2pt] (label1) at (eq4.north west) {{\scriptsize{翻译概率$\funp{P}(\seq{s}|\seq{t})$}}};
 }
 {
 \node [anchor=south west,inner sep=2pt] (label2) at (eq5.north west) {{\scriptsize{配对的总次数}}};

--- a/Chapter5/Figures/figure-different-alignment-comparison.tex
+++ b/Chapter5/Figures/figure-different-alignment-comparison.tex
@@ -13,7 +13,7 @@
    \draw [-] (s2.south) -- (t2.north);
    \node [anchor=center,draw=ublue,circle,thick,fill=white,inner sep=1pt,circular drop shadow={shadow xshift=0.1em,shadow yshift=-0.1em}] (mark) at ([xshift=0.8em,yshift=-0.7em]s2.south east) {{\color{ugreen} \tiny{\textbf{Yes}}}};
    }
-    \node [anchor=center] (labela) at ([xshift=2em,yshift=-1.0em]t1.south) {\scriptsize{(a)}};
+    \node [anchor=center] (labela) at ([xshift=2em,yshift=-1.0em]t1.south) {\small{(a)}};
    \end{scope}
    \begin{scope}[xshift=2.3in]
    {\small
@@ -25,7 +25,7 @@
    \draw [-] (s1.south) -- (t2.north);
    \node [anchor=center,draw=ublue,circle,thick,fill=white,inner sep=1.5pt,circular drop shadow={shadow xshift=0.1em,shadow yshift=-0.1em}] (mark) at ([xshift=0.8em,yshift=-0.7em]s2.south east) {{\color{red} \tiny{\textbf{No}}}};
    }
-        \node [anchor=center] (labelb) at ([xshift=2em,yshift=-1.0em]t1.south) {\scriptsize{(b)}};
+        \node [anchor=center] (labelb) at ([xshift=2em,yshift=-1.0em]t1.south) {\small{(b)}};
    \end{scope}
    \begin{scope}[xshift=4.1in]
    {\small
@@ -37,7 +37,7 @@
    \draw [-] (s2.south) -- ([yshift=-0.2em]t1.north);
    \node [anchor=center,draw=ublue,circle,thick,fill=white,inner sep=1pt,circular drop shadow={shadow xshift=0.1em,shadow yshift=-0.1em}] (mark) at ([xshift=0.8em,yshift=-0.7em]s2.south east) {{\color{ugreen} \tiny{\textbf{Yes}}}};
    }
-    \node [anchor=center] (labelc) at ([xshift=2em,yshift=-1.0em]t1.south) {\scriptsize{(c)}};
+    \node [anchor=center] (labelc) at ([xshift=2em,yshift=-1.0em]t1.south) {\small{(c)}};
    \end{scope}
    \end{tikzpicture}

--- a/Chapter5/Figures/figure-different-translation-candidate-space.tex
+++ b/Chapter5/Figures/figure-different-translation-candidate-space.tex
@@ -23,13 +23,13 @@
 \node [draw,dashed,ublue,fill=blue!10,thick,anchor=center,circle,minimum size=18pt] (t6) at ([xshift=3em]t2.east) {};
 \node [draw,dashed,ublue,fill=blue!10,thick,anchor=center,circle,minimum size=18pt] (t7) at ([xshift=3em]t4.east) {};
-\draw [->,thick,] (s.north east) .. controls +(north east:1em) and +(north west:1em).. (t1.north west) node[pos=0.5,below] {\tiny{P ($\seq{t}_1|\seq{s}$)=0.1}};
+\draw [->,thick] (s.north east) .. controls +(north east:1em) and +(north west:1em).. (t1.north west) node[pos=0.5,below] {\tiny{$\funp{P} (\seq{t}_1|\seq{s})=0.1$}};
-\draw [->,thick,] (s.60) .. controls +(50:4em) and +(west:1em).. (t2.west) node[pos=0.5,below] {\tiny{P($\seq{t}_2|\seq{s}$)=0.2}};
+\draw [->,thick] (s.60) .. controls +(50:4em) and +(west:1em).. (t2.west) node[pos=0.5,below] {\tiny{$\funp{P}(\seq{t}_2|\seq{s})=0.2$}};
-\draw [->,thick,] (s.north) .. controls +(70:4em) and +(west:1em).. (t3.west) node[pos=0.5,above,xshift=-1em] {\tiny{P($\seq{t}_3|\seq{s}$)=0.3}};
+\draw [->,thick] (s.north) .. controls +(70:4em) and +(west:1em).. (t3.west) node[pos=0.5,above,xshift=-1em] {\tiny{$\funp{P}(\seq{t}_3|\seq{s})=0.3$}};
-\draw [->,thick,] (s.south east) .. controls +(300:3em) and +(south west:1em).. (t4.south west) node[pos=0.5,below] {\tiny{P($\seq{t}_4|\seq{s}$)=0.1}};
+\draw [->,thick] (s.south east) .. controls +(300:3em) and +(south west:1em).. (t4.south west) node[pos=0.5,below] {\tiny{$\funp{P}(\seq{t}_4|\seq{s})=0.1$}};
-\node [anchor=center] (foot1) at ([xshift=3.8em,yshift=-3em]s1.south) {\footnotesize{人的翻译候选空间}};
+\node [anchor=center] (foot1) at ([xshift=3.8em,yshift=-3.5em]s1.south) {\small{（a） 人的翻译候选空间}};
-\node [anchor=center] (foot2) at ([xshift=7em,yshift=-3em]s.south) {\footnotesize{机器的翻译候选空间}};
+\node [anchor=center] (foot2) at ([xshift=7em,yshift=-3.5em]s.south) {\small{（b） 机器的翻译候选空间}};
 \end{tikzpicture}

--- a/Chapter5/Figures/figure-example-translation-alignment.tex
+++ b/Chapter5/Figures/figure-example-translation-alignment.tex
@@ -4,9 +4,9 @@
-\begin{tabular}{| l | l |}
+\begin{tabular}{| c | c |}
 \hline
-& {\footnotesize{$\prod\limits_{(j,i) \in \hat{A}} \funp{P}(s_j,t_i)$} } \\ \hline
+\rule{0pt}{15pt} 源语言句子“我对你感到满意”的不同翻译结果& {\footnotesize{$\prod\limits_{(j,i) \in \hat{A}} \funp{P}(s_j,t_i)$} } \\ \hline
 \begin{tikzpicture}

--- a/Chapter5/Figures/figure-greedy-mt-decoding-process-1.tex
+++ b/Chapter5/Figures/figure-greedy-mt-decoding-process-1.tex
@@ -63,29 +63,29 @@
 \node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t53) at ([yshift=-0.2em]t52.south) {satisfies};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} .4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} 0.4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} 0.1}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} 0.2}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} .7}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} 0.7}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} 0.3}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} .4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} 0.4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} 0.2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} 0.1}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} 0.2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} 0.2}};
 }
 }
 {\scriptsize
@@ -108,7 +108,7 @@
 \draw [-] (glabel.south west) -- ([xshift=3.5in]glabel.south west);
 \node [anchor=center,rotate=90] (hlabel2) at ([xshift=-1.3em,yshift=-8.5em]glabel.west) {\tiny{$h$存放临时翻译结果}};
-\node [anchor=north west] (foot1) at ([xshift=0.0em,yshift=-18.0em]translabel.south west) {\scriptsize{(a)\; 4:$h = \phi$}};
+\node [anchor=north west] (foot1) at ([xshift=4.0em,yshift=-18.0em]translabel.south west) {\small{(a)\; 4:$h = \phi$}};
 }
 \end{scope}
@@ -173,34 +173,34 @@
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} .4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} 0.4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} 0.1}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} 0.2}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} .7}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} 0.7}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} 0.3}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} .4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} 0.4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} 0.2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} 0.1}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} 0.2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} 0.2}};
 }
 }
@@ -233,7 +233,7 @@
 \draw [-] (glabel.south west) -- ([xshift=3.5in]glabel.south west);
 \node [anchor=center,rotate=90] (hlabel2) at ([xshift=-1.3em,yshift=-8.5em]glabel.west) {\tiny{$h$存放临时翻译结果}};
-\node [anchor=north west] (foot2) at ([xshift=0.0em,yshift=-18.0em]translabel.south west) {\scriptsize{(b)\; 6: \textbf{if} $used[j]=$ \textbf{false} \textbf{then}}};
+\node [anchor=north west] (foot2) at ([xshift=-4.0em,yshift=-18.0em]translabel.south west) {\small{(b)\; 6: \textbf{if} $used[j]=$ \textbf{false} \textbf{then}}};
 }
 {%大大的join
 \node [anchor=center,draw=ublue,circle,thick,fill=white,inner sep=2.5pt,circular drop shadow={shadow xshift=0.1em,shadow yshift=-0.1em}] (join) at ([xshift=4em,yshift=-1em]hlabel.north east) {\tiny{\textsc{Join}}};

--- a/Chapter5/Figures/figure-greedy-mt-decoding-process-3.tex
+++ b/Chapter5/Figures/figure-greedy-mt-decoding-process-3.tex
@@ -63,34 +63,34 @@
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} .4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} 0.4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} 0.1}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} 0.2}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} .7}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} 0.7}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} 0.3}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} .4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} 0.4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} 0.2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} 0.1}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} 0.2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} 0.2}};
 }
 }
@@ -126,7 +126,7 @@
 \node [anchor=center,rotate=90] (hlabel2) at ([xshift=-0.7em,yshift=-7.5em]glabel.west) {\tiny{$h$存放临时翻译结果}};
 }
-\node [anchor=north west] (foot1) at ([xshift=0.0em,yshift=-12.3em]translabel.south west) {\scriptsize{(c)\; 7:  $h = h \cup \textrm{\textsc{Join}}(best,\pi[j])$}};
+\node [anchor=north west] (foot1) at ([xshift=-2.0em,yshift=-12.3em]translabel.south west) {\small{(c)\; 7:  $h = h \cup \textrm{\textsc{Join}}(best,\pi[j])$}};
 {%大大的join
 \node [anchor=center,draw=ublue,circle,thick,fill=white,inner sep=2.5pt,circular drop shadow={shadow xshift=0.1em,shadow yshift=-0.1em}] (join) at ([xshift=4em,yshift=-1em]hlabel.north east) {\tiny{\textsc{Join}}};
 }
@@ -228,34 +228,34 @@
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} .4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} 0.4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} 0.1}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} 0.2}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} .7}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} 0.7}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} 0.3}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} .4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} 0.4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} 0.2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} 0.1}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} 0.3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} 0.2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} 0.2}};
 }
 }
@@ -283,7 +283,7 @@
 \draw [-] (glabel.south west) -- ([xshift=3.5in]glabel.south west);
 \node [anchor=center,rotate=90] (hlabel2) at ([xshift=-0.7em,yshift=-7.5em]glabel.west) {\tiny{$h$存放临时翻译结果}};
-\node [anchor=north west] (foot2) at ([xshift=0.0em,yshift=-23.0em]translabel.south west) {\scriptsize{(d)\; 8: $best = \textrm{\textsc{PruneForTop1}}(h)$}};
+\node [anchor=north west] (foot2) at ([xshift=-5.0em,yshift=-23.0em]translabel.south west) {\small{(d)\; 8: $best = \textrm{\textsc{PruneForTop1}}(h)$}};
 }

--- a/Chapter5/Figures/figure-ibm-model-iteration-process-diagram.tex
+++ b/Chapter5/Figures/figure-ibm-model-iteration-process-diagram.tex
@@ -5,8 +5,8 @@
 \begin{tikzpicture}
 \node [anchor=west,inner sep=2pt,fill=red!20,minimum height=3em] (eq1) at (0,0) {$f(s_u|t_v)$};
-\node [anchor=west,inner sep=2pt] (eq2) at ([xshift=-2pt]eq1.east) {$=$};
+\node [anchor=west,inner sep=2pt] (eq2) at ([xshift=-1pt]eq1.east) {$=$};
-\node [anchor=west,inner sep=2pt] (eq3) at ([xshift=-2pt]eq2.east) {$\lambda_{t_v}^{-1}$};
+\node [anchor=west,inner sep=2pt] (eq3) at ([xshift=-1pt]eq2.east) {$\lambda_{t_v}^{-1}$};
 \node [anchor=west,inner sep=2pt] (eq4) at ([xshift=-2pt]eq3.east) {$\frac{\varepsilon}{(l+1)^{m}}$};
 \node [anchor=west,inner sep=2pt,fill=red!20,minimum height=3em] (eq5) at ([xshift=-2pt]eq4.east) {\footnotesize{$\prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)$}};
 \node [anchor=west,inner sep=2pt] (eq6) at ([xshift=-2pt]eq5.east) {\footnotesize{$\sum\limits_{j=1}^{m} \delta(s_j,s_u) \sum\limits_{i=0}^{l} \delta(t_i,t_v)$}};

--- a/Chapter5/Figures/figure-process-of-machine-translation.tex
+++ b/Chapter5/Figures/figure-process-of-machine-translation.tex
@@ -17,31 +17,31 @@
 \draw [->,very thick,ublue] (s5.south) -- ([yshift=-0.7em]s5.south);
 {\small
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t11) at ([yshift=-1em]s1.south) {I};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=2.5em] (t11) at ([yshift=-1em]s1.south) {I};
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t12) at ([yshift=-0.2em]t11.south) {me};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=2.5em] (t12) at ([yshift=-0.8em]t11.south) {me};
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t13) at ([yshift=-0.2em]t12.south) {I'm};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=2.5em] (t13) at ([yshift=-0.8em]t12.south) {I'm};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl11) at (t11.north west) {\tiny{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl12) at (t12.north west) {\tiny{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl13) at (t13.north west) {\tiny{{\color{white} \textbf{1}}}};
-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t21) at ([yshift=-1em]s2.south) {to};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=2.5em] (t21) at ([yshift=-1em]s2.south) {to};
-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t22) at ([yshift=-0.2em]t21.south) {with};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=2.5em] (t22) at ([yshift=-0.8em]t21.south) {with};
-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t23) at ([yshift=-0.2em]t22.south) {for};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=2.5em] (t23) at ([yshift=-0.8em]t22.south) {for};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl21) at (t21.north west) {\tiny{{\color{white} \textbf{2}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl22) at (t22.north west) {\tiny{{\color{white} \textbf{2}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl23) at (t23.north west) {\tiny{{\color{white} \textbf{2}}}};
-\node [anchor=north,inner sep=2pt,fill=blue!20,minimum height=1.5em,minimum width=2.5em] (t31) at ([yshift=-1em]s3.south) {you};
+\node [anchor=north,inner sep=2pt,fill=blue!20,minimum height=1.6em,minimum width=2.5em] (t31) at ([yshift=-1em]s3.south) {you};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl31) at (t31.north west) {\tiny{{\color{white} \textbf{3}}}};
-\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.5em,minimum width=3em] (t41) at ([yshift=-1em]s4.south) {$\phi$};
+\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.6em,minimum width=3em] (t41) at ([yshift=-1em]s4.south) {$\phi$};
-\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.5em,minimum width=3em] (t42) at ([yshift=-0.2em]t41.south) {feel};
+\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.6em,minimum width=3em] (t42) at ([yshift=-0.8em]t41.south) {feel};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl41) at (t41.north west) {\tiny{{\color{white} \textbf{4}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl42) at (t42.north west) {\tiny{{\color{white} \textbf{4}}}};
-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t51) at ([yshift=-1em]s5.south) {satisfy};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.6em,minimum width=4.5em] (t51) at ([yshift=-1em]s5.south) {satisfy};
-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t52) at ([yshift=-0.2em]t51.south) {satisfied};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.6em,minimum width=4.5em] (t52) at ([yshift=-0.8em]t51.south) {satisfied};
-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t53) at ([yshift=-0.2em]t52.south) {satisfies};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.6em,minimum width=4.5em] (t53) at ([yshift=-0.8em]t52.south) {satisfies};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl51) at (t51.north west) {\tiny{{\color{white} \textbf{5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl52) at (t52.north west) {\tiny{{\color{white} \textbf{5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl53) at (t53.north west) {\tiny{{\color{white} \textbf{5}}}};
@@ -51,22 +51,22 @@
 {\tiny
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt11) at (t11.east) {{\color{white} \textbf{P=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt11) at (t11.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt12) at (t12.east) {{\color{white} \textbf{P=.2}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt12) at (t12.south) {{\color{white} \textbf{$\seq{P}$=0.2}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt13) at (t13.east) {{\color{white} \textbf{P=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt13) at (t13.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt21) at (t21.east) {{\color{white} \textbf{P=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt21) at (t21.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt22) at (t22.east) {{\color{white} \textbf{P=.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt22) at (t22.south) {{\color{white} \textbf{$\seq{P}$=0.3}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt23) at (t23.east) {{\color{white} \textbf{P=.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt23) at (t23.south) {{\color{white} \textbf{$\seq{P}$=0.3}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt31) at (t31.east) {{\color{white} \textbf{P=1}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt31) at (t31.south) {{\color{white} \textbf{$\seq{P}$=1}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt41) at (t41.east) {{\color{white} \textbf{P=.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=5em,fill=black] (pt41) at (t41.south) {{\color{white} \textbf{$\seq{P}$=0.5}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt42) at (t42.east) {{\color{white} \textbf{P=.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=5em,fill=black] (pt42) at (t42.south) {{\color{white} \textbf{$\seq{P}$=0.5}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt51) at (t51.east) {{\color{white} \textbf{P=.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt51) at (t51.south) {{\color{white} \textbf{$\seq{P}$=0.5}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt52) at (t52.east) {{\color{white} \textbf{P=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt52) at (t52.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt53) at (t53.east) {{\color{white} \textbf{P=.1}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt53) at (t53.south) {{\color{white} \textbf{$\seq{P}$=0.1}}};
 }
 }
@@ -76,23 +76,23 @@
 \begin{scope}
 {\small
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (ft11) at ([yshift=-1.2in]t11.west) {I'm};
+\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (ft11) at ([yshift=-1.5in]t11.west) {I'm};
-\node [anchor=center,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=5em] (ft12) at ([xshift=5.0em]ft11.center) {satisfied};
+\node [anchor=center,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (ft12) at ([xshift=5.0em]ft11.center) {satisfied};
 \node [anchor=center,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (ft13) at ([xshift=5.0em]ft12.center) {with};
 \node [anchor=center,inner sep=2pt,fill=blue!20,minimum height=1.5em,minimum width=2.5em] (ft14) at ([xshift=4.0em]ft13.center) {you};
 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (ft21) at ([yshift=-2em]ft11.west) {I'm};
+\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (ft21) at ([yshift=-3em]ft11.west) {I'm};
-\node [anchor=center,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=5em] (ft22) at ([xshift=5.0em]ft21.center) {satisfy};
+\node [anchor=center,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (ft22) at ([xshift=5.0em]ft21.center) {satisfy};
 \node [anchor=center,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (ft23) at ([xshift=5.0em]ft22.center) {to};
 \node [anchor=center,inner sep=2pt,fill=blue!20,minimum height=1.5em,minimum width=2.5em] (ft24) at ([xshift=4.0em]ft23.center) {you};
 }
 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (ft31) at ([yshift=-2em]ft21.west) {I'm};
+\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=2.5em] (ft31) at ([yshift=-3em]ft21.west) {I'm};
-\node [anchor=center,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=5em] (ft32) at ([xshift=5.0em]ft31.center) {satisfy};
+\node [anchor=center,inner sep=2pt,fill=purple!20,minimum height=1.6em,minimum width=4.5em] (ft32) at ([xshift=5.0em]ft31.center) {satisfy};
-\node [anchor=center,inner sep=2pt,fill=blue!20,minimum height=1.5em,minimum width=2.5em] (ft33) at ([xshift=5.0em]ft32.center) {you};
+\node [anchor=center,inner sep=2pt,fill=blue!20,minimum height=1.6em,minimum width=2.5em] (ft33) at ([xshift=5.0em]ft32.center) {you};
-\node [anchor=center,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (ft34) at ([xshift=4.0em]ft33.center) {to};
+\node [anchor=center,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=2.5em] (ft34) at ([xshift=4.0em]ft33.center) {to};
 }
 \node [anchor=north west,inner sep=1pt,fill=black] (ftl11) at (ft11.north west) {\tiny{{\color{white} \textbf{1}}}};
@@ -117,20 +117,20 @@
 {\tiny
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft11) at (ft11.east) {{\color{white} \textbf{P=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft11) at (ft11.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft12) at (ft12.east) {{\color{white} \textbf{P=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pft12) at (ft12.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft13) at (ft13.east) {{\color{white} \textbf{P=.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft13) at (ft13.south) {{\color{white} \textbf{$\seq{P}$=0.3}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft14) at (ft14.east) {{\color{white} \textbf{P=1}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft14) at (ft14.south) {{\color{white} \textbf{$\seq{P}$=1}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft21) at (ft21.east) {{\color{white} \textbf{P=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft21) at (ft21.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft22) at (ft22.east) {{\color{white} \textbf{P=.1}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pft22) at (ft22.south) {{\color{white} \textbf{$\seq{P}$=0.1}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft23) at (ft23.east) {{\color{white} \textbf{P=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft23) at (ft23.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft24) at (ft24.east) {{\color{white} \textbf{P=1}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft24) at (ft24.south) {{\color{white} \textbf{$\seq{P}$=1}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft31) at (ft31.east) {{\color{white} \textbf{P=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft31) at (ft31.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft32) at (ft32.east) {{\color{white} \textbf{P=.1}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pft32) at (ft32.south) {{\color{white} \textbf{$\seq{P}$=0.1}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft33) at (ft33.east) {{\color{white} \textbf{P=1}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft33) at (ft33.south) {{\color{white} \textbf{$\seq{P}$=1}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft34) at (ft34.east) {{\color{white} \textbf{P=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft34) at (ft34.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
 }
 }
@@ -146,34 +146,34 @@
 \end{pgfonlayer}
 {
-\node [anchor=west,inner sep=2pt,minimum height=1.5em,minimum width=2.5em] (ft41) at ([yshift=-2em]ft31.west) {...};
+\node [anchor=west,inner sep=2pt,minimum height=1.5em,minimum width=2.5em] (ft41) at ([yshift=-3em]ft31.west) {...};
 }
 {
-\node [anchor=west,inner sep=2pt,minimum height=1.5em,minimum width=2.5em] (ft42) at ([yshift=-2em]ft32.west) {\scriptsize{{所有翻译单元都是概率化的}}};
+\node [anchor=west,inner sep=2pt,minimum height=1.5em,minimum width=2.5em] (ft42) at ([yshift=-3em]ft32.west) {\scriptsize{{所有翻译单元都是概率化的}}};
-\node [anchor=west,inner sep=1pt,fill=black] (ft43) at (ft42.east) {{\color{white} \tiny{{P=概率}}}};
+\node [anchor=west,inner sep=1pt,fill=black] (ft43) at (ft42.east) {{\color{white} \tiny{{$\seq{P}$=概率}}}};
 }
 }
 \end{scope}
 \begin{scope}
 {\footnotesize
-\node [anchor=east] (label4) at ([yshift=0.4em]ft11.west) {翻译就是一条};
+\node [anchor=east] (label4) at ([yshift=0.0em]ft11.west) {翻译就是一条};
 \node [anchor=north west] (label4part2) at ([yshift=0.7em]label4.south west) {译文选择路径};
 }
 {\footnotesize
-\node [anchor=east] (label5) at ([yshift=0.4em]ft21.west) {不同的译文对};
+\node [anchor=east] (label5) at ([yshift=0.0em]ft21.west) {不同的译文对};
 \node [anchor=north west] (label5part2) at ([yshift=0.7em]label5.south west) {应不同的路径};
 }
 {\footnotesize
-\node [anchor=east] (label6) at ([yshift=0.4em]ft31.west) {单词翻译的词};
+\node [anchor=east] (label6) at ([yshift=0.0em]ft31.west) {单词翻译的词};
 \node [anchor=north west] (label6part2) at ([yshift=0.7em]label6.south west) {序也可能不同};
 }
 {\footnotesize
-\node [anchor=east] (label7) at ([yshift=0.4em]ft41.west) {可能的翻译路};
+\node [anchor=east] (label7) at ([yshift=0.0em]ft41.west) {可能的翻译路};
 \node [anchor=north west] (label7part2) at ([yshift=0.7em]label7.south west) {径非常多};
 }
@@ -181,14 +181,14 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{scope}
 {
-\draw[decorate,thick,decoration={brace,amplitude=5pt}] ([yshift=8em,xshift=2.0em]t53.south east) -- ([xshift=2.0em]t53.south east) node [pos=0.5,right,xshift=0.5em,yshift=2.0em] (label2) {\footnotesize{{从双语数}}};
+\draw[decorate,thick,decoration={brace,amplitude=5pt}] ([yshift=9em,xshift=2.0em]t53.south east) -- ([yshift=-0.5em,xshift=2.0em]t53.south east) node [pos=0.5,right,xshift=0.5em,yshift=2.0em] (label2) {\footnotesize{{从双语数}}};
 \node [anchor=north west] (label2part2) at ([yshift=0.3em]label2.south west) {\footnotesize{{据中自动}}};
 \node [anchor=north west] (label2part3) at ([yshift=0.3em]label2part2.south west) {\footnotesize{{学习词典}}};
 \node [anchor=north west] (label2part4) at ([yshift=0.3em]label2part3.south west) {\footnotesize{{（训练）}}};
 }
 {
-\draw[decorate,thick,decoration={brace,amplitude=5pt}] ([yshift=-1.0em,xshift=6.2em]t53.south west) -- ([yshift=-10.5em,xshift=6.2em]t53.south west) node [pos=0.5,right,xshift=0.5em,yshift=2.0em] (label3) {\footnotesize{{利用概率}}};
+\draw[decorate,thick,decoration={brace,amplitude=5pt}] ([yshift=-2.0em,xshift=6.2em]t53.south west) -- ([yshift=-14.5em,xshift=6.2em]t53.south west) node [pos=0.5,right,xshift=0.5em,yshift=2.0em] (label3) {\footnotesize{{利用概率}}};
 \node [anchor=north west] (label3part2) at ([yshift=0.3em]label3.south west) {\footnotesize{{化的词典}}};
 \node [anchor=north west] (label3part3) at ([yshift=0.3em]label3part2.south west) {\footnotesize{{进行翻译}}};
 \node [anchor=north west] (label3part4) at ([yshift=0.3em]label3part3.south west) {\footnotesize{{（解码）}}};
@@ -202,11 +202,11 @@
 \node [anchor=west] (score1) at ([xshift=1.5em]ft14.east) {\footnotesize{P=0.042}};
 \node [anchor=west] (score2) at ([xshift=1.5em]ft24.east) {\footnotesize{P=0.006}};
 \node [anchor=west] (score3) at ([xshift=1.5em]ft34.east) {\footnotesize{P=0.003}};
-\node [anchor=south] (scorelabel) at ([xshift=-2.0em]score1.north) {\scriptsize{{\color{black}{率给每个译文赋予一个模型得分}}}};
+\node [anchor=south] (scorelabel) at ([xshift=-3.0em]score1.north) {\scriptsize{{\color{black}{率给每个译文赋予一个模型得分}}}};
 \node [anchor=south] (scorelabel2) at ([yshift=-0.5em]scorelabel.north) {\scriptsize{{\color{black}{系统综合单词概率和语言模型概}}}};
 }
 {
-\node [anchor=north] (scorelabel2) at (score3.south) {\scriptsize{{选择得分}}};
+\node [anchor=north] (scorelabel2) at ([yshift=-1.5em]score3.south) {\scriptsize{{选择得分}}};
 \node [anchor=north west] (scorelabel2part2) at ([xshift=-0.5em,yshift=0.5em]scorelabel2.south west) {\scriptsize{{最高的译文}}};
 \node [anchor=center,draw=ublue,circle,thick,fill=white,inner sep=1pt,circular drop shadow={shadow xshift=0.05em,shadow yshift=-0.05em}] (head1) at ([xshift=0.3em]score1.east) {\scriptsize{{\color{ugreen} {ok}}}};
 }
@@ -216,10 +216,10 @@
 \begin{scope}
 {
-\draw [->,ultra thick,ublue,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.3em]t13.west) -- ([xshift=0.8em,yshift=-0.3em]t13.east) -- ([xshift=-0.2em,yshift=-0.3em]t21.west) -- ([xshift=0.8em,yshift=-0.3em]t21.east) -- ([xshift=-0.2em,yshift=-0.3em]t31.west) -- ([xshift=0.8em,yshift=-0.3em]t31.east) -- ([xshift=-0.2em,yshift=-0.3em]t41.west) -- ([xshift=0.8em,yshift=-0.3em]t41.east) -- ([xshift=-0.2em,yshift=-0.3em]t51.west) -- ([xshift=1.2em,yshift=-0.3em]t51.east);
+\draw [->,ultra thick,ublue,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.42em]t13.west) -- ([xshift=0.8em,yshift=-0.42em]t13.east) -- ([xshift=-0.2em,yshift=-0.42em]t21.west) -- ([xshift=0.8em,yshift=-0.42em]t21.east) -- ([xshift=-0.2em,yshift=-0.42em]t31.west) -- ([xshift=0.8em,yshift=-0.42em]t31.east) -- ([xshift=-0.2em,yshift=-0.42em]t41.west) -- ([xshift=0.8em,yshift=-0.42em]t41.east) -- ([xshift=-0.2em,yshift=-0.42em]t51.west) -- ([xshift=1.2em,yshift=-0.42em]t51.east);
 }
-\draw [->,ultra thick,red,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.5em]t13.west) -- ([xshift=0.8em,yshift=-0.5em]t13.east) -- ([xshift=-0.2em,yshift=-0.5em]t22.west) -- ([xshift=0.8em,yshift=-0.5em]t22.east) -- ([xshift=-0.2em,yshift=-0.5em]t31.west) -- ([xshift=0.8em,yshift=-0.5em]t31.east) -- ([xshift=-0.2em,yshift=-0.5em]t41.west) -- ([xshift=0.8em,yshift=-0.5em]t41.east) -- ([xshift=-0.2em,yshift=-0.5em]t52.west) -- ([xshift=1.2em,yshift=-0.5em]t52.east);
+\draw [->,ultra thick,red,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.62em]t13.west) -- ([xshift=0.8em,yshift=-0.62em]t13.east) -- ([xshift=-0.2em,yshift=-0.62em]t22.west) -- ([xshift=0.8em,yshift=-0.62em]t22.east) -- ([xshift=-0.2em,yshift=-0.62em]t31.west) -- ([xshift=0.8em,yshift=-0.62em]t31.east) -- ([xshift=-0.2em,yshift=-0.62em]t41.west) -- ([xshift=0.8em,yshift=-0.62em]t41.east) -- ([xshift=-0.2em,yshift=-0.62em]t52.west) -- ([xshift=1.2em,yshift=-0.62em]t52.east);
 \end{scope}

--- a/Chapter5/Figures/figure-scores-of-different-translation_model&language_model.tex
+++ b/Chapter5/Figures/figure-scores-of-different-translation_model&language_model.tex
 %%% outline
 %-------------------------------------------------------------------------
-\begin{tabular}{| l | l |}
+\begin{tabular}{| c | l |}
 \hline
-& {\footnotesize{$\prod\limits_{(j,i) \in \hat{A}} \funp{P}(s_j,t_i)$} \color{red}{{\footnotesize{$\times\funp{P}_{\textrm{lm}}(\mathbf{t})$}}}} \\ \hline
+\rule{0pt}{15pt} 源语言句子“我对你感到满意”的不同翻译结果& {\footnotesize{$\prod\limits_{(j,i) \in \hat{A}} \funp{P}(s_j,t_i)$} \color{red}{{\footnotesize{$\times\funp{P}_{\textrm{lm}}(\mathbf{t})$}}}} \\ \hline
 \begin{tikzpicture}

--- a/Chapter5/Figures/figure-zh-en-translation-sentence-pairs&word-alignment-connection.tex
+++ b/Chapter5/Figures/figure-zh-en-translation-sentence-pairs&word-alignment-connection.tex
@@ -11,7 +11,7 @@
 \node [anchor=west] (s3) at ([xshift=0.5em]s2.east) {你\footnotesize{$_3$}};
 \node [anchor=west] (s4) at ([xshift=0.5em]s3.east) {感到\footnotesize{$_4$}};
 \node [anchor=west] (s5) at ([xshift=0.5em]s4.east) {满意\footnotesize{$_5$}};
-\node [anchor=east] (s) at (s1.west) {$\mathbf{s}=$};
+\node [anchor=east] (s) at (s1.west) {$\seq{s}=$};
 \end{scope}
 \begin{scope}[yshift=-3.0em]
@@ -20,7 +20,7 @@
 \node [anchor=west] (t3) at ([xshift=0.3em,yshift=0.1em]t2.east) {satisfied\footnotesize{$_3$}};
 \node [anchor=west] (t4) at ([xshift=0.3em]t3.east) {with\footnotesize{$_4$}};
 \node [anchor=west] (t5) at ([xshift=0.3em,yshift=-0.2em]t4.east) {you\footnotesize{$_5$}};
-\node [anchor=east] (t) at ([xshift=-0.3em]t1.west) {$\mathbf{t}=$};
+\node [anchor=east] (t) at ([xshift=-0.3em]t1.west) {$\seq{t}=$};
 \end{scope}

--- a/Chapter5/chapter5.tex
+++ b/Chapter5/chapter5.tex
@@ -136,7 +136,7 @@ IBM模型由Peter F. Brown等人于上世纪九十年代初提出\upcite{DBLP:jo
 \parinterval 对于第一个问题，可以给计算机一个翻译词典，这样计算机可以发挥计算方面的优势，尽可能多地把翻译结果拼装出来。比如，可以把每个翻译结果看作是对单词翻译的拼装，这可以被形象地比作贯穿多个单词的一条路径，计算机所做的就是尽可能多地生成这样的路径。图\ref{fig:5-4}中蓝色和红色的折线就分别表示了两条不同的译文选择路径，区别在于“满意”和“对”的翻译候选是不一样的，蓝色折线选择的是“satisfy”和“to”，而红色折线是“satisfied”和“with”。换句话说，不同的译文对应不同的路径（即使词序不同也会对应不同的路径）。
-\parinterval 对于第二个问题，尽管机器能够找到很多译文选择路径，但它并不知道哪些路径是好的。说地再直白一些，简单地枚举路径实际上就是一个体力活，没有太多的智能。因此计算机还需要再聪明一些，运用它的能够“掌握”的知识判断翻译结果的好与坏。这一步是最具挑战的，当然也有很多思路。在统计机器翻译中，这个问题被定义为：设计一种统计模型，它可以给每个译文一个可能性，而这个可能性越高表明译文越接近人工翻译。
+\parinterval 对于第二个问题，尽管机器能够找到很多译文选择路径，但它并不知道哪些路径是好的。说地再直白一些，简单地枚举路径实际上就是一个体力活，没有太多的智能。因此计算机还需要再聪明一些，运用它的能够“掌握”的知识判断翻译结果的好与坏。这一步是最具挑战的，当然也有很多思路来解决这个问题。在统计机器翻译中，这个问题被定义为：设计一种统计模型，它可以给每个译文一个可能性，而这个可能性越高表明译文越接近人工翻译。
 \parinterval 如图\ref{fig:5-4}所示，每个单词翻译候选的右侧黑色框里的数字就是单词的翻译概率，使用这些单词的翻译概率，可以得到整句译文的概率（用符号$\funp{P}$表示）。这样，就用概率化的模型描述了每个翻译候选的可能性。基于这些翻译候选的可能性，机器翻译系统可以对所有的翻译路径进行打分，比如，图\ref{fig:5-4}中第一条路径的分数为0.042，第二条是0.006，以此类推。最后，系统可以选择分数最高的路径作为源语言句子的最终译文。
@@ -262,7 +262,7 @@ $\seq{t}$ = machine\; \underline{translation}\; is\; a\; process\; of\; generati
 \begin{eqnarray}
 \funp{P}(\text{机器},\text{translation}; \seq{s},\seq{t})  & = & \frac{2}{121} \\
 \funp{P}(\text{机器},\text{look}; \seq{s},\seq{t})  & =  & \frac{0}{121}
-\label{eq:5-3}
+\label{eq:5-4}
 \end{eqnarray}
 \noindent 注意，由于“look”没有出现在数据中，因此$\funp{P}(\text{机器},\text{look}; \seq{s},\seq{t})=0$。这时，可以使用{\chaptertwo}介绍的平滑算法赋予它一个非零的值，以保证在后续的步骤中整个翻译模型不会出现零概率的情况。
@@ -275,11 +275,11 @@ $\seq{t}$ = machine\; \underline{translation}\; is\; a\; process\; of\; generati
 \parinterval 如果有更多的句子，上面的方法同样适用。假设，有$K$个互译句对$\{(\seq{s}^{[1]},\seq{t}^{[1]})$,...,\\$(\seq{s}^{[K]},\seq{t}^{[K]})\}$。仍然可以使用基于相对频次的方法估计翻译概率$\funp{P}(x,y)$，具体方法如下:
 \begin{eqnarray}
-\funp{P}(x,y)  =  \frac{{\sum_{k=1}^{K} c(x,y;\seq{s}^{[k]},\seq{t}^{[k]})}}{\sum_{k=1}^{K}{{\sum_{x',y'} c(x',y';\seq{s}^{[k]},\seq{t}^{[k]})}}}
+\funp{P}(x,y)  &=&  \frac{{\sum_{k=1}^{K} c(x,y;\seq{s}^{[k]},\seq{t}^{[k]})}}{\sum_{k=1}^{K}{{\sum_{x',y'} c(x',y';\seq{s}^{[k]},\seq{t}^{[k]})}}}
-\label{eq:5-4}
+\label{eq:5-5}
 \end{eqnarray}
-\parinterval 与公式\eqref{eq:5-1}相比，公式\eqref{eq:5-4}的分子、分母都多了一项累加符号$\sum_{k=1}^{K} \cdot$，它表示遍历语料库中所有的句对。换句话说，当计算词的共现次数时，需要对每个句对上的计数结果进行累加。从统计学习的角度，使用更大规模的数据进行参数估计可以提高结果的可靠性。计算单词的翻译概率也是一样，在小规模的数据上看，很多翻译现象的特征并不突出，但是当使用的数据量增加到一定程度，翻译的规律会很明显的体现出来。
+\parinterval 与公式\eqref{eq:5-1}相比，公式\eqref{eq:5-5}的分子、分母都多了一项累加符号$\sum_{k=1}^{K} \cdot$，它表示遍历语料库中所有的句对。换句话说，当计算词的共现次数时，需要对每个句对上的计数结果进行累加。从统计学习的角度，使用更大规模的数据进行参数估计可以提高结果的可靠性。计算单词的翻译概率也是一样，在小规模的数据上看，很多翻译现象的特征并不突出，但是当使用的数据量增加到一定程度，翻译的规律会很明显的体现出来。
 \parinterval 举个例子，实例\ref{eg:5-2}展示了一个由两个句对构成的平行语料库。
@@ -303,10 +303,10 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?
                                                                            & = & \frac{4 + 1}{|\seq{s}^{[1]}| \times |\seq{t}^{[1]}| + |\seq{s}^{[2]}| \times |\seq{t}^{[2]}|} \nonumber \\
                                                                            & = & \frac{4 + 1}{11 \times 11 + 5 \times 7} \nonumber \\
                                                                            & = & \frac{5}{156}
-\label{eq:5-5}
+\label{eq:5-6}
 \end{eqnarray}
 }
-\parinterval 公式\eqref{eq:5-5}所展示的计算过程很简单，分子是两个句对中“翻译”和“translation”共现次数的累计，分母是两个句对的源语言单词和目标语言单词的组合数的累加。显然，这个方法也很容易推广到处理更多句子的情况。
+\parinterval 公式\eqref{eq:5-6}所展示的计算过程很简单，分子是两个句对中“翻译”和“translation”共现次数的累计，分母是两个句对的源语言单词和目标语言单词的组合数的累加。显然，这个方法也很容易推广到处理更多句子的情况。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -323,14 +323,14 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?
 \subsubsection{1. 基础模型}
-\parinterval 计算句子级翻译概率并不简单。因为自然语言非常灵活，任何数据无法覆盖足够多的句子，因此，无法像公式\eqref{eq:5-4}一样直接用简单计数的方式对句子的翻译概率进行估计。这里，采用一个退而求其次的方法：找到一个函数$g(\seq{s},\seq{t})\ge 0$来模拟翻译概率对译文可能性进行估计。可以定义一个新的函数$g(\seq{s},\seq{t})$，令其满足：给定$\seq{s}$，翻译结果$\seq{t}$出现的可能性越大，$g(\seq{s},\seq{t})$的值越大；$\seq{t}$出现的可能性越小，$g(\seq{s},\seq{t})$的值越小。换句话说，$g(\seq{s},\seq{t})$和翻译概率$\funp{P}(\seq{t}|\seq{s})$呈正相关。如果存在这样的函数$g(\seq{s},\seq{t}
+\parinterval 计算句子级翻译概率并不简单。因为自然语言非常灵活，任何数据无法覆盖足够多的句子，因此，无法像公式\eqref{eq:5-5}一样直接用简单计数的方式对句子的翻译概率进行估计。这里，采用一个退而求其次的方法：找到一个函数$g(\seq{s},\seq{t})\ge 0$来模拟翻译概率对译文可能性进行估计。可以定义一个新的函数$g(\seq{s},\seq{t})$，令其满足：给定$\seq{s}$，翻译结果$\seq{t}$出现的可能性越大，$g(\seq{s},\seq{t})$的值越大；$\seq{t}$出现的可能性越小，$g(\seq{s},\seq{t})$的值越小。换句话说，$g(\seq{s},\seq{t})$和翻译概率$\funp{P}(\seq{t}|\seq{s})$呈正相关。如果存在这样的函数$g(\seq{s},\seq{t}
 )$，可以利用$g(\seq{s},\seq{t})$近似表示$\funp{P}(\seq{t}|\seq{s})$，如下：
 \begin{eqnarray}
-\funp{P}(\seq{t}|\seq{s})  \equiv  \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t}'}g(\seq{s},\seq{t}')}
+\funp{P}(\seq{t}|\seq{s}) & \equiv & \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t}'}g(\seq{s},\seq{t}')}
-\label{eq:5-6}
+\label{eq:5-7}
 \end{eqnarray}
-\parinterval 公式\eqref{eq:5-6}相当于在函数$g(\cdot)$上做了归一化，这样等式右端的结果具有一些概率的属性，比如，$0 \le \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t'}}g(\seq{s},\seq{t'})} \le 1$。具体来说，对于源语言句子$\seq{s}$，枚举其所有的翻译结果，并把所对应的函数$g(\cdot)$相加作为分母，而分子是某个翻译结果$\seq{t}$所对应的$g(\cdot)$的值。
+\parinterval 公式\eqref{eq:5-7}相当于在函数$g(\cdot)$上做了归一化，这样等式右端的结果具有一些概率的属性，比如，$0 \le \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t'}}g(\seq{s},\seq{t'})} \le 1$。具体来说，对于源语言句子$\seq{s}$，枚举其所有的翻译结果，并把所对应的函数$g(\cdot)$相加作为分母，而分子是某个翻译结果$\seq{t}$所对应的$g(\cdot)$的值。
 \parinterval 上述过程初步建立了句子级翻译模型，并没有直接求$\funp{P}(\seq{t}|\seq{s})$，而是把问题转化为对$g(\cdot)$的设计和计算上。但是，面临着两个新的问题：
@@ -338,13 +338,13 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?
 \vspace{0.5em}
 \item 如何定义函数$g(\seq{s},\seq{t})$？即，在知道单词翻译概率的前提下，如何计算$g(\seq{s},\seq{t})$；
 \vspace{0.5em}
-\item 公式\eqref{eq:5-6}中分母$\sum_{seq{t'}}g(\seq{s},{\seq{t}'})$需要累加所有翻译结果的$g(\seq{s},{\seq{t}'})$，但枚举所有${\seq{t}'}$是不现实的。
+\item 公式\eqref{eq:5-7}中分母$\sum_{seq{t'}}g(\seq{s},{\seq{t}'})$需要累加所有翻译结果的$g(\seq{s},{\seq{t}'})$，但枚举所有${\seq{t}'}$是不现实的。
 \vspace{0.5em}
 \end{itemize}
 \parinterval  当然，这里最核心的问题还是函数$g(\seq{s},\seq{t})$的定义。而第二个问题其实不需要解决，因为机器翻译只关注于可能性最大的翻译结果，即$g(\seq{s},\seq{t})$的计算结果最大时对应的译文。这个问题会在后面进行讨论。
-\parinterval 回到设计$g(\seq{s},\seq{t})$的问题上。这里，采用“大题小作”的方法，这个技巧在{\chaptertwo}已经进行了充分的介绍。具体来说，直接建模句子之间的对应比较困难，但可以利用单词之间的对应来描述句子之间的对应关系。这就用到了上一小节所介绍的单词翻译概率。
+\parinterval 回到设计$g(\seq{s},\seq{t})$的问题上。这里，采用“大题小作”的方法，这个技巧在{\chaptertwo}已经进行了充分的介绍。具体来说，直接建模句子之间的对应比较困难，但可以利用单词之间的对应来描述句子之间的对应关系。这就用到了\ref{chapter5.2.3}小节所介绍的单词翻译概率。
 \parinterval 首先引入一个非常重要的概念\ \dash \ {\small\sffamily\bfseries{词对齐}}\index{词对齐}（Word Alignment）\index{Word Alignment}，它是统计机器翻译中最核心的概念之一。词对齐描述了平行句对中单词之间的对应关系，它体现了一种观点：本质上句子之间的对应是由单词之间的对应表示的。当然，这个观点在神经机器翻译或者其他模型中可能会有不同的理解，但是翻译句子的过程中考虑词级的对应关系是符合人类对语言的认知的。
@@ -362,15 +362,15 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?
 \parinterval 对于句对$(\seq{s},\seq{t})$，假设可以得到最优词对齐$\widehat{A}$，于是可以使用单词翻译概率计算$g(\seq{s},\seq{t})$，如下
 \begin{eqnarray}
-g(\seq{s},\seq{t}) = \prod_{(j,i)\in \widehat{A}}\funp{P}(s_j,t_i)
+g(\seq{s},\seq{t}) &= &\prod_{(j,i)\in \widehat{A}}\funp{P}(s_j,t_i)
-\label{eq:5-7}
+\label{eq:5-8}
 \end{eqnarray}
 \noindent 其中$g(\seq{s},\seq{t})$被定义为句子$\seq{s}$中的单词和句子$\seq{t}$中的单词的翻译概率的乘积，并且这两个单词之间必须有词对齐连接。$\funp{P}(s_j,t_i)$表示具有词对齐连接的源语言单词$s_j$和目标语言单词$t_i$的单词翻译概率。以图\ref{fig:5-7}中的句对为例，其中“我”与“I”、“对”与“with”、“你” 与“you”等相互对应，可以把它们的翻译概率相乘得到$g(\seq{s},\seq{t})$的计算结果，如下：
 \begin{eqnarray}
 {g(\seq{s},\seq{t})}&= &  \funp{P}(\textrm{我,I}) \times \funp{P}(\textrm{对,with}) \times \funp{P}(\textrm{你,you}) \times \nonumber \\
          &    & \funp{P}(\textrm{感到, am}) \times \funp{P}(\textrm{满意,satisfied})
-\label{eq:5-8}
+\label{eq:5-9}
 \end{eqnarray}
 \parinterval  显然，如果每个词对齐连接所对应的翻译概率变大，那么整个句子翻译的得分也会提高。也就是说，词对齐越准确，翻译模型的打分越高，$\seq{s}$和$\seq{t}$之间存在翻译关系的可能性越大。
@@ -381,7 +381,7 @@ g(\seq{s},\seq{t}) = \prod_{(j,i)\in \widehat{A}}\funp{P}(s_j,t_i)
 \subsubsection{2. 生成流畅的译文}
-\parinterval 公式\eqref{eq:5-7}定义的$g(\seq{s},\seq{t})$存在的问题是没有考虑词序信息。这里用一个简单的例子说明这个问题。如图\ref{fig:5-8}所示，源语言句子“我 对 你 感到 满意”有两个翻译结果，第一个翻译结果是“I am satisfied with you”，第二个是“I with you am satisfied”。虽然这两个译文包含的目标语单词是一样的，但词序存在很大差异。比如，它们都选择了“satisfied”作为源语单词“满意”的译文，但是在第一个翻译结果中“satisfied”处于第3个位置，而第二个结果中处于最后的位置。显然第一个翻译结果更符合英语的表达习惯，翻译的质量更高。遗憾的是，对于有明显差异的两个译文，公式\eqref{eq:5-7}计算得到的函数$g(\cdot)$的值却是一样的。
+\parinterval 公式\eqref{eq:5-8}定义的$g(\seq{s},\seq{t})$存在的问题是没有考虑词序信息。这里用一个简单的例子说明这个问题。如图\ref{fig:5-8}所示，源语言句子“我 对 你 感到 满意”有两个翻译结果，第一个翻译结果是“I am satisfied with you”，第二个是“I with you am satisfied”。虽然这两个译文包含的目标语单词是一样的，但词序存在很大差异。比如，它们都选择了“satisfied”作为源语单词“满意”的译文，但是在第一个翻译结果中“satisfied”处于第3个位置，而第二个结果中处于最后的位置。显然第一个翻译结果更符合英语的表达习惯，翻译的质量更高。遗憾的是，对于有明显差异的两个译文，公式\eqref{eq:5-8}计算得到的函数$g(\cdot)$的值却是一样的。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -398,18 +398,18 @@ g(\seq{s},\seq{t}) = \prod_{(j,i)\in \widehat{A}}\funp{P}(s_j,t_i)
 \begin{eqnarray}
 \funp{P}_{\textrm{lm}}(\seq{t}) & = & \funp{P}_{\textrm{lm}}(t_1...t_l) \nonumber \\
                                           & =  & \funp{P}(t_1)\times \funp{P}(t_2|t_1)\times \funp{P}(t_3|t_2)\times ... \times \funp{P}(t_l|t_{l-1})
-\label{eq:5-9}
+\label{eq:5-10}
 \end{eqnarray}
 \noindent  其中，$\seq{t}=t_1...t_l$表示由$l$个单词组成的句子，$\funp{P}_{\textrm{lm}}(\seq{t})$表示语言模型给句子$\seq{t}$的打分。具体而言，$\funp{P}_{\textrm{lm}}(\seq{t})$被定义为$\funp{P}(t_i|t_{i-1})(i=1,2,...,l)$的连乘\footnote{为了确保数学表达的准确性，本书中定义$\funp{P}(t_1|t_0) \equiv \funp{P}(t_1)$}，其中$\funp{P}(t_i|t_{i-1})(i=1,2,...,l)$表示前面一个单词为$t_{i-1}$时，当前单词为$t_i$的概率。语言模型的训练方法可以参看{\chaptertwo}相关内容。
-\parinterval 回到建模问题上来。既然语言模型可以帮助系统度量每个译文的流畅度，那么可以使用它对翻译进行打分。一种简单的方法是把语言模型$\funp{P}_{\textrm{lm}}{(\seq{t})}$ 和公式\eqref{eq:5-7}中的$g(\seq{s},\seq{t})$相乘，这样就得到了一个新的$g(\seq{s},\seq{t})$，它同时考虑了翻译准确性（$\prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)}$）和流畅度（$\funp{P}_{\textrm{lm}}(\seq{t})$）:
+\parinterval 回到建模问题上来。既然语言模型可以帮助系统度量每个译文的流畅度，那么可以使用它对翻译进行打分。一种简单的方法是把语言模型$\funp{P}_{\textrm{lm}}{(\seq{t})}$ 和公式\eqref{eq:5-8}中的$g(\seq{s},\seq{t})$相乘，这样就得到了一个新的$g(\seq{s},\seq{t})$，它同时考虑了翻译准确性（$\prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)}$）和流畅度（$\funp{P}_{\textrm{lm}}(\seq{t})$）:
 \begin{eqnarray}
-g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times  \funp{P}_{\textrm{lm}}(\seq{t})
+g(\seq{s},\seq{t}) & \equiv & \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times  \funp{P}_{\textrm{lm}}(\seq{t})
-\label{eq:5-10}
+\label{eq:5-11}
 \end{eqnarray}
-\parinterval 如图\ref{fig:5-9}所示，语言模型$\funp{P}_{\textrm{lm}}(\seq{t})$分别给$\seq{t}^{'}$和$\seq{t}^{”}$赋予0.0107和0.0009的概率，这表明句子$\seq{t}^{'}$更符合英文的表达，这与期望是相吻合的。它们再分别乘以$\prod_{j,i \in \widehat{A}}{\funp{P}(s_j},t_i)$的值，就得到公式\eqref{eq:5-10}定义的函数$g(\cdot)$的值。显然句子$\seq{t}^{'}$的分数更高。至此，完成了对函数$g(\seq{s},\seq{t})$的一个简单定义，把它带入公式\eqref{eq:5-6}就得到了同时考虑准确性和流畅性的句子级统计翻译模型。
+\parinterval 如图\ref{fig:5-9}所示，语言模型$\funp{P}_{\textrm{lm}}(\seq{t})$分别给$\seq{t}^{'}$和$\seq{t}^{”}$赋予0.0107和0.0009的概率，这表明句子$\seq{t}^{'}$更符合英文的表达，这与期望是相吻合的。它们再分别乘以$\prod_{j,i \in \widehat{A}}{\funp{P}(s_j},t_i)$的值，就得到公式\eqref{eq:5-11}定义的函数$g(\cdot)$的值。显然句子$\seq{t}^{'}$的分数更高。至此，完成了对函数$g(\seq{s},\seq{t})$的一个简单定义，把它带入公式\eqref{eq:5-7}就得到了同时考虑准确性和流畅性的句子级统计翻译模型。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -430,23 +430,23 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \parinterval 解码是指在得到翻译模型后，对于新输入的句子生成最佳译文的过程。具体来说，当给定任意的源语言句子$\seq{s}$，解码系统要找到翻译概率最大的目标语译文$\hat{\seq{t}}$。这个过程可以被形式化描述为：
 \begin{eqnarray}
-\widehat{\seq{t}}=\argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})
+\widehat{\seq{t}}&=&\argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})
-\label{eq:5-11}
+\label{eq:5-12}
 \end{eqnarray}
-\noindent  其中$\argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})$表示找到使$\funp{P}(\seq{t}|\seq{s})$达到最大时的译文$\seq{t}$。结合上一小节中关于$\funp{P}(\seq{t}|\seq{s})$的定义，把公式\eqref{eq:5-6}带入公式\eqref{eq:5-11}得到：
+\noindent  其中$\argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})$表示找到使$\funp{P}(\seq{t}|\seq{s})$达到最大时的译文$\seq{t}$。结合\ref{sec:sentence-level-translation}小节中关于$\funp{P}(\seq{t}|\seq{s})$的定义，把公式\eqref{eq:5-7}带入公式\eqref{eq:5-12}得到：
 \begin{eqnarray}
-\widehat{\seq{t}}=\argmax_{\seq{t}}\frac{g(\seq{s},\seq{t})}{\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}
+\widehat{\seq{t}}&=&\argmax_{\seq{t}}\frac{g(\seq{s},\seq{t})}{\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}
-\label{eq:5-12}
+\label{eq:5-13}
 \end{eqnarray}
-\parinterval 在公式\eqref{eq:5-12}中，可以发现${\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}$是一个关于$\seq{s}$的函数，当给定源语句$\seq{s}$时，它是一个常数，而且$g(\cdot) \ge 0$，因此${\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}$不影响对$\widehat{\seq{t}}$的求解，也不需要计算。基于此，公式\eqref{eq:5-12}可以被化简为：
+\parinterval 在公式\eqref{eq:5-13}中，可以发现${\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}$是一个关于$\seq{s}$的函数，当给定源语句$\seq{s}$时，它是一个常数，而且$g(\cdot) \ge 0$，因此${\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}$不影响对$\widehat{\seq{t}}$的求解，也不需要计算。基于此，公式\eqref{eq:5-13}可以被化简为：
 \begin{eqnarray}
-\widehat{\seq{t}}=\argmax_{\seq{t}}g(\seq{s},\seq{t})
+\widehat{\seq{t}}&=&\argmax_{\seq{t}}g(\seq{s},\seq{t})
-\label{eq:5-13}
+\label{eq:5-14}
 \end{eqnarray}
-\parinterval 公式\eqref{eq:5-13}定义了解码的目标，剩下的问题是实现$\argmax$，以快速准确地找到最佳译文$\widehat{\seq{t}}$。但是，简单遍历所有可能的译文并计算$g(\seq{s},\seq{t})$ 的值是不可行的，因为所有潜在译文构成的搜索空间是十分巨大的。为了理解机器翻译的搜索空间的规模，假设源语言句子$\seq{s}$有$m$个词，每个词有$n$个可能的翻译候选。如果从左到右一步步翻译每个源语言单词，那么简单的顺序翻译会有$n^m$种组合。如果进一步考虑目标语单词的任意调序，每一种对翻译候选进行选择的结果又会对应$m!$种不同的排序。因此，源语句子$\seq{s}$至少有$n^m \cdot m!$ 个不同的译文。
+\parinterval 公式\eqref{eq:5-14}定义了解码的目标，剩下的问题是实现$\argmax$，以快速准确地找到最佳译文$\widehat{\seq{t}}$。但是，简单遍历所有可能的译文并计算$g(\seq{s},\seq{t})$ 的值是不可行的，因为所有潜在译文构成的搜索空间是十分巨大的。为了理解机器翻译的搜索空间的规模，假设源语言句子$\seq{s}$有$m$个词，每个词有$n$个可能的翻译候选。如果从左到右一步步翻译每个源语言单词，那么简单的顺序翻译会有$n^m$种组合。如果进一步考虑目标语单词的任意调序，每一种对翻译候选进行选择的结果又会对应$m!$种不同的排序。因此，源语句子$\seq{s}$至少有$n^m \cdot m!$ 个不同的译文。
 \parinterval $n^{m}\cdot m!$是什么样的概念呢？如表\ref{tab:5-2}所示，当$m$和$n$分别为2和10时，译文只有200个，不算多。但是当$m$和$n$分别为20和10时，即源语言句子的长度20，每个词有10个候选译文，系统会面对$2.4329 \times 10^{38}$个不同的译文，这几乎是不可计算的。
@@ -479,18 +479,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \end{figure}
 %----------------------------------------------
-\parinterval 图\ref{fig:5-10}给出了贪婪解码算法的伪代码。其中$\pi$保存所有源语单词的候选译文，$\pi[j]$表示第$j$个源语单词的翻译候选的集合，$best$保存当前最好的翻译结果，$h$保存当前步生成的所有译文候选。算法的主体有两层循环，在内层循环中如果第$j$个源语单词没有被翻译过，则用$best$和它的候选译文$\pi[j]$生成新的翻译，再存于$h$中，即操作$h=h\cup{\textrm{Join}(best,\pi[j])}$。外层循环再从$h$中选择得分最高的结果存于$best$中，即操作$best=\textrm{PruneForTop1}(h)$；同时标识相应的源语单词已翻译，即$used[best.j]=true$。
+\parinterval 图\ref{fig:5-10}给出了贪婪解码算法的伪代码。其中$\pi$保存所有源语单词的候选译文，$\pi[j]$表示第$j$个源语单词的翻译候选的集合，$best$保存当前最好的翻译结果，$h$保存当前步生成的所有译文候选。算法的主体有两层循环，在内层循环中如果第$j$个源语单词没有被翻译过，则用$best$和它的候选译文$\pi[j]$生成新的翻译，再存于$h$中，即操作$h=h\cup{\textrm{Join}(best,\pi[j])}$。外层循环再从$h$中选择得分最高的结果存于$best$中，即操作$best=\textrm{PruneForTop1}(h)$；同时标记相应的源语言单词状态为已翻译，即$used[best.j]=true$。
-%----------------------------------------------
-%\begin{figure}[htp]
-%    \centering
-%\subfigure{\input{./Chapter5/Figures/figure-greedy-mt-decoding-process-1}}
-%\subfigure{\input{./Chapter5/Figures/greedy-mt-decoding-process-3}}
-%\setlength{\belowcaptionskip}{14.0em}
-    %\caption{贪婪的机器翻译解码过程实例}
-    %\label{fig:5-11}
-%\end{figure}
-%----------------------------------------------
 %----------------------------------------------
 \begin{figure}[htp]
@@ -542,22 +531,22 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \parinterval 举个例子，对于汉译英的翻译任务，英语句子$\seq{t}$可以被看作是汉语句子$\seq{s}$加入噪声通过信道后得到的结果。换句话说，汉语句子经过噪声-信道传输时发生了变化，在信道的输出端呈现为英语句子。于是需要根据观察到的汉语特征，通过概率$\funp{P}(\seq{t}|\seq{s})$猜测最为可能的英语句子。这个找到最可能的目标语句（信源）的过程也被称为
 {\small\sffamily\bfseries{解码}}（Decoding）。直到今天，解码这个概念也被广泛地使用在机器翻译及相关任务中。这个过程也可以表述为：给定输入$\seq{s}$，找到最可能的输出$\seq{t}$，使得$\funp{P}(\seq{t}|\seq{s})$达到最大：
 \begin{eqnarray}
-\widehat{\seq{t}}=\argmax_{\seq{t}}\funp{P}(\seq{t}|\seq{s})
+\widehat{\seq{t}}&=&\argmax_{\seq{t}}\funp{P}(\seq{t}|\seq{s})
-\label{eq:5-14}
+\label{eq:5-15}
 \end{eqnarray}
-\parinterval 公式\eqref{eq:5-14}的核心内容之一是定义$\funp{P}(\seq{t}|\seq{s})$。在IBM模型中，可以使用贝叶斯准则对$\funp{P}(\seq{t}|\seq{s})$进行如下变换：
+\parinterval 公式\eqref{eq:5-15}的核心内容之一是定义$\funp{P}(\seq{t}|\seq{s})$。在IBM模型中，可以使用贝叶斯准则对$\funp{P}(\seq{t}|\seq{s})$进行如下变换：
 \begin{eqnarray}
 \funp{P}(\seq{t}|\seq{s}) & = &\frac{\funp{P}(\seq{s},\seq{t})}{\funp{P}(\seq{s})} \nonumber \\
                       & = & \frac{\funp{P}(\seq{s}|\seq{t})\funp{P}(\seq{t})}{\funp{P}(\seq{s})}
-\label{eq:5-15}
+\label{eq:5-16}
 \end{eqnarray}
-\parinterval 公式\eqref{eq:5-15}把$\seq{s}$到$\seq{t}$的翻译概率转化为$\frac{\funp{P}(\seq{s}|\seq{t})\textrm{P(t)}}{\funp{P}(\seq{s})}$，它包括三个部分：
+\parinterval 公式\eqref{eq:5-16}把$\seq{s}$到$\seq{t}$的翻译概率转化为$\frac{\funp{P}(\seq{s}|\seq{t})\textrm{P(t)}}{\funp{P}(\seq{s})}$，它包括三个部分：
 \begin{itemize}
 \vspace{0.5em}
-\item 第一部分是由译文$\seq{t}$到源语言句子$\seq{s}$的翻译概率$\funp{P}(\seq{s}|\seq{t})$，也被称为翻译模型。它表示给定目标语句$\seq{t}$生成源语句$\seq{s}$的概率。需要注意是翻译的方向已经从$\funp{P}(\seq{t}|\seq{s})$转向了$\funp{P}(\seq{s}|\seq{t})$，但无须刻意地区分，可以简单地理解为翻译模型刻画了$\seq{s}$和$\seq{t}$的翻译对应程度；
+\item 第一部分是由译文$\seq{t}$到源语言句子$\seq{s}$的翻译概率$\funp{P}(\seq{s}|\seq{t})$，也被称为翻译模型。它表示给定目标语句$\seq{t}$生成源语句$\seq{s}$的概率。需要注意是翻译的方向已经从$\funp{P}(\seq{t}|\seq{s})$转向了$\funp{P}(\seq{s}|\seq{t})$，但无须刻意地区分，可以简单地理解为翻译模型描述了$\seq{s}$和$\seq{t}$的翻译对应程度；
 \vspace{0.5em}
 \item 第二部分是$\funp{P}(\seq{t})$，也被称为语言模型。它表示的是目标语言句子$\seq{t}$出现的可能性；
 \vspace{0.5em}
@@ -570,14 +559,14 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \widehat{\seq{t}} & = & \argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s}) \nonumber \\
          & = & \argmax_{\seq{t}} \frac{\funp{P}(\seq{s}|\seq{t})\funp{P}(\seq{t})}{\funp{P}(\seq{s})} \nonumber \\
          & = & \argmax_{\seq{t}} \funp{P}(\seq{s}|\seq{t})\funp{P}(\seq{t})
-\label{eq:5-16}
+\label{eq:5-17}
 \end{eqnarray}
-\parinterval 公式\eqref{eq:5-16}展示了IBM模型最基础的建模方式，它把模型分解为两项：（反向）翻译模型$\funp{P}(\seq{s}|\seq{t})$和语言模型$\funp{P}(\seq{t})$。一个很自然的问题是：直接用$\funp{P}(\seq{t}|\seq{s})$定义翻译问题不就可以了吗，为什么要用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型？从理论上来说，正向翻译模型$\funp{P}(\seq{t}|\seq{s})$和反向翻译模型$\funp{P}(\seq{s}|\seq{t})$的数学建模可以是一样的，因为我们只需要在建模的过程中把两个语言调换即可。使用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型的意义在于引入了语言模型，它可以很好地对译文的流畅度进行评价，确保结果是通顺的目标语言句子。
+\parinterval 公式\eqref{eq:5-17}展示了IBM模型最基础的建模方式，它把模型分解为两项：（反向）翻译模型$\funp{P}(\seq{s}|\seq{t})$和语言模型$\funp{P}(\seq{t})$。仔细观察公式\eqref{eq:5-17}的推导过程，我们很容易发现一个问题：直接用$\funp{P}(\seq{t}|\seq{s})$定义翻译问题不就可以了吗，为什么要用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型？从理论上来说，正向翻译模型$\funp{P}(\seq{t}|\seq{s})$和反向翻译模型$\funp{P}(\seq{s}|\seq{t})$的数学建模可以是一样的，因为我们只需要在建模的过程中把两个语言调换即可。使用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型的意义在于引入了语言模型，它可以很好地对译文的流畅度进行评价，确保结果是通顺的目标语言句子。
-\parinterval 可以回忆一下\ref{sec:sentence-level-translation}节中讨论的问题，如果只使用翻译模型可能会造成一个局面：译文的单词都和源语言单词对应的很好，但是由于语序的问题，读起来却不像人说的话。从这个角度说，引入语言模型是十分必要的。这个问题在Brown等人的论文中也有讨论\upcite{DBLP:journals/coling/BrownPPM94}，他们提到单纯使用$\funp{P}(\seq{s}|\seq{t})$会把概率分配给一些翻译对应比较好但是不合法的目标语句子，而且这部分概率可能会很大，影响模型的决策。这也正体现了IBM模型的创新之处，作者用数学技巧把$\funp{P}(\seq{t})$引入进来，保证了系统的输出是通顺的译文。语言模型也被广泛使用在语音识别等领域以保证结果的流畅性，甚至应用的历史比机器翻译要长得多，这里的方法也有借鉴相关工作的味道。
+\parinterval 可以回忆一下\ref{sec:sentence-level-translation}节中讨论的问题，如果只使用翻译模型可能会造成一个局面：译文的单词都和源语言单词对应的很好，但是由于语序的问题，读起来却不像人说的话。从这个角度说，引入语言模型是十分必要的。这个问题在Brown等人的论文中也有讨论\upcite{DBLP:journals/coling/BrownPPM94}，他们提到单纯使用$\funp{P}(\seq{s}|\seq{t})$会把概率分配给一些翻译对应比较好但是不通顺甚至不合逻辑的目标语言句子，而且这部分概率可能会很大，影响模型的决策。这也正体现了IBM模型的创新之处，作者用数学技巧把$\funp{P}(\seq{t})$引入进来，保证了系统的输出是通顺的译文。语言模型也被广泛使用在语音识别等领域以保证结果的流畅性，甚至应用的历史比机器翻译要长得多，这里的方法也有借鉴相关工作的味道。
-实际上，在机器翻译中引入语言模型是一个很深刻的概念。在IBM模型之后相当长的时间里，语言模型一直是机器翻译各个部件中最重要的部分。对译文连贯性的建模也是所有系统中需要包含的内容（即使隐形体现）。
+实际上，在机器翻译中引入语言模型这个概念十分重要。在IBM模型之后相当长的时间里，语言模型一直是机器翻译各个部件中最重要的部分。对译文连贯性的建模也是所有系统中需要包含的内容（即使隐形体现）。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -585,7 +574,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \section{统计机器翻译的三个基本问题}
-\parinterval 公式\eqref{eq:5-16}给出了统计机器翻译的数学描述。为了实现这个过程，面临着三个基本问题：
+\parinterval 公式\eqref{eq:5-17}给出了统计机器翻译的数学描述。为了实现这个过程，面临着三个基本问题：
 \begin{itemize}
 \vspace{0.5em}
@@ -597,13 +586,13 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \vspace{0.5em}
 \end{itemize}
-\parinterval 为了理解以上的问题，可以先回忆一下\ref{sec:sentence-level-translation}小节中的公式\eqref{eq:5-10}，即$g(\seq{s},\seq{t})$函数的定义，它用于评估一个译文的好与坏。如图\ref{fig:5-14}所示，$g(\seq{s},\seq{t})$函数与公式\eqref{eq:5-16}的建模方式非常一致，即$g(\seq{s},\seq{t})$函数中红色部分描述译文$\seq{t}$的可能性大小，对应翻译模型$\funp{P}(\seq{s}|\seq{t})$；蓝色部分描述译文的平滑或流畅程度，对应语言模型$\funp{P}(\seq{t})$。尽管这种对应并不十分严格的，但也可以看出在处理机器翻译问题上，很多想法的本质是一样的。
+\parinterval 为了理解以上的问题，可以先回忆一下\ref{sec:sentence-level-translation}小节中的公式\eqref{eq:5-11}，即$g(\seq{s},\seq{t})$函数的定义，它用于评估一个译文的好与坏。如图\ref{fig:5-14}所示，$g(\seq{s},\seq{t})$函数与公式\eqref{eq:5-17}的建模方式非常一致，即$g(\seq{s},\seq{t})$函数中红色部分描述译文$\seq{t}$的可能性大小，对应翻译模型$\funp{P}(\seq{s}|\seq{t})$；蓝色部分描述译文的平滑或流畅程度，对应语言模型$\funp{P}(\seq{t})$。尽管这种对应并不十分严格的，但也可以看出在处理机器翻译问题上，很多想法的本质是一样的。
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
 \input{./Chapter5/Figures/figure-correspondence-between-ibm-model&formula-1.13}
-    \caption{IBM模型与公式\eqref{eq:5-10}的对应关系}
+    \caption{IBM模型与公式\eqref{eq:5-11}的对应关系}
    \label{fig:5-14}
 \end{figure}
 %----------------------------------------------
@@ -656,13 +645,13 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \parinterval 直接准确估计$\funp{P}(\seq{s}|\seq{t})$很难，训练数据只能覆盖整个样本空间非常小的一部分，绝大多数句子在训练数据中一次也没出现过。为了解决这个问题，IBM模型假设：句子之间的对应可以由单词之间的对应进行表示。于是，翻译句子的概率可以被转化为词对齐生成的概率：
 \begin{eqnarray}
-\funp{P}(\seq{s}|\seq{t})= \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})
+\funp{P}(\seq{s}|\seq{t})&=& \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})
-\label{eq:5-17}
+\label{eq:5-18}
 \end{eqnarray}
-\parinterval 公式\eqref{eq:5-17}使用了简单的全概率公式把$\funp{P}(\seq{s}|\seq{t})$进行展开。通过访问$\seq{s}$和$\seq{t}$之间所有可能的词对齐$\seq{a}$，并把对应的对齐概率进行求和，得到了$\seq{t}$到$\seq{s}$的翻译概率。这里，可以把词对齐看作翻译的隐含变量，这样从$\seq{t}$到$\seq{s}$的生成就变为从$\seq{t}$同时生成$\seq{s}$和隐含变量$\seq{a}$的问题。引入隐含变量是生成式模型常用的手段，通过使用隐含变量，可以把较为困难的端到端学习问题转化为分步学习问题。
+\parinterval 公式\eqref{eq:5-18}使用了简单的全概率公式把$\funp{P}(\seq{s}|\seq{t})$进行展开。通过访问$\seq{s}$和$\seq{t}$之间所有可能的词对齐$\seq{a}$，并把对应的对齐概率进行求和，得到了$\seq{t}$到$\seq{s}$的翻译概率。这里，可以把词对齐看作翻译的隐含变量，这样从$\seq{t}$到$\seq{s}$的生成就变为从$\seq{t}$同时生成$\seq{s}$和隐含变量$\seq{a}$的问题。引入隐含变量是生成式模型常用的手段，通过使用隐含变量，可以把较为困难的端到端学习问题转化为分步学习问题。
-\parinterval 举个例子说明公式\eqref{eq:5-17}的实际意义。如图\ref{fig:5-17}所示，可以把从“谢谢\ 你”到“thank you”的翻译分解为9种可能的词对齐。因为源语言句子$\seq{s}$有2个词，目标语言句子$\seq{t}$加上空标记$t_0$共3个词，因此每个源语言单词有3个可能对齐的位置，整个句子共有$3\times3=9$种可能的词对齐。
+\parinterval 举个例子说明公式\eqref{eq:5-18}的实际意义。如图\ref{fig:5-17}所示，可以把从“谢谢\ 你”到“thank you”的翻译分解为9种可能的词对齐。因为源语言句子$\seq{s}$有2个词，目标语言句子$\seq{t}$加上空标记$t_0$共3个词，因此每个源语言单词有3个可能对齐的位置，整个句子共有$3\times3=9$种可能的词对齐。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -675,11 +664,11 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \parinterval 接下来的问题是如何定义$\funp{P}(\seq{s},\seq{a}|\seq{t})$\ \dash \ 即定义词对齐的生成概率。但是，隐含变量$\seq{a}$仍然很复杂，因此直接定义$\funp{P}(\seq{s},\seq{a}|\seq{t})$也很困难，在IBM模型中，为了化简问题，$\funp{P}(\seq{s},\seq{a}|\seq{t})$被进一步分解。使用链式法则，可以得到：
 \begin{eqnarray}
-\funp{P}(\seq{s},\seq{a}|\seq{t})=\funp{P}(m|\seq{t})\prod_{j=1}^{m}{\funp{P}(a_j|\seq{a}{}_1^{j-1},\seq{s}{}_1^{j-1},m,\seq{t})\funp{P}(s_j|\seq{a}{}_1^{j},\seq{s}{}_1^{j-1},m,\seq{t})}
+\funp{P}(\seq{s},\seq{a}|\seq{t})&=&\funp{P}(m|\seq{t})\prod_{j=1}^{m}{\funp{P}(a_j|\seq{a}{}_1^{j-1},\seq{s}{}_1^{j-1},m,\seq{t})\funp{P}(s_j|\seq{a}{}_1^{j},\seq{s}{}_1^{j-1},m,\seq{t})}
-\label{eq:5-18}
+\label{eq:5-19}
 \end{eqnarray}
-\noindent  其中$s_j$和$a_j$分别表示第$j$个源语言单词及第$j$个源语言单词对齐到的目标位置，\seq{s}${{}_1^{j-1}}$表示前$j-1$个源语言单词（即\seq{s}${}_1^{j-1}=s_1...s_{j-1}$），\seq{a}${}_1^{j-1}$表示前$j-1$个源语言的词对齐（即\seq{a}${}_1^{j-1}=a_1...a_{j-1}$），$m$表示源语句子的长度。公式\eqref{eq:5-18}将$\funp{P}(\seq{s},\seq{a}|\seq{t})$分解为四个部分，具体含义如下：
+\noindent  其中$s_j$和$a_j$分别表示第$j$个源语言单词及第$j$个源语言单词对齐到的目标位置，\seq{s}${{}_1^{j-1}}$表示前$j-1$个源语言单词（即\seq{s}${}_1^{j-1}=s_1...s_{j-1}$），\seq{a}${}_1^{j-1}$表示前$j-1$个源语言的词对齐（即\seq{a}${}_1^{j-1}=a_1...a_{j-1}$），$m$表示源语句子的长度。公式\eqref{eq:5-19}将$\funp{P}(\seq{s},\seq{a}|\seq{t})$分解为四个部分，具体含义如下：
 \begin{itemize}
 \vspace{0.5em}
@@ -694,7 +683,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \end{itemize}
 \parinterval 换句话说，当求$\funp{P}(\seq{s},\seq{a}|\seq{t})$时，首先根据译文$\seq{t}$确定源语言句子$\seq{s}$的长度$m$；当知道源语言句子有多少个单词后，循环$m$次，依次生成第1个到第$m$个源语言单词；当生成第$j$个源语言单词时，要先确定它是由哪个目标语译文单词生成的，即确定生成的源语言单词对应的译文单词的位置；当知道了目标语译文单词的位置，就能确定第$j$个位置的源语言单词。
-\parinterval 需要注意的是公式\eqref{eq:5-18}定义的模型并没有做任何化简和假设，也就是说公式的左右两端是严格相等的。在后面的内容中会看到，这种将一个整体进行拆分的方法可以有助于分步骤化简并处理问题。
+\parinterval 需要注意的是公式\eqref{eq:5-19}定义的模型并没有做任何化简和假设，也就是说公式的左右两端是严格相等的。在后面的内容中会看到，这种将一个整体进行拆分的方法可以有助于分步骤化简并处理问题。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -702,7 +691,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \subsection{基于词对齐的翻译实例}
-\parinterval 用前面图\ref{fig:5-16}中例子来对公式\eqref{eq:5-18}进行说明。例子中，源语言句子“在\ \ 桌子\ \ 上”目标语译文“on the table”之间的词对齐为$\seq{a}=\{\textrm{1-0, 2-3, 3-1}\}$。 公式\eqref{eq:5-18}的计算过程如下：
+\parinterval 用前面图\ref{fig:5-16}中例子来对公式\eqref{eq:5-19}进行说明。例子中，源语言句子“在\ \ 桌子\ \ 上”目标语译文“on the table”之间的词对齐为$\seq{a}=\{\textrm{1-0, 2-3, 3-1}\}$。 公式\eqref{eq:5-19}的计算过程如下：
 \begin{itemize}
 \vspace{0.5em}
@@ -724,7 +713,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 &&{\funp{P}(s_2=\textrm{桌子} \mid \textrm{\{1-0, 2-3\}},\textrm{在},3,\textrm{$t_0$ on the table}) {\times}} \nonumber \\
 &&{\funp{P}(a_3=1 \mid \textrm{\{1-0, 2-3\}},\textrm{在\ \ 桌子},3,\textrm{$t_0$ on the table}) {\times}} \nonumber \\
 &&{\funp{P}(s_3=\textrm{上} \mid \textrm{\{1-0, 2-3, 3-1\}},\textrm{在\ \ 桌子},3,\textrm{$t_0$ on the table})  }
-\label{eq:5-19}
+\label{eq:5-20}
 \end{eqnarray}
 %----------------------------------------------------------------------------------------
@@ -732,14 +721,14 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 %----------------------------------------------------------------------------------------
 \sectionnewpage
-\section{IBM模型1}
+\section{IBM模型1}\label{IBM-model1}
-\parinterval 公式\eqref{eq:5-17}和公式\eqref{eq:5-18}把翻译问题定义为对译文和词对齐同时进行生成的问题。其中有两个问题：
+\parinterval 公式\eqref{eq:5-18}和公式\eqref{eq:5-19}把翻译问题定义为对译文和词对齐同时进行生成的问题。其中有两个问题：
 \begin{itemize}
 \vspace{0.3em}
-\item 首先，公式\eqref{eq:5-17}的右端（$ \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$）要求对所有的词对齐概率进行求和，但是词对齐的数量随着句子长度是呈指数增长，如何遍历所有的对齐$\seq{a}$？
+\item 首先，公式\eqref{eq:5-18}的右端（$ \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$）要求对所有的词对齐概率进行求和，但是词对齐的数量随着句子长度是呈指数增长，如何遍历所有的对齐$\seq{a}$？
 \vspace{0.3em}
-\item 其次，公式\eqref{eq:5-18}虽然对词对齐的问题进行了描述，但是模型中的很多参数仍然很复杂，如何计算$\funp{P}(m|\seq{t})$、$\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})$ 和$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})$？
+\item 其次，公式\eqref{eq:5-19}虽然对词对齐的问题进行了描述，但是模型中的很多参数仍然很复杂，如何计算$\funp{P}(m|\seq{t})$、$\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})$ 和$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})$？
 \vspace{0.3em}
 \end{itemize}
@@ -749,37 +738,37 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------
 \vspace{-0.5em}
-\subsection{IBM模型1}
+\subsection{IBM模型1的建模}
-\parinterval IBM模型1对公式\eqref{eq:5-18}中的三项进行了简化。具体方法如下：
+\parinterval IBM模型1对公式\eqref{eq:5-19}中的三项进行了简化。具体方法如下：
 \begin{itemize}
 \item 假设$\funp{P}(m|\seq{t})$为常数$\varepsilon$，即源语言句子长度的生成概率服从均匀分布，如下：
 \begin{eqnarray}
-\funp{P}(m|\seq{t})\; \equiv \; \varepsilon
+\funp{P}(m|\seq{t})& \equiv & \varepsilon
-\label{eq:5-20}
+\label{eq:5-21}
 \end{eqnarray}
-\item 对齐概率$\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})$仅依赖于译文长度$l$，即每个词对齐连接的生成概率也服从均匀分布。换句话说，对于任何源语言位置$j$对齐到目标语言任何位置都是等概率的。比如译文为“on the table”，再加上$t_0$共4个位置，相应的，任意源语单词对齐到这4个位置的概率是一样的。具体描述如下：
+\item 对齐概率$\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})$仅依赖于译文长度$l$，即每个词对齐连接的生成概率也服从均匀分布。换句话说，对于任意源语言位置$j$对齐到目标语言任意位置都是等概率的。比如译文为“on the table”，再加上$t_0$共4个位置，相应的，任意源语单词对齐到这4个位置的概率是一样的。具体描述如下：
 \begin{eqnarray}
-\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t}) \equiv \frac{1}{l+1}
+\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})& \equiv &  \frac{1}{l+1}
-\label{eq:5-21}
+\label{eq:5-22}
 \end{eqnarray}
 \item 源语单词$s_j$的生成概率$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})$仅依赖与其对齐的译文单词$t_{a_j}$，即词汇翻译概率$f(s_j|t_{a_j})$。此时词汇翻译概率满足$\sum_{s_j}{f(s_j|t_{a_j})}=1$。比如在图\ref{fig:5-18}表示的例子中，源语单词“上”出现的概率只和与它对齐的单词“on”有关系，与其他单词没有关系。
 \begin{eqnarray}
-\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t}) \equiv f(s_j|t_{a_j})
+\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})& \equiv &  f(s_j|t_{a_j})
-\label{eq:5-22}
+\label{eq:5-23}
 \end{eqnarray}
-用一个简单的例子对公式\eqref{eq:5-22}进行说明。比如，在图\ref{fig:5-18}中，“桌子”对齐到“table”，可被描述为$f(s_2 |t_{a_2})=f(\textrm{“桌子”}|\textrm{“table”})$，表示给定“table”翻译为“桌子”的概率。通常，$f(s_2 |t_{a_2})$被认为是一种概率词典，它反应了两种语言词汇一级的对应关系。
+用一个简单的例子对公式\eqref{eq:5-23}进行说明。比如，在图\ref{fig:5-18}中，“桌子”对齐到“table”，可被描述为$f(s_2 |t_{a_2})=f(\textrm{“桌子”}|\textrm{“table”})$，表示给定“table”翻译为“桌子”的概率。通常，$f(s_2 |t_{a_2})$被认为是一种概率词典，它反应了两种语言词汇一级的对应关系。
 \end{itemize}
-\parinterval 将上述三个假设和公式\eqref{eq:5-18}代入公式\eqref{eq:5-17}中，得到$\funp{P}(\seq{s}|\seq{t})$的表达式：
+\parinterval 将上述三个假设和公式\eqref{eq:5-19}代入公式\eqref{eq:5-18}中，得到$\funp{P}(\seq{s}|\seq{t})$的表达式：
 \begin{eqnarray}
 \funp{P}(\seq{s}|\seq{t}) & = &  \sum_{\seq{a}}{\funp{P}(\seq{s},\seq{a}|\seq{t})} \nonumber \\
                        & = &  \sum_{\seq{a}}{\funp{P}(m|\seq{t})}\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})\funp{P}(s_j |a_1^j,s_1^{j-1},m,\seq{t})} \nonumber \\
                        & = &  \sum_{\seq{a}}{\varepsilon}\prod_{j=1}^{m}{\frac{1}{l+1}f(s_j|t_{a_j})} \nonumber \\
                        & = & \sum_{\seq{a}}{\frac{\varepsilon}{(l+1)^m}}\prod_{j=1}^{m}f(s_j|t_{a_j})
-\label{eq:5-23}
+\label{eq:5-24}
 \end{eqnarray}
 %----------------------------------------------
@@ -791,19 +780,19 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \end{figure}
 %----------------------------------------------
-\parinterval 在公式\eqref{eq:5-23}中，需要遍历所有的词对齐，即$ \sum_{\seq{a}}{\cdot}$。但这种表示不够直观，因此可以把这个过程重新表示为如下形式：
+\parinterval 在公式\eqref{eq:5-24}中，需要遍历所有的词对齐，即$ \sum_{\seq{a}}{\cdot}$。但这种表示不够直观，因此可以把这个过程重新表示为如下形式：
 \begin{eqnarray}
-\funp{P}(\seq{s}|\seq{t})={\sum_{a_1=0}^{l}\cdots}{\sum_{a_m=0}^{l}\frac{\varepsilon}{(l+1)^m}}{\prod_{j=1}^{m}f(s_j|t_{a_j})}
+\funp{P}(\seq{s}|\seq{t})&=&{\sum_{a_1=0}^{l}\cdots}{\sum_{a_m=0}^{l}\frac{\varepsilon}{(l+1)^m}}{\prod_{j=1}^{m}f(s_j|t_{a_j})}
-\label{eq:5-24}
+\label{eq:5-25}
 \end{eqnarray}
-\parinterval 公式\eqref{eq:5-24}分为两个主要部分。第一部分：遍历所有的对齐$\seq{a}$。其中$\seq{a}$由$\{a_1,...,a_m\}$\\ 组成，每个$a_j\in \{a_1,...,a_m\}$从译文的开始位置$(0)$循环到截止位置$(l)$。如图\ref{fig:5-19}表示的例子，描述的是源语单词$s_3$从译文的开始$t_0$遍历到结尾$t_3$，即$a_3$的取值范围。第二部分: 对于每个$\seq{a}$累加对齐概率$\funp{P}(\seq{s},a| \seq{t})=\frac{\varepsilon}{(l+1)^m}{\prod_{j=1}^{m}f(s_j|t_{a_j})}$。
+\parinterval 公式\eqref{eq:5-25}分为两个主要部分。第一部分：遍历所有的对齐$\seq{a}$。其中$\seq{a}$由$\{a_1,...,a_m\}$\\ 组成，每个$a_j\in \{a_1,...,a_m\}$从译文的开始位置$(0)$循环到截止位置$(l)$。如图\ref{fig:5-19}表示的例子，描述的是源语单词$s_3$从译文的开始$t_0$遍历到结尾$t_3$，即$a_3$的取值范围。第二部分: 对于每个$\seq{a}$累加对齐概率$\funp{P}(\seq{s},a| \seq{t})=\frac{\varepsilon}{(l+1)^m}{\prod_{j=1}^{m}f(s_j|t_{a_j})}$。
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
 \input{./Chapter5/Figures/figure-formula-3.25-part-1-example}
-    \caption{公式{\eqref{eq:5-24}}第一部分实例}
+    \caption{公式{\eqref{eq:5-25}}第一部分实例}
    \label{fig:5-19}
 \end{figure}
 %----------------------------------------------
@@ -816,36 +805,36 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \subsection{解码及计算优化}\label{decoding&computational-optimization}
-\parinterval 如果模型参数给定，可以使用IBM模型1对新的句子进行翻译。比如，可以使用\ref{sec:simple-decoding}节描述的解码方法搜索最优译文。在搜索过程中，只需要通过公式\eqref{eq:5-24}计算每个译文候选的IBM模型翻译概率。但是，公式\eqref{eq:5-24}的高计算复杂度导致这些模型很难直接使用。以IBM模型1为例，这里把公式\eqref{eq:5-24}重写为：
+\parinterval 如果模型参数给定，可以使用IBM模型1对新的句子进行翻译。比如，可以使用\ref{sec:simple-decoding}节描述的解码方法搜索最优译文。在搜索过程中，只需要通过公式\eqref{eq:5-25}计算每个译文候选的IBM模型翻译概率。但是，公式\eqref{eq:5-25}的高计算复杂度导致这些模型很难直接使用。以IBM模型1为例，这里把公式\eqref{eq:5-25}重写为：
 \begin{eqnarray}
-\funp{P}(\seq{s}| \seq{t}) = \frac{\varepsilon}{(l+1)^{m}} \underbrace{\sum\limits_{a_1=0}^{l} ... \sum\limits_{a_m=0}^{l}}_{(l+1)^m\textrm{次循环}} \underbrace{\prod\limits_{j=1}^{m} f(s_j|t_{a_j})}_{m\textrm{次循环}}
+\funp{P}(\seq{s}| \seq{t}) &=& \frac{\varepsilon}{(l+1)^{m}} \underbrace{\sum\limits_{a_1=0}^{l} ... \sum\limits_{a_m=0}^{l}}_{(l+1)^m\textrm{次循环}} \underbrace{\prod\limits_{j=1}^{m} f(s_j|t_{a_j})}_{m\textrm{次循环}}
-\label{eq:5-27}
+\label{eq:5-26}
 \end{eqnarray}
 \noindent 可以看到，遍历所有的词对齐需要$(l+1)^m$次循环，遍历所有源语言位置累计$f(s_j|t_{a_j})$需要$m$次循环，因此这个模型的计算复杂度为$O((l+1)^m m)$。当$m$较大时，计算这样的模型几乎是不可能的。不过，经过仔细观察，可以发现公式右端的部分有另外一种计算方法，如下：
 \begin{eqnarray}
-\sum\limits_{a_1=0}^{l} ... \sum\limits_{a_m=0}^{l} \prod\limits_{j=1}^{m} f(s_j|t_{a_j}) = \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)
+\sum\limits_{a_1=0}^{l} ... \sum\limits_{a_m=0}^{l} \prod\limits_{j=1}^{m} f(s_j|t_{a_j}) &=& \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)
-\label{eq:5-28}
+\label{eq:5-27}
 \end{eqnarray}
-\noindent  公式\eqref{eq:5-28}的技巧在于把若干个乘积的加法（等式左手端）转化为若干加法结果的乘积（等式右手端），这样省去了多次循环，把$O((l+1)^m m)$的计算复杂度降为$O((l+1)m)$。此外，公式\eqref{eq:5-28}相比公式\eqref{eq:5-27}的另一个优点在于，公式\eqref{eq:5-28}中乘法的数量更少，因为现代计算机中乘法运算的代价要高于加法，因此公式\eqref{eq:5-28}的计算机实现效率更高。图\ref{fig:5-21} 对这个过程进行了进一步解释。
+\noindent  公式\eqref{eq:5-27}的技巧在于把若干个乘积的加法（等式左手端）转化为若干加法结果的乘积（等式右手端），这样省去了多次循环，把$O((l+1)^m m)$的计算复杂度降为$O((l+1)m)$。此外，公式\eqref{eq:5-27}相比公式\eqref{eq:5-26}的另一个优点在于，公式\eqref{eq:5-27}中乘法的数量更少，因为现代计算机中乘法运算的代价要高于加法，因此公式\eqref{eq:5-27}的计算机实现效率更高。图\ref{fig:5-21} 对这个过程进行了进一步解释。
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
 \input{./Chapter5/Figures/figure-example-of-formula3.29}
-   \caption{$\sum\limits_{a_1=0}^{l} ... \sum\limits_{a_m=0}^{l} \prod\limits_{j=1}^{m} f(s_j|t_{a_j}) = \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)$的实例}
+   \caption{$\sum\limits_{a_1=0}^{l} ... \sum\limits_{a_m=0}^{l} \prod\limits_{j=1}^{m} f(s_j|t_{a_j}) \; = \; \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)$的实例}
   \label{fig:5-21}
 \end{figure}
 %----------------------------------------------
-\parinterval 接着，利用公式\eqref{eq:5-28}的方式，可以把公式\eqref{eq:5-24}重写表示为：
+\parinterval 接着，利用公式\eqref{eq:5-27}的方式，可以把公式\eqref{eq:5-25}重写表示为：
 \begin{eqnarray}
-\textrm{IBM模型1：\ \ \ \ } \funp{P}(\seq{s}| \seq{t}) & = & \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \label{eq:5-64}
+\textrm{IBM模型1：\ \ \ \ } \funp{P}(\seq{s}| \seq{t}) & = & \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)
-\label{eq:5-29}
+\label{eq:5-28}
 \end{eqnarray}
-公式\eqref{eq:5-64}是IBM模型1的最终表达式，在解码和训练中可以被直接使用。
+公式\eqref{eq:5-28}是IBM模型1的最终表达式，在解码和训练中可以被直接使用。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -874,18 +863,18 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \parinterval 在IBM模型中，优化的目标函数被定义为$\funp{P}(\seq{s}| \seq{t})$。也就是，对于给定的句对$(\seq{s},\seq{t})$，最大化翻译概率$\funp{P}(\seq{s}| \seq{t})$。 这里用符号$\funp{P}_{\theta}(\seq{s}|\seq{t})$表示模型由参数$\theta$决定，模型训练可以被描述为对目标函数$\funp{P}_{\theta}(\seq{s}|\seq{t})$的优化过程：
 \begin{eqnarray}
-\widehat{\theta}=\argmax_{\theta}\funp{P}_{\theta}(\seq{s}|\seq{t})
+\widehat{\theta}&=&\argmax_{\theta}\funp{P}_{\theta}(\seq{s}|\seq{t})
-\label{eq:5-30}
+\label{eq:5-29}
 \end{eqnarray}
 \noindent 其中，$\argmax_{\theta}$表示求最优参数的过程（或优化过程）。
-\parinterval 公式\eqref{eq:5-30}实际上也是一种基于极大似然的模型训练方法。这里，可以把$\funp{P}_{\theta}(\seq{s}|\seq{t})$看作是模型对数据描述的一个似然函数，记作$L(\seq{s},\seq{t};\theta)$。也就是，优化目标是对似然函数的优化：$\{\widehat{\theta}\}=\{\argmax_{\theta \in \Theta}L(\seq{s},\seq{t};\theta)\}$，其中\{$\widehat{\theta}$\} 表示可能有多个结果，$\Theta$表示参数空间。
+\parinterval 公式\eqref{eq:5-29}实际上也是一种基于极大似然的模型训练方法。这里，可以把$\funp{P}_{\theta}(\seq{s}|\seq{t})$看作是模型对数据描述的一个似然函数，记作$L(\seq{s},\seq{t};\theta)$。也就是，优化目标是对似然函数的优化：$\{\widehat{\theta}\}=\{\argmax_{\theta \in \Theta}L(\seq{s},\seq{t};\theta)\}$，其中\{$\widehat{\theta}$\} 表示可能有多个结果，$\Theta$表示参数空间。
-\parinterval 回到IBM模型的优化问题上。以IBM模型1为例，优化的目标是最大化翻译概率$\funp{P}(\seq{s}| \seq{t})$。使用公式\eqref{eq:5-64} ，可以把这个目标表述为：
+\parinterval 回到IBM模型的优化问题上。以IBM模型1为例，优化的目标是最大化翻译概率$\funp{P}(\seq{s}| \seq{t})$。使用公式\eqref{eq:5-28} ，可以把这个目标表述为：
 \begin{eqnarray}
 &                    & \textrm{max}\Big(\frac{\varepsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}{f({s_j|t_i})}\Big) \nonumber \\
-& \textrm{s.t.} & \textrm{任意单词} t_{y}:\;\sum_{s_x}{f(s_x|t_y)}=1 \nonumber
+& \textrm{s.t.} & \textrm{任意单词} t_{y}:\;\sum_{s_x}{f(s_x|t_y)} = 1 \nonumber
 \label{eq:5-31}
 \end{eqnarray}
 \noindent 其中，$\textrm{max}(\cdot)$表示最大化，$\frac{\varepsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}{f({s_j|t_i})}$是目标函数，$f({s_j|t_i})$是模型的参数，$\sum_{s_x}{f(s_x|t_y)}=1$是优化的约束条件，以保证翻译概率满足归一化的要求。需要注意的是$\{f(s_x |t_y)\}$对应了很多参数，每个源语言单词和每个目标语单词的组合都对应一个参数$f(s_x |t_y)$。
@@ -898,11 +887,11 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \parinterval 可以看到，IBM模型的参数训练问题本质上是带约束的目标函数优化问题。由于目标函数是可微分函数，解决这类问题的一种常用手法是把带约束的优化问题转化为不带约束的优化问题。这里用到了{\small\sffamily\bfseries{拉格朗日乘数法}}\index{拉格朗日乘数法}（Lagrange Multiplier Method）\index{The Lagrange Multiplier Method}，它的基本思想是把含有$n$个变量和$m$个约束条件的优化问题转化为含有$n+m$个变量的无约束优化问题。
-\parinterval 这里的目标是$\max(\funp{P}_{\theta}(\seq{s}|\seq{t}))$，约束条件是对于任意的目标语单词$t_y$有\\$\sum_{s_x}{\funp{P}(s_x|t_y)}=1$。根据拉格朗日乘数法，可以把上述优化问题重新定义最大化如下拉格朗日函数：
+\parinterval 这里的目标是$\max(\funp{P}_{\theta}(\seq{s}|\seq{t}))$，约束条件是对于任意的目标语单词$t_y$有\\$\sum_{s_x}{\funp{P}(s_x|t_y)}=1$。根据拉格朗日乘数法，可以把上述优化问题重新定义为最大化如下拉格朗日函数的问题：
 \vspace{-0.5em}
 \begin{eqnarray}
-L(f,\lambda)=\frac{\varepsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}{f(s_j|t_i)}-\sum_{t_y}{\lambda_{t_y}(\sum_{s_x}{f(s_x|t_y)}-1)}
+L(f,\lambda)&=&\frac{\varepsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}{f(s_j|t_i)}-\sum_{t_y}{\lambda_{t_y}(\sum_{s_x}{f(s_x|t_y)}-1)}
-\label{eq:5-32}
+\label{eq:5-30}
 \end{eqnarray}
 \vspace{-0.3em}
@@ -922,29 +911,30 @@ L(f,\lambda)=\frac{\varepsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}{f(s_j|t_i)
 \frac{\partial L(f,\lambda)}{\partial f(s_u|t_v)}& = & \frac{\partial \big[ \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)} - \nonumber \\
                                                                     &     & \frac{\partial \big[ \sum_{t_y} \lambda_{t_y} (\sum_{s_x} f(s_x|t_y) -1) \big]}{\partial f(s_u|t_v)} \nonumber \\
                                                                     & =  & \frac{\varepsilon}{(l+1)^{m}} \cdot \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)} - \lambda_{t_v}
-\label{eq:5-33}
+\label{eq:5-31}
 \end{eqnarray}
 \noindent 这里$s_u$和$t_v$分别表示源语言和目标语言词表中的某一个单词。为了求$\frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)}$，这里引入一个辅助函数。令$g(z)=\alpha z^{\beta}$ 为变量$z$ 的函数，显然，
 $\frac{\partial g(z)}{\partial z} = \alpha \beta z^{\beta-1} = \frac{\beta}{z}\alpha z^{\beta} = \frac{\beta}{z} g(z)$。这里可以把$\prod_{j=1}^{m} \sum_{i=0}^{l} f(s_j|t_i)$看做$g(z)=\alpha z^{\beta}$的实例。首先，令$z=\sum_{i=0}^{l}f(s_u|t_i)$，注意$s_u$为给定的源语单词。然后，把$\beta$定义为$\sum_{i=0}^{l}f(s_u|t_i)$在$\prod_{j=1}^{m} \sum_{i=0}^{l} f(s_j|t_i)$ 中出现的次数，即源语句子中与$s_u$相同的单词的个数。
 \begin{eqnarray}
-\beta=\sum_{j=1}^{m} \delta(s_j,s_u)
+\beta &=& \sum_{j=1}^{m} \delta(s_j,s_u)
-\label{eq:5-34}
+\label{eq:5-32}
 \end{eqnarray}
 \noindent 其中，当$x=y$时，$\delta(x,y)=1$，否则为0。
 \parinterval 根据$\frac{\partial g(z)}{\partial z} = \frac{\beta}{z} g(z)$，可以得到
 \begin{eqnarray}
-\frac{\partial g(z)}{\partial z} = \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial \big[ \sum\limits_{i=0}^{l}f(s_u|t_i) \big]} = \frac{\sum\limits_{j=1}^{m} \delta(s_j,s_u)}{\sum\limits_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)
+\frac{\partial g(z)}{\partial z}& =& \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial \big[ \sum\limits_{i=0}^{l}f(s_u|t_i) \big]} \nonumber \\
-\label{eq:5-35}
+& = &\frac{\sum\limits_{j=1}^{m} \delta(s_j,s_u)}{\sum\limits_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)
+\label{eq:5-33}
 \end{eqnarray}
 \parinterval 根据$\frac{\partial g(z)}{\partial z}$和$\frac{\partial z}{\partial f}$计算的结果，可以得到
 \begin{eqnarray}
 {\frac{\partial \big[ \prod_{j=1}^{m} \sum_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)}}& =& {{\frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial \big[ \sum\limits_{i=0}^{l}f(s_u|t_i) \big]}} \cdot{\frac{\partial \big[ \sum\limits_{i=0}^{l}f(s_u|t_i) \big]}{\partial f(s_u|t_v)}}} \nonumber \\
 & = &{\frac{\sum\limits_{j=1}^{m} \delta(s_j,s_u)}{\sum\limits_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \cdot \sum\limits_{i=0}^{l} \delta(t_i,t_v)}
-\label{eq:5-36}
+\label{eq:5-34}
 \end{eqnarray}
 \parinterval 将$\frac{\partial \big[ \prod_{j=1}^{m} \sum_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)}$进一步代入$\frac{\partial L(f,\lambda)}{\partial f(s_u|t_v)}$，得到$L(f,\lambda)$的导数
@@ -952,22 +942,22 @@ $\frac{\partial g(z)}{\partial z} = \alpha \beta z^{\beta-1} = \frac{\beta}{z}\a
 & &{\frac{\partial L(f,\lambda)}{\partial f(s_u|t_v)}}\nonumber \\
 &=&{\frac{\varepsilon}{(l+1)^{m}} \cdot \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_{a_j}) \big]}{\partial f(s_u|t_v)} - \lambda_{t_v}}\nonumber \\
 &=&{\frac{\varepsilon}{(l+1)^{m}} \frac{\sum_{j=1}^{m} \delta(s_j,s_u) \cdot \sum_{i=0}^{l} \delta(t_i,t_v)}{\sum_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) - \lambda_{t_v}}
-\label{eq:5-37}
+\label{eq:5-35}
 \end{eqnarray}
 \parinterval 令$\frac{\partial L(f,\lambda)}{\partial f(s_u|t_v)}=0$，有
 \begin{eqnarray}
-f(s_u|t_v) = \frac{\lambda_{t_v}^{-1} \varepsilon}{(l+1)^{m}} \cdot \frac{\sum\limits_{j=1}^{m} \delta(s_j,s_u) \cdot \sum\limits_{i=0}^{l} \delta(t_i,t_v)}{\sum\limits_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \cdot f(s_u|t_v)
+f(s_u|t_v) &=& \frac{\lambda_{t_v}^{-1} \varepsilon}{(l+1)^{m}} \cdot \frac{\sum\limits_{j=1}^{m} \delta(s_j,s_u) \cdot \sum\limits_{i=0}^{l} \delta(t_i,t_v)}{\sum\limits_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \cdot f(s_u|t_v)
-\label{eq:5-38}
+\label{eq:5-36}
 \end{eqnarray}
 \parinterval 将上式稍作调整得到下式：
 \begin{eqnarray}
-f(s_u|t_v) = \lambda_{t_v}^{-1} \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \sum\limits_{j=1}^{m} \delta(s_j,s_u) \sum\limits_{i=0}^{l} \delta(t_i,t_v) \frac{f(s_u|t_v) }{\sum\limits_{i=0}^{l}f(s_u|t_i)}
+f(s_u|t_v) &=& \lambda_{t_v}^{-1} \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \sum\limits_{j=1}^{m} \delta(s_j,s_u) \sum\limits_{i=0}^{l} \delta(t_i,t_v) \frac{f(s_u|t_v) }{\sum\limits_{i=0}^{l}f(s_u|t_i)}
-\label{eq:5-39}
+\label{eq:5-37}
 \end{eqnarray}
-\parinterval  可以看出，这不是一个计算$f(s_u|t_v)$的解析式，因为等式右端仍含有$f(s_u|t_v)$。不过它蕴含着一种非常经典的方法\ $\dash$\ {\small\sffamily\bfseries{期望最大化}}\index{期望最大化}（Expectation Maximization）\index{Expectation Maximization}方法，简称EM方法（或算法）。使用EM方法可以利用上式迭代地计算$f(s_u|t_v)$，使其最终收敛到最优值。EM方法的思想是：用当前的参数，求似然函数的期望，之后最大化这个期望同时得到新的一组参数的值。对于IBM模型来说，其迭代过程就是反复使用公式\eqref{eq:5-39}，具体如图\ref{fig:5-24}所示。
+\parinterval  可以看出，这不是一个计算$f(s_u|t_v)$的解析式，因为等式右端仍含有$f(s_u|t_v)$。不过它蕴含着一种非常经典的方法\ $\dash$\ {\small\sffamily\bfseries{期望最大化}}\index{期望最大化}（Expectation Maximization）\index{Expectation Maximization}方法，简称EM方法（或算法）。使用EM方法可以利用式\ref{eq:5-37}迭代地计算$f(s_u|t_v)$，使其最终收敛到最优值。EM方法的思想是：用当前的参数，求似然函数的期望，之后最大化这个期望同时得到新的一组参数的值。对于IBM模型来说，其迭代过程就是反复使用公式\eqref{eq:5-37}，具体如图\ref{fig:5-24}所示。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -978,22 +968,22 @@ f(s_u|t_v) = \lambda_{t_v}^{-1} \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}
 \end{figure}
 %----------------------------------------------
-\parinterval 为了化简$f(s_u|t_v)$的计算，在此对公式\eqref{eq:5-39}进行了重新组织，见图\ref{fig:5-25}。其中，红色部分表示翻译概率P$(\seq{s}|\seq{t})$；蓝色部分表示$(s_u,t_v)$ 在句对$(\seq{s},\seq{t})$中配对的总次数，即“$t_v$翻译为$s_u$”在所有对齐中出现的次数；绿色部分表示$f(s_u|t_v)$对于所有的$t_i$的相对值，即“$t_v$翻译为$s_u$”在所有对齐中出现的相对概率；蓝色与绿色部分相乘表示“$t_v$翻译为$s_u$”这个事件出现次数的期望的估计，称之为{\small\sffamily\bfseries{期望频次}}\index{期望频次}（Expected Count）\index{Expected Count}。
+\parinterval 为了化简$f(s_u|t_v)$的计算，在此对公式\eqref{eq:5-37}进行了重新组织，见图\ref{fig:5-25}。其中，红色部分表示翻译概率P$(\seq{s}|\seq{t})$；蓝色部分表示$(s_u,t_v)$ 在句对$(\seq{s},\seq{t})$中配对的总次数，即“$t_v$翻译为$s_u$”在所有对齐中出现的次数；绿色部分表示$f(s_u|t_v)$对于所有的$t_i$的相对值，即“$t_v$翻译为$s_u$”在所有对齐中出现的相对概率；蓝色与绿色部分相乘表示“$t_v$翻译为$s_u$”这个事件出现次数的期望的估计，称之为{\small\sffamily\bfseries{期望频次}}\index{期望频次}（Expected Count）\index{Expected Count}。
 \vspace{-0.3em}
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
 \input{./Chapter5/Figures/figure-a-more-detailed-explanation-of-formula-3.40}
-   \caption{公式\eqref{eq:5-39}的解释}
+   \caption{公式\eqref{eq:5-37}的解释}
   \label{fig:5-25}
 \end{figure}
 %----------------------------------------------
 \parinterval 期望频次是事件在其分布下出现次数的期望。另$c_{\mathbb{E}}(X)$为事件$X$的期望频次，其计算公式为：
+\begin{eqnarray}
-\begin{equation}
+c_{\mathbb{E}}(X)&=&\sum_i c(x_i) \cdot \funp{P}(x_i)
-c_{\mathbb{E}}(X)=\sum_i c(x_i) \cdot \funp{P}(x_i)
+\label{eq:5-38}
-\end{equation}
+\end{eqnarray}
 \noindent 其中$c(x_i)$表示$X$取$x_i$时出现的次数，$\funp{P}(x_i)$表示$X=x_i$出现的概率。图\ref{fig:5-26}展示了事件$X$的期望频次的详细计算过程。其中$x_1$、$x_2$和$x_3$分别表示事件$X$出现2次、1次和5次的情况。
@@ -1009,39 +999,39 @@ c_{\mathbb{E}}(X)=\sum_i c(x_i) \cdot \funp{P}(x_i)
 \parinterval 因为在$\funp{P}(\seq{s}|\seq{t})$中，$t_v$翻译（连接）到$s_u$的期望频次为：
 \begin{eqnarray}
-c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t}) \equiv \sum\limits_{j=1}^{m} \delta(s_j,s_u) \cdot \sum\limits_{i=0}^{l} \delta(t_i,t_v) \cdot \frac {f(s_u|t_v)}{\sum\limits_{i=0}^{l}f(s_u|t_i)}
+c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t}) & \equiv & \sum\limits_{j=1}^{m} \delta(s_j,s_u) \cdot \sum\limits_{i=0}^{l} \delta(t_i,t_v) \cdot \frac {f(s_u|t_v)}{\sum\limits_{i=0}^{l}f(s_u|t_i)}
-\label{eq:5-40}
+\label{eq:5-39}
 \end{eqnarray}
-\parinterval 所以公式\ref {eq:5-39}可重写为：
+\parinterval 所以公式\ref {eq:5-37}可重写为：
 \begin{eqnarray}
-f(s_u|t_v)=\lambda_{t_v}^{-1} \cdot \funp{P}(\seq{s}| \seq{t}) \cdot c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})
+f(s_u|t_v)&=&\lambda_{t_v}^{-1} \cdot \funp{P}(\seq{s}| \seq{t}) \cdot c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})
-\label{eq:5-41}
+\label{eq:5-40}
 \end{eqnarray}
 \parinterval 在此如果令$\lambda_{t_v}^{'}=\frac{\lambda_{t_v}}{\funp{P}(\seq{s}| \seq{t})}$，可得：
 \begin{eqnarray}
 f(s_u|t_v) &= &\lambda_{t_v}^{-1} \cdot \funp{P}(\seq{s}| \seq{t}) \cdot c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t}) \nonumber \\
 &=&{(\lambda_{t_v}^{'})}^{-1} \cdot c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})
-\label{eq:5-42}
+\label{eq:5-41}
 \end{eqnarray}
 \parinterval 又因为IBM模型对$f(\cdot|\cdot)$的约束如下：
 \begin{eqnarray}
-\forall t_y : \sum\limits_{s_x} f(s_x|t_y) =1
+\forall t_y : \sum\limits_{s_x} f(s_x|t_y) &=& 1
-\label{eq:5-43}
+\label{eq:5-42}
 \end{eqnarray}
-\parinterval 为了满足$f(\cdot|\cdot)$的概率归一化约束，易知$\lambda_{t_v}^{'}$为：
+\parinterval 为了满足$f(\cdot|\cdot)$的概率归一化约束，易得$\lambda_{t_v}^{'}$为：
 \begin{eqnarray}
-\lambda_{t_v}^{'}=\sum\limits_{s_u} c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})
+\lambda_{t_v}^{'}&=&\sum\limits_{s_u} c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})
-\label{eq:5-44}
+\label{eq:5-43}
 \end{eqnarray}
 \parinterval 因此，$f(s_u|t_v)$的计算式可再一步变换成下式：
 \begin{eqnarray}
-f(s_u|t_v)=\frac{c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})}  { \sum\limits_{s_u} c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t}) }
+f(s_u|t_v)&=&\frac{c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})}  { \sum\limits_{s_u} c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t}) }
-\label{eq:5-45}
+\label{eq:5-44}
 \end{eqnarray}
@@ -1049,8 +1039,8 @@ f(s_u|t_v)=\frac{c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})}  { \sum\limits_{s_u} c
 \parinterval 进一步，假设有$K$个互译的句对（称作平行语料）：
 $\{(\seq{s}^{[1]},\seq{t}^{[1]}),...,(\seq{s}^{[K]},\seq{t}^{[K]})\}$，$f(s_u|t_v)$的期望频次为：
 \begin{eqnarray}
-c_{\mathbb{E}}(s_u|t_v)=\sum\limits_{k=1}^{K}  c_{\mathbb{E}}(s_u|t_v;s^{[k]},t^{[k]})
+c_{\mathbb{E}}(s_u|t_v)&=&\sum\limits_{k=1}^{K}  c_{\mathbb{E}}(s_u|t_v;s^{[k]},t^{[k]})
-\label{eq:5-46}
+\label{eq:5-45}
 \end{eqnarray}
 \parinterval 于是有$f(s_u|t_v)$的计算公式和迭代过程图\ref{fig:5-27}所示。完整的EM算法如图\ref{fig:5-28}所示。其中E-Step对应4-5行，目的是计算$c_{\mathbb{E}}(\cdot)$；M-Step对应6-9行，目的是计算$f(\cdot|\cdot)$。

--- a/Chapter6/Figures/figure-alignment-matrix-for-zh-to-en-translation.tex
+++ b/Chapter6/Figures/figure-alignment-matrix-for-zh-to-en-translation.tex
@@ -25,7 +25,7 @@
    \node[anchor=west,inner sep=0pt,font=\footnotesize,rotate=45] at([xshift=0.1cm+\bc*2,yshift=0.4em]o.east){satisfied};
    \node[anchor=west,inner sep=0pt,font=\footnotesize,rotate=45] at([xshift=0.1cm+\bc*3,yshift=0.4em]o.east){with};
    \node[anchor=west,inner sep=0pt,font=\footnotesize,rotate=45] at([xshift=0.1cm+\bc*4,yshift=0.4em]o.east){you};
-	\node[anchor=east,inner sep=0pt,font=\footnotesize] at([xshift=\bc*4.5,yshift=-1.0cm-\bc*4]o.west){(a)对齐实例1};
+	\node[anchor=east,inner sep=0pt,font=\small] at([xshift=\bc*4.5,yshift=-1.0cm-\bc*4]o.west){(a)对齐实例1};
 \end{scope}
 \begin{scope}[xshift=15.0em]
    \filldraw [fill=white,drop shadow] (0,0) rectangle (\bc*8,\bc*6);
@@ -56,7 +56,7 @@
    \node[anchor=west,inner sep=0pt,font=\footnotesize,rotate=45] at([xshift=0.1cm+\bc*5,yshift=0.4em]o.east){work};
    \node[anchor=west,inner sep=0pt,font=\footnotesize,rotate=45] at([xshift=0.1cm+\bc*6,yshift=0.4em]o.east){every};
    \node[anchor=west,inner sep=0pt,font=\footnotesize,rotate=45] at([xshift=0.1cm+\bc*7,yshift=0.4em]o.east){day};
-    \node[anchor=east,inner sep=0pt,font=\footnotesize] at([xshift=\bc*6.0,yshift=-1.0cm-\bc*5]o.west){(b)对齐实例2};
+    \node[anchor=east,inner sep=0pt,font=\small] at([xshift=\bc*6.0,yshift=-1.0cm-\bc*5]o.west){(b)对齐实例2};
 \end{scope}
 \end{tikzpicture}
 %---------------------------------------------------------------------
\ No newline at end of file
--- a/Chapter6/Figures/figure-examples-of-sequential-translation-and-reorder-translation.tex
+++ b/Chapter6/Figures/figure-examples-of-sequential-translation-and-reorder-translation.tex
@@ -24,7 +24,7 @@
 		\draw[line width=1.2pt,dashed] ([yshift=-0.3em]n14.south) -- ([yshift=0.2em]n24.north);
 		\draw[line width=1.2pt,dashed] ([yshift=-0.3em]n15.south) -- ([yshift=0.2em]n25.north);
 		\draw[line width=1.2pt,dashed] ([yshift=-0.3em]n16.south) -- ([yshift=0.2em]n26.north);
-        \node[anchor=west] at([xshift=5.5em,yshift=-3em]n21.east){(a)顺序翻译对齐结果};
+        \node[anchor=west] at([xshift=5.5em,yshift=-3em]n21.east){\small{(a)顺序翻译对齐结果}};
 \end{scope}
 \begin{scope}[yshift=-11.5em]
 	\tikzstyle{cand} = [draw,inner sep=4pt,line width=1pt,align=center,drop shadow,minimum height =1.6em,minimum width=4.2em,fill=green!30]
@@ -49,7 +49,7 @@
 		\draw[line width=1.2pt,dashed,out=-40,in=140] ([yshift=-0.3em]n14.south) to ([yshift=0.2em]n26.north);
 		\draw[line width=1.2pt,dashed,out=-140,in=40] ([yshift=-0.3em]n15.south) to ([yshift=0.2em]n23.north);
 		\draw[line width=1.2pt,dashed,out=-140,in=40] ([yshift=-0.3em]n16.south) to ([yshift=0.2em]n24.north);
-		\node[anchor=west] at([xshift=5.5em,yshift=-3em]n21.east){(b)调序翻译对齐结果};
+		\node[anchor=west] at([xshift=5.5em,yshift=-3em]n21.east){\small{(b)调序翻译对齐结果}};
 \end{scope}
 \end{tikzpicture}
 %---------------------------------------------------------------------
\ No newline at end of file
--- a/Chapter6/chapter6.tex
+++ b/Chapter6/chapter6.tex
@@ -23,7 +23,7 @@
 \chapter{基于扭曲度和繁衍率的模型}
-{\chapterfive}展示了一种基于单词的翻译模型。这种模型的形式非常简单，而且其隐含的词对齐信息具有较好的可解释性。不过，语言翻译的复杂性远远超出人们的想象。有两方面挑战\ \dash\ 如何对“ 调序”问题进行建模以及如何对“一对多翻译”问题进行建模。调序是翻译问题中所特有的现象，比如，汉语到日语的翻译中，需要对谓词进行调序。另一方面，一个单词在另一种语言中可能会被翻译为多个连续的词，比如，汉语“ 联合国”翻译到英语会对应三个单词“The United Nations”。这种现象也被称作一对多翻译，它与句子长度预测有着密切的联系。
+{\chapterfive}展示了一种基于单词的翻译模型。这种模型的形式非常简单，而且其隐含的词对齐信息具有较好的可解释性。不过，语言翻译的复杂性远远超出人们的想象。语言翻译主要有两方面挑战\ \dash\ 如何对“ 调序”问题进行建模以及如何对“一对多翻译”问题进行建模。一方面，调序是翻译问题中所特有的现象，比如，汉语到日语的翻译中，需要对谓词进行调序。另一方面，一个单词在另一种语言中可能会被翻译为多个连续的词，比如，汉语“ 联合国”翻译到英语会对应三个单词“The United Nations”。这种现象也被称作一对多翻译，它与句子长度预测有着密切的联系。
 无论是调序还是一对多翻译，简单的翻译模型（如IBM模型1）都无法对其进行很好的处理。因此，需要考虑对这两个问题单独进行建模。本章将会对机器翻译中两个常用的概念进行介绍\ \dash\ 扭曲度（Distortion）和繁衍率（Fertility）。它们可以被看作是对调序和一对多翻译现象的一种统计描述。基于此，本章会进一步介绍基于扭曲度和繁衍率的翻译模型，建立相对完整的基于单词的统计建模体系。相关的技术和概念在后续章节也会被进一步应用。
@@ -34,7 +34,7 @@
 \sectionnewpage
 \section{基于扭曲度的模型}
-下面将介绍扭曲度在机器翻译中的定义及使用方法。这也带来了两个新的翻译模型\ \dash\ IBM模型2\upcite{DBLP:journals/coling/BrownPPM94}和HMM翻译模型\upcite{vogel1996hmm}。
+下面将介绍扭曲度在机器翻译中的定义及使用方法。这也带来了两个新的翻译模型\ \dash\ IBM模型2\upcite{DBLP:journals/coling/BrownPPM94}和HMM\upcite{vogel1996hmm}。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -74,11 +74,11 @@
 \parinterval 对于建模来说，IBM模型1很好地化简了翻译问题，但是由于使用了很强的假设，导致模型和实际情况有较大差异。其中一个比较严重的问题是假设词对齐的生成概率服从均匀分布。IBM模型2抛弃了这个假设\upcite{DBLP:journals/coling/BrownPPM94}。它认为词对齐是有倾向性的，它与源语言单词的位置和目标语言单词的位置有关。具体来说，对齐位置$a_j$的生成概率与位置$j$、源语言句子长度$m$和目标语言句子长度$l$有关，形式化表述为：
 \begin{eqnarray}
-\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t}) \equiv a(a_j|j,m,l)
+\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t}) & \equiv & a(a_j|j,m,l)
 \label{eq:6-1}
 \end{eqnarray}
-\parinterval 这里还用{\chapterthree}中的例子（图\ref{fig:6-3}）来进行说明。在IBM模型1中，“桌子”对齐到目标语言四个位置的概率是一样的。但在IBM模型2中，“桌子”对齐到“table”被形式化为$a(a_j |j,m,l)=a(3|2,3,3)$，意思是对于源语言位置2（$j=2$）的词，如果它的源语言和目标语言都是3个词（$l=3,m=3$），对齐到目标语言位置3（$a_j=3$）的概率是多少？因为$a(a_j|j,m,l)$也是模型需要学习的参数，因此“桌子”对齐到不同目标语言单词的概率也是不一样的。理想的情况下，通过$a(a_j|j,m,l)$，“桌子”对齐到“table”应该得到更高的概率。
+\parinterval 这里还用{\chapterfive}中的例子（图\ref{fig:6-3}）来进行说明。在IBM模型1中，“桌子”对齐到目标语言四个位置的概率是一样的。但在IBM模型2中，“桌子”对齐到“table”被形式化为$a(a_j |j,m,l)=a(3|2,3,3)$，意思是对于源语言位置2（$j=2$）的词，如果它的源语言和目标语言都是3个词（$l=3,m=3$），对齐到目标语言位置3（$a_j=3$）的概率是多少？因为$a(a_j|j,m,l)$也是模型需要学习的参数，因此“桌子”对齐到不同目标语言单词的概率也是不一样的。理想的情况下，通过$a(a_j|j,m,l)$，“桌子”对齐到“table”应该得到更高的概率。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -97,7 +97,7 @@
 \label{eq:s-word-gen-prob}
 \end{eqnarray}
-把公式\eqref{eq:6-1}、\eqref{eq:s-len-gen-prob}和\eqref{eq:s-word-gen-prob}和 重新带入公式$\funp{P}(\seq{s},\seq{a}|\seq{t})=\funp{P}(m|\seq{t})\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})}$\\${\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})}$ 和$\funp{P}(\seq{s}|\seq{t})= \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$，可以得到IBM模型2的数学描述：
+把公式\eqref{eq:6-1}、\eqref{eq:s-len-gen-prob}和\eqref{eq:s-word-gen-prob}重新带入公式$\funp{P}(\seq{s},\seq{a}|\seq{t})=\funp{P}(m|\seq{t})\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})}$\\${\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})}$ 和$\funp{P}(\seq{s}|\seq{t})= \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$，可以得到IBM模型2的数学描述：
 \begin{eqnarray}
 \funp{P}(\seq{s}| \seq{t}) & = &  \sum_{\seq{a}}{\funp{P}(\seq{s},\seq{a}| \seq{t})} \nonumber \\
                       & = & \sum_{a_1=0}^{l}{\cdots}\sum _{a_m=0}^{l}{\varepsilon}\prod_{j=1}^{m}{a(a_j|j,m,l)f(s_j|t_{a_j})}
@@ -106,7 +106,7 @@
 \parinterval 类似于模型1，模型2的表达式\eqref{eq:6-4}也能被拆分为两部分进行理解。第一部分：遍历所有的$\seq{a}$；第二部分：对于每个$\seq{a}$累加对齐概率$\funp{P}(\seq{s},\seq{a}| \seq{t})$，即计算对齐概率$a(a_j|j,m,l)$和词汇翻译概率$f(s_j|t_{a_j})$对于所有源语言位置的乘积。
-\parinterval 同样的，模型2的解码及训练优化和模型1的十分相似，在此不再赘述，详细推导过程可以参看{\chapterfive}解码及计算优化部分。这里直接给出IBM模型2的最终表达式：
+\parinterval 同样的，模型2的解码及训练优化和模型1的十分相似，在此不再赘述，详细推导过程可以参看{\chapterfive}\ref{IBM-model1}小节解码及计算优化部分。这里直接给出IBM模型2的最终表达式：
 \begin{eqnarray}
 \funp{P}(\seq{s}| \seq{t}) & = & \varepsilon \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} a(i|j,m,l) f(s_j|t_i)
 \label{eq:6-5}
@@ -132,7 +132,7 @@
 \parinterval 针对此问题，基于HMM的词对齐模型抛弃了IBM模型1-2的绝对位置假设，将一阶隐马尔可夫模型用于词对齐问题\upcite{vogel1996hmm}。HMM词对齐模型认为，单词与单词之间并不是毫无联系的，对齐概率应该取决于对齐位置的差异而不是本身单词所在的位置。具体来说，位置$j$的对齐概率$a_j$与前一个位置$j-1$的对齐位置$a_{j-1}$和译文长度$l$有关，形式化的表述为：
 \begin{eqnarray}
-\funp{P}(a_{j}|a_{1}^{j-1},s_{1}^{j-1},m,\seq{t})\equiv\funp{P}(a_{j}|a_{j-1},l)
+\funp{P}(a_{j}|a_{1}^{j-1},s_{1}^{j-1},m,\seq{t})& \equiv & \funp{P}(a_{j}|a_{j-1},l)
 \label{eq:6-6}
 \end{eqnarray}
@@ -140,13 +140,13 @@
 \parinterval 把公式$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t}) \equiv f(s_j|t_{a_j})$和\eqref{eq:6-6}重新带入公式$\funp{P}(\seq{s},\seq{a}|\seq{t})=\funp{P}(m|\seq{t})$\\$\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})}$和$\funp{P}(\seq{s}|\seq{t})= \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$,可得HMM词对齐模型的数学描述：
 \begin{eqnarray}
-\funp{P}(\seq{s}| \seq{t})=\sum_{\seq{a}}{\funp{P}(m|\seq{t})}\prod_{j=1}^{m}{\funp{P}(a_{j}|a_{j-1},l)f(s_{j}|t_{a_j})}
+\funp{P}(\seq{s}| \seq{t})& =& \sum_{\seq{a}}{\funp{P}(m|\seq{t})}\prod_{j=1}^{m}{\funp{P}(a_{j}|a_{j-1},l)f(s_{j}|t_{a_j})}
 \label{eq:6-7}
 \end{eqnarray}
 \parinterval 此外，为了使得HMM的对齐概率$\funp{P}(a_{j}|a_{j-1},l)$满足归一化的条件，这里还假设其对齐概率只取决于$a_{j}-a_{j-1}$，即：
 \begin{eqnarray}
-\funp{P}(a_{j}|a_{j-1},l)=\frac{\mu(a_{j}-a_{j-1})}{\sum_{i=1}^{l}{\mu(i-a_{j-1})}}
+\funp{P}(a_{j}|a_{j-1},l)& = & \frac{\mu(a_{j}-a_{j-1})}{\sum_{i=1}^{l}{\mu(i-a_{j-1})}}
 \label{eq:6-8}
 \end{eqnarray}
@@ -179,7 +179,7 @@
 \begin{itemize}
 \vspace{0.3em}
-\item 首先，对于每个英语单词$t_i$决定它的产出率$\varphi_{i}$。比如“Scientists”的产出率是2，可表示为${\varphi}_{1}=2$。这表明它会生成2个汉语单词；
+\item 首先，对于每个英语单词$t_i$确定它的产出率$\varphi_{i}$。比如“Scientists”的产出率是2，可表示为${\varphi}_{1}=2$。这表明它会生成2个汉语单词；
 \vspace{0.3em}
 \item 其次，确定英语句子中每个单词生成的汉语单词列表。比如“Scientists”生成“科学家”和“们”两个汉语单词，可表示为${\tau}_1=\{{\tau}_{11}=\textrm{“科学家”},{\tau}_{12}=\textrm{“们”}\}$。 这里用特殊的空标记NULL表示翻译对空的情况；
 \vspace{0.3em}
@@ -201,10 +201,10 @@
 \parinterval 可以看出，一组$\tau$和$\pi$（记为$<\tau,\pi>$）可以决定一个对齐$\seq{a}$和一个源语句子$\seq{s}$。
 \noindent 相反的，一个对齐$\seq{a}$和一个源语句子$\seq{s}$可以对应多组$<\tau,\pi>$。如图\ref{fig:6-6}所示，不同的$<\tau,\pi>$对应同一个源语言句子和词对齐。它们的区别在于目标语单词“Scientists”生成的源语言单词“科学家”和“ 们”的顺序不同。这里把不同的$<\tau,\pi>$对应到的相同的源语句子$\seq{s}$和对齐$\seq{a}$记为$<\seq{s},\seq{a}>$。因此计算$\funp{P}(\seq{s},\seq{a}| \seq{t})$时需要把每个可能结果的概率加起来，如下：
-\begin{equation}
+\begin{eqnarray}
-\funp{P}(\seq{s},\seq{a}| \seq{t})=\sum_{{<\tau,\pi>}\in{<\seq{s},\seq{a}>}}{\funp{P}(\tau,\pi|\seq{t}) }
+\funp{P}(\seq{s},\seq{a}| \seq{t}) & = & \sum_{{<\tau,\pi>}\in{<\seq{s},\seq{a}>}}{\funp{P}(\tau,\pi|\seq{t}) }
 \label{eq:6-9}
-\end{equation}
+\end{eqnarray}
 %----------------------------------------------
 \begin{figure}[htp]
@@ -233,15 +233,15 @@
 \begin{itemize}
 \vspace{0.5em}
-\item 第一部分：每个$i\in[1,l]$的目标语单词的产出率建模（{\color{red!70} 红色}），即$\varphi_i$的生成概率。它依赖于$\seq{t}$和区间$[1,i-1]$的目标语单词的产出率$\varphi_1^{i-1}$。\footnote{这里约定，当$i=1$ 时，$\varphi_1^0$ 表示空。}
+\item 第一部分：对每个$i\in[1,l]$的目标语单词的产出率建模（{\color{red!70} 红色}），即$\varphi_i$的生成概率。它依赖于$\seq{t}$和区间$[1,i-1]$的目标语单词的产出率$\varphi_1^{i-1}$。\footnote{这里约定，当$i=1$ 时，$\varphi_1^0$ 表示空。}
 \vspace{0.5em}
-\item 第二部分：$i=0$时的产出率建模（{\color{blue!70} 蓝色}），即空标记$t_0$的产出率生成概率。它依赖于$\seq{t}$和区间$[1,i-1]$的目标语单词的产出率$\varphi_1^l$。
+\item 第二部分：对$i=0$时的产出率建模（{\color{blue!70} 蓝色}），即空标记$t_0$的产出率生成概率。它依赖于$\seq{t}$和区间$[1,i-1]$的目标语单词的产出率$\varphi_1^l$。
 \vspace{0.5em}
-\item 第三部分：词汇翻译建模（{\color{green!70} 绿色}），目标语言单词$t_i$生成第$k$个源语言单词$\tau_{ik}$时的概率，依赖于$\seq{t}$、所有目标语言单词的产出率$\varphi_0^l$、区间$i\in[1,l]$的目标语言单词生成的源语言单词$\tau_1^{i-1}$和目标语单词$t_i$生成的前$k$个源语言单词$\tau_{i1}^{k-1}$。
+\item 第三部分：对词汇翻译建模（{\color{green!70} 绿色}），目标语言单词$t_i$生成第$k$个源语言单词$\tau_{ik}$时的概率，依赖于$\seq{t}$、所有目标语言单词的产出率$\varphi_0^l$、区间$i\in[1,l]$的目标语言单词生成的源语言单词$\tau_1^{i-1}$和目标语单词$t_i$生成的前$k$个源语言单词$\tau_{i1}^{k-1}$。
 \vspace{0.5em}
 \item 第四部分：对于每个$i\in[1,l]$的目标语言单词生成的源语言单词的扭曲度建模（{\color{yellow!70!black} 黄色}），即第$i$个目标语言单词生成的第$k$个源语言单词在源文中的位置$\pi_{ik}$ 的概率。其中$\pi_1^{i-1}$ 表示区间$[1,i-1]$的目标语言单词生成的源语言单词的扭曲度，$\pi_{i1}^{k-1}$表示第$i$目标语言单词生成的前$k-1$个源语言单词的扭曲度。
 \vspace{0.5em}
-\item 第五部分：$i=0$时的扭曲度建模（{\color{gray!70} 灰色}），即空标记$t_0$生成源语言位置的概率。
+\item 第五部分：对$i=0$时的扭曲度建模（{\color{gray!70} 灰色}），即空标记$t_0$生成源语言位置的概率。
 \end{itemize}
 %----------------------------------------------------------------------------------------
@@ -262,29 +262,29 @@
 \parinterval 对于$i=0$的情况需要单独进行考虑。实际上，$t_0$只是一个虚拟的单词。它要对应$\seq{s}$中原本为空对齐的单词。这里假设：要等其他非空对应单词都被生成（放置）后，才考虑这些空对齐单词的生成（放置）。即非空对单词都被生成后，在那些还有空的位置上放置这些空对的源语言单词。此外，在任何的空位置上放置空对的源语言单词都是等概率的，即放置空对齐源语言单词服从均匀分布。这样在已经放置了$k$个空对齐源语言单词的时候，应该还有$\varphi_0-k$个空位置。如果第$j$个源语言位置为空，那么
-\begin{equation}
+\begin{eqnarray}
-\funp{P}(\pi_{0k}=j|\pi_{01}^{k-1},\pi_1^l,\tau_0^l,\varphi_0^l,\seq{t})=\frac{1}{\varphi_0-k}
+\funp{P}(\pi_{0k}=j|\pi_{01}^{k-1},\pi_1^l,\tau_0^l,\varphi_0^l,\seq{t}) & = & \frac{1}{\varphi_0-k}
 \label{eq:6-13}
-\end{equation}
+\end{eqnarray}
 否则
-\begin{equation}
+\begin{eqnarray}
-\funp{P}(\pi_{0k}=j|\pi_{01}^{k-1},\pi_1^l,\tau_0^l,\varphi_0^l,\seq{t})=0
+\funp{P}(\pi_{0k}=j|\pi_{01}^{k-1},\pi_1^l,\tau_0^l,\varphi_0^l,\seq{t}) & = & 0
 \label{eq:6-14}
-\end{equation}
+\end{eqnarray}
 这样对于$t_0$所对应的$\tau_0$，就有
 {
 \begin{eqnarray}
-\prod_{k=1}^{\varphi_0}{\funp{P}(\pi_{0k}|\pi_{01}^{k-1},\pi_{1}^{l},\tau_{0}^{l},\varphi_{0}^{l},\seq{t})         }=\frac{1}{\varphi_{0}!}
+\prod_{k=1}^{\varphi_0}{\funp{P}(\pi_{0k}|\pi_{01}^{k-1},\pi_{1}^{l},\tau_{0}^{l},\varphi_{0}^{l},\seq{t})         } & = & \frac{1}{\varphi_{0}!}
 \label{eq:6-15}
 \end{eqnarray}
 }
 \parinterval 而上面提到的$t_0$所对应的这些空位置是如何生成的呢？即如何确定哪些位置是要放置空对齐的源语言单词。在IBM模型3中，假设在所有的非空对齐源语言单词都被生成出来后（共$\varphi_1+\varphi_2+\cdots {\varphi}_l$个非空对源语单词），这些单词后面都以$p_1$概率随机地产生一个“槽”用来放置空对齐单词。这样，${\varphi}_0$就服从了一个二项分布。于是得到
 {
 \begin{eqnarray}
-\funp{P}(\varphi_0|\seq{t})=\big(\begin{array}{c}
+\funp{P}(\varphi_0|\seq{t}) & = & \big(\begin{array}{c}
 \varphi_1+\varphi_2+\cdots \varphi_l\\
 \varphi_0\\
 \end{array}\big)p_0^{\varphi_1+\varphi_2+\cdots \varphi_l-\varphi_0}p_1^{\varphi_0}
@@ -318,7 +318,7 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \subsection{IBM 模型4}
-\parinterval IBM模型3仍然存在问题，比如，它不能很好地处理一个目标语言单词生成多个源语言单词的情况。这个问题在模型1和模型2中也存在。如果一个目标语言单词对应多个源语言单词，往往这些源语言单词构成短语或搭配。但是模型1-3把这些源语言单词看成独立的单元，而实际上它们是一个整体。这就造成了在模型1-3中这些源语言单词可能会“分散”开。为了解决这个问题，模型4对模型3进行了进一步修正。
+\parinterval IBM模型3仍然存在问题，比如，它不能很好地处理一个目标语言单词生成多个源语言单词的情况。这个问题在模型1和模型2中也存在。如果一个目标语言单词对应多个源语言单词，则这些源语言单词往往会构成短语。但是模型1-3把这些源语言单词看成独立的单元，而实际上它们是一个整体。这就造成了在模型1-3中这些源语言单词可能会“分散”开。为了解决这个问题，模型4对模型3进行了进一步修正。
 \parinterval 为了更清楚地阐述，这里引入新的术语\ \dash \ {\small\bfnew{概念单元}}\index{概念单元}或{\small\bfnew{概念}}\index{概念}（Concept）\index{Concept}。词对齐可以被看作概念之间的对应。这里的概念是指具有独立语法或语义功能的一组单词。依照Brown等人的表示方法\upcite{DBLP:journals/coling/BrownPPM94}，可以把概念记为cept.。每个句子都可以被表示成一系列的cept.。这里要注意的是，源语言句子中的cept.数量不一定等于目标句子中的cept.数量。因为有些cept. 可以为空，因此可以把那些空对的单词看作空cept.。比如，在图\ref{fig:6-8}的实例中，“了”就对应一个空cept.。
@@ -336,23 +336,23 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \parinterval 另外，可以用$\odot_{i}$表示位置为$[i]$的目标语言单词对应的那些源语言单词位置的平均值，如果这个平均值不是整数则对它向上取整。比如在本例中，目标语句中第4个cept. （“.”）对应在源语言句子中的第5个单词。可表示为${\odot}_{4}=5$。
 \parinterval 利用这些新引进的概念，模型4对模型3的扭曲度进行了修改。主要是把扭曲度分解为两类参数。对于$[i]$对应的源语言单词列表($\tau_{[i]}$)中的第一个单词($\tau_{[i]1}$），它的扭曲度用如下公式计算：
-\begin{equation}
+\begin{eqnarray}
-\funp{P}(\pi_{[i]1}=j|{\pi}_1^{[i]-1},{\tau}_0^l,{\varphi}_0^l,\seq{t})=d_{1}(j-{\odot}_{i-1}|A(t_{[i-1]}),B(s_j))
+\funp{P}(\pi_{[i]1}=j|{\pi}_1^{[i]-1},{\tau}_0^l,{\varphi}_0^l,\seq{t}) & = & d_{1}(j-{\odot}_{i-1}|A(t_{[i-1]}),B(s_j))
 \label{eq:6-22}
-\end{equation}
+\end{eqnarray}
 \noindent 其中，第$i$个目标语言单词生成的第$k$个源语言单词的位置用变量$\pi_{ik}$表示。而对于列表($\tau_{[i]}$)中的其他的单词($\tau_{[i]k},1 < k \le \varphi_{[i]}$)的扭曲度，用如下公式计算：
-\begin{equation}
+\begin{eqnarray}
-\funp{P}(\pi_{[i]k}=j|{\pi}_{[i]1}^{k-1},\pi_1^{[i]-1},\tau_0^l,\varphi_0^l,\seq{t})=d_{>1}(j-\pi_{[i]k-1}|B(s_j))
+\funp{P}(\pi_{[i]k}=j|{\pi}_{[i]1}^{k-1},\pi_1^{[i]-1},\tau_0^l,\varphi_0^l,\seq{t}) & = & d_{>1}(j-\pi_{[i]k-1}|B(s_j))
 \label{eq:6-23}
-\end{equation}
+\end{eqnarray}
 \parinterval 这里的函数$A(\cdot)$和函数$B(\cdot)$分别把目标语言和源语言的单词映射到单词的词类。这么做的目的是要减小参数空间的大小。词类信息通常可以通过外部工具得到，比如Brown聚类等。另一种简单的方法是把单词直接映射为它的词性。这样可以直接用现在已经非常成熟的词性标注工具解决问题。
-\parinterval 从上面改进的扭曲度模型可以看出，对于$t_{[i]}$生成的第一个源语言单词，要考虑中心$\odot_{[i]}$和这个源语言单词之间的绝对距离。实际上也就要把$t_{[i]}$生成的所有源语言单词看成一个整体并把它放置在合适的位置。这个过程要依据第一个源语言单词的词类和对应源语中心位置，和前一个非空对目标语言单词$t_{[i-1]}$的词类。而对于$t_{[i]}$生成的其他源语言单词，只需要考虑它与前一个刚放置完的源语言单词的相对位置和这个源语言单词的词类。
+\parinterval 从上面改进的扭曲度模型可以看出，对于$t_{[i]}$生成的第一个源语言单词，要考虑中心$\odot_{[i]}$和这个源语言单词之间的绝对距离。实际上也就要把$t_{[i]}$生成的所有源语言单词看成一个整体并把它放置在合适的位置。这个过程要依据第一个源语言单词的词类和对应的源语中心位置，以及前一个非空的目标语言单词$t_{[i-1]}$的词类。而对于$t_{[i]}$生成的其他源语言单词，只需要考虑它与前一个刚放置完的源语言单词的相对位置和这个源语言单词的词类。
-\parinterval 实际上，上述过程就要先用$t_{[i]}$生成的第一个源语言单词代表整个$t_{[i]}$生成的单词列表，并把第一个源语言单词放置在合适的位置。然后，相对于前一个刚生成的源语言单词，把列表中的其他单词放置在合适的地方。这样就可以在一定程度上保证由同一个目标语言单词生成的源语言单词之间可以相互影响，达到了改进的目的。
+\parinterval 实际上，上述过程要先用$t_{[i]}$生成的第一个源语言单词代表整个$t_{[i]}$生成的单词列表，并把第一个源语言单词放置在合适的位置。然后，相对于前一个刚生成的源语言单词，把列表中的其他单词放置在合适的地方。这样就可以在一定程度上保证由同一个目标语言单词生成的源语言单词之间可以相互影响，达到了改进的目的。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -360,7 +360,7 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \subsection{ IBM 模型5}
-\parinterval 模型3和模型4并不是“准确”的模型。这两个模型会把一部分概率分配给一些根本就不存在的句子。这个问题被称作IBM模型3和模型4的{\small\bfnew{缺陷}}\index{缺陷}（Deficiency）\index{Deficiency}。说得具体一些，模型3和模型4 中并没有这样的约束：如果已经放置了某个源语言单词的位置不能再放置其他单词，也就是说句子的任何位置只能放置一个词，不能多也不能少。由于缺乏这个约束，模型3和模型4中在所有合法的词对齐上概率和不等于1。 这部分缺失的概率被分配到其他不合法的词对齐上。举例来说，如图\ref{fig:6-9}所示，“吃/早饭”和“have breakfast”之间的合法词对齐用直线表示 。但是在模型3和模型4中， 它们的概率和为$0.9<1$。 损失掉的概率被分配到像5和6这样的对齐上了（红色）。虽然IBM模型并不支持一对多的对齐，但是模型3和模型4把概率分配给这些“ 不合法”的词对齐上，因此也就产生所谓的缺陷。
+\parinterval 模型3和模型4并不是“准确”的模型。这两个模型会把一部分概率分配给一些根本就不存在的句子。这个问题被称作IBM模型3和模型4的{\small\bfnew{缺陷}}\index{缺陷}（Deficiency）\index{Deficiency}。说得具体一些，模型3和模型4 中并没有这样的约束：如果已经放置了某个源语言单词的位置不能再放置其他单词，也就是说句子的任何位置只能放置一个词，不能多也不能少。由于缺乏这个约束，模型3和模型4中在所有合法的词对齐上概率和不等于1。 这部分缺失的概率被分配到其他不合法的词对齐上。举例来说，如图\ref{fig:6-9}所示，“吃/早饭”和“have breakfast”之间的合法词对齐用直线表示 。但是在模型3和模型4中， 它们的概率和为$0.9<1$。 损失掉的概率被分配到像a5和a6这样的对齐上了（红色）。虽然IBM模型并不支持一对多的对齐，但是模型3和模型4把概率分配给这些“ 不合法”的词对齐上，因此也就产生所谓的缺陷。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -385,7 +385,7 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \label{eq:6-25}
 \end{eqnarray}
-\noindent 这里，因子$1-\delta(v_j, v_{j-1})$是用来判断第$j$个位置是不是为空。如果第$j$个位置为空则$v_j = v_{j-1}$，这样$\funp{P}(\pi_{[i]1}=j|\pi_1^{[i]-1}, \tau_0^l, \varphi_0^l, \seq{t}) = 0$。这样就从模型上避免了模型3和模型4中生成不存在的字符串的问题。这里还要注意的是，对于放置第一个单词的情况，影响放置的因素有$v_j$，$B(s_i)$和$v_{j-1}$。此外还要考虑位置$j$放置了第一个源语言单词以后它的右边是不是还有足够的位置留给剩下的$k-1$个源语言单词。参数$v_m-(\varphi_{[i]}-1)$正是为了考虑这个因素，这里$v_m$表示整个源语言句子中还有多少空位置，$\varphi_{[i]}-1$ 表示源语言位置$j$右边至少还要留出的空格数。对于放置非第一个单词的情况，主要是要考虑它和前一个放置位置的相对位置。这主要体现在参数$v_j-v_{\varphi_{[i]}k-1}$上。式\eqref{eq:6-25} 的其他部分都可以用上面的理论解释，这里不再赘述。
+\noindent 这里，因子$1-\delta(v_j, v_{j-1})$是用来判断第$j$个位置是不是为空。如果第$j$个位置为空则$v_j = v_{j-1}$，这样$\funp{P}(\pi_{[i]1}=j|\pi_1^{[i]-1}, \tau_0^l, \varphi_0^l, \seq{t}) = 0$。这样就从模型上避免了模型3和模型4中生成不存在的字符串的问题。这里还要注意的是，对于放置第一个单词的情况，影响放置的因素有$v_j$，$B(s_i)$和$v_{j-1}$。此外还要考虑位置$j$放置了第一个源语言单词以后它的右边是不是还有足够的位置留给剩下的$k-1$个源语言单词。参数$v_m-(\varphi_{[i]}-1)$正是为了解决这个问题，这里$v_m$表示整个源语言句子中还有多少空位置，$\varphi_{[i]}-1$ 表示源语言位置$j$右边至少还要留出的空格数。对于放置非第一个单词的情况，主要是要考虑它和前一个放置位置的相对位置。这主要体现在参数$v_j-v_{\varphi_{[i]}k-1}$上。式\eqref{eq:6-25} 的其他部分都可以用上面的理论解释，这里不再赘述。
 \parinterval 实际上，模型5和模型4的思想基本一致，即，先确定$\tau_{[i]1}$的绝对位置，然后再确定$\tau_{[i]}$中剩余单词的相对位置。模型5消除了产生不存在的句子的可能性，不过模型5的复杂性也大大增加了。
 %----------------------------------------------------------------------------------------
@@ -395,9 +395,9 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \sectionnewpage
 \section{解码和训练}
-\parinterval 与IBM模型1一样，IBM模型2-5和隐马尔可夫模型的解码可以直接使用{\chapterfive}所描述的方法。基本思路与{\chaptertwo}所描述的自左向右搜索方法一致，即：对译文自左向右生成，每次扩展一个源语言单词的翻译，即把源语言单词的译文放到已经生成的译文的右侧。每次扩展可以选择不同的源语言单词或者同一个源语言单词的不同翻译候选，这样就可以得到多个不同的扩展译文。在这个过程中，同时计算翻译模型和语言模型的得分，对每个得到译文候选打分。最终，保留一个或者多个译文。这个过程重复执行直至所有源语言单词被翻译完。
+\parinterval 与IBM模型1一样，IBM模型2-5和隐马尔可夫模型的解码可以直接使用{\chapterfive}所描述的方法。基本思路与{\chaptertwo}所描述的自左向右搜索方法一致，即：对译文自左向右生成，每次扩展一个源语言单词的翻译，即把源语言单词的译文放到已经生成的译文的右侧。每次扩展可以选择不同的源语言单词或者同一个源语言单词的不同翻译候选，这样就可以得到多个不同的扩展译文。在这个过程中，同时计算翻译模型和语言模型的得分，对每个得到的译文候选打分。最终，保留一个或者多个译文。这个过程重复执行直至所有源语言单词被翻译完。
-\parinterval 类似的，IBM模型2-5和隐马尔可夫模型也都可以使用期望最大化（EM）方法进行模型训练。相关数学推导可参考附录\ref{appendix-B}的内容。通常，可以使用这些模型获得双语句子间的词对齐结果，比如使用GIZA++工具。这时，往往会使用多个模型，把简单的模型训练后的参数作为初始值送给后面更加复杂的模型。比如，先用IBM模型1训练，之后把参数送给IBM模型2，再训练，之后把参数送给隐马尔可夫模型等。值得注意的是，并不是所有的模型使用EM算法都能找到全局最优解。特别是IBM模型3-5的训练中使用一些剪枝和近似的方法，优化的真实目标函数会更加复杂。不过，IBM模型1是一个{\small\bfnew{凸函数}}\index{凸函数}（Convex Function）\index{Convex Function}，因此理论上使用EM方法能够找到全局最优解。更实际的好处是，IBM 模型1训练的最终结果与参数的初始化过程无关。这也是为什么在使用IBM 系列模型时，往往会使用IBM模型1作为起始模型的原因。
+\parinterval 类似的，IBM模型2-5和隐马尔可夫模型也都可以使用期望最大化（EM）方法进行模型训练。相关数学推导可参考附录\ref{appendix-B}的内容。通常，可以使用这些模型获得双语句子间的词对齐结果，比如使用GIZA++工具。这时，往往会使用多个模型，把简单的模型训练后的参数作为初始值传给后面更加复杂的模型。比如，先用IBM模型1训练，之后把参数送给IBM模型2，再训练，之后把参数送给隐马尔可夫模型等。值得注意的是，并不是所有的模型使用EM算法都能找到全局最优解。特别是IBM模型3-5的训练中使用一些剪枝和近似的方法，优化的真实目标函数会更加复杂。不过，IBM模型1是一个{\small\bfnew{凸函数}}\index{凸函数}（Convex Function）\index{Convex Function}，因此理论上使用EM方法能够找到全局最优解。更实际的好处是，IBM 模型1训练的最终结果与参数的初始化过程无关。这也是为什么在使用IBM 系列模型时，往往会使用IBM模型1作为起始模型的原因。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -428,13 +428,13 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \parinterval IBM模型的缺陷是指翻译模型会把一部分概率分配给一些根本不存在的源语言字符串。如果用$\funp{P}(\textrm{well}|\seq{t})$表示$\funp{P}(\seq{s}| \seq{t})$在所有的正确的（可以理解为语法上正确的）$\seq{s}$上的和，即
 \begin{eqnarray}
-\funp{P}(\textrm{well}|\seq{t})=\sum_{\seq{s}\textrm{\;is\;well\;formed}}{\funp{P}(\seq{s}| \seq{t})}
+\funp{P}(\textrm{well}|\seq{t}) & = & \sum_{\seq{s}\textrm{\;is\;well\;formed}}{\funp{P}(\seq{s}| \seq{t})}
 \label{eq:6-26}
 \end{eqnarray}
 \parinterval 类似地，用$\funp{P}(\textrm{ill}|\seq{t})$表示$\funp{P}(\seq{s}| \seq{t})$在所有的错误的（可以理解为语法上错误的）$\seq{s}$上的和。如果$\funp{P}(\textrm{well}|\seq{t})+ \funp{P}(\textrm{ill}|\seq{t})<1$，就把剩余的部分定义为$\funp{P}(\textrm{failure}|\seq{t})$。它的形式化定义为，
 \begin{eqnarray}
-\funp{P}({\textrm{failure}|\seq{t}})  = 1 - \funp{P}({\textrm{well}|\seq{t}}) - \funp{P}({\textrm{ill}|\seq{t}})
+\funp{P}({\textrm{failure}|\seq{t}}) & = & 1 - \funp{P}({\textrm{well}|\seq{t}}) - \funp{P}({\textrm{ill}|\seq{t}})
 \label{eq:6-27}
 \end{eqnarray}
@@ -452,7 +452,7 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \parinterval 在IBM模型中，$\funp{P}(\seq{t})\funp{P}(\seq{s}| \seq{t})$会随着目标语言句子长度的增加而减少，因为这种模型有多个概率化的因素组成，乘积项越多结果的值越小。这也就是说，IBM模型会更倾向选择长度短一些的目标语言句子。显然这种对短句子的偏向性并不是机器翻译所期望的。
-\parinterval 这个问题在很多机器翻译系统中都存在。它实际上也是了一种{\small\bfnew{系统偏置}}\index{系统偏置}（System Bias）\index{System Bias}的体现。为了消除这种偏置，可以通过在模型中增加一个短句子惩罚引子来抵消掉模型对短句子的倾向性。比如，可以定义一个惩罚引子，它的值随着长度的减少而增加。不过，简单引入这样的惩罚因子会导致模型并不符合一个严格的噪声信道模型。它对应一个基于判别式框架的翻译模型，这部分内容会在{\chapterseven}进行介绍。
+\parinterval 这个问题在很多机器翻译系统中都存在。它实际上也是了一种{\small\bfnew{系统偏置}}\index{系统偏置}（System Bias）\index{System Bias}的体现。为了消除这种偏置，可以通过在模型中增加一个短句子惩罚因子来抵消掉模型对短句子的倾向性。比如，可以定义一个惩罚因子，它的值随着长度的减少而增加。不过，简单引入这样的惩罚因子会导致模型并不符合一个严格的噪声信道模型。它对应一个基于判别式框架的翻译模型，这部分内容会在{\chapterseven}进行介绍。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -460,7 +460,7 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \subsection{其他问题}
-\parinterval 模型5的意义是什么？模型5的提出是为了消除模型3和模型4的缺陷。缺陷的本质是，$\funp{P}(\seq{s},\seq{a}| \seq{t})$在所有合理的对齐上概率和不为1。 但是，在这里更关心是哪个对齐$\seq{a}$使$\funp{P}(\seq{s},\seq{a}| \seq{t})$达到最大，即使$\funp{P}(\seq{s},\seq{a}|\seq{t})$不符合概率分布的定义，也并不影响我们寻找理想的对齐$\seq{a}$。从工程的角度说，$\funp{P}(\seq{s},\seq{a}| \seq{t})$不归一并不是一个十分严重的问题。遗憾的是，实际上到现在为止有太多对IBM模型3和模型4中的缺陷进行过系统的实验和分析，但对于这个问题到底有多严重并没有定论。当然用模型5是可以解决这个问题。但是如果用一个非常复杂的模型去解决了一个并不产生严重后果的问题，那这个模型也就没有太大意义了（从实践的角度）。
+\parinterval 模型5的意义是什么？模型5的提出是为了消除模型3和模型4的缺陷。缺陷的本质是，$\funp{P}(\seq{s},\seq{a}| \seq{t})$在所有合理的对齐上概率和不为1。 但是，在这里更关心是哪个对齐$\seq{a}$使$\funp{P}(\seq{s},\seq{a}| \seq{t})$达到最大，即使$\funp{P}(\seq{s},\seq{a}|\seq{t})$不符合概率分布的定义，也并不影响我们寻找理想的对齐$\seq{a}$。从工程的角度说，$\funp{P}(\seq{s},\seq{a}| \seq{t})$不归一并不是一个十分严重的问题。遗憾的是，实际上到现在为止有太多对IBM模型3和模型4中的缺陷进行系统性的实验和分析，但对于这个问题到底有多严重并没有定论。当然用模型5是可以解决这个问题。但是如果用一个非常复杂的模型去解决了一个并不产生严重后果的问题，那这个模型也就没有太大意义了（从实践的角度）。
 \parinterval 概念（cept.）的意义是什么？经过前面的分析可知，IBM模型的词对齐模型使用了cept.这个概念。但是，在IBM模型中使用的cept.最多只能对应一个目标语言单词（模型并没有用到源语言cept. 的概念）。因此可以直接用单词代替cept.。这样，即使不引入cept.的概念，也并不影响IBM模型的建模。实际上，cept.的引入确实可以帮助我们从语法和语义的角度解释词对齐过程。不过，这个方法在IBM 模型中的效果究竟如何还没有定论。

--- a/Chapter7/Figures/figure-basic-process-of-translation.tex
+++ b/Chapter7/Figures/figure-basic-process-of-translation.tex
@@ -11,7 +11,7 @@
 \node[anchor=east] (t0) at (-0.5em, -1.5) {$\seq{t}$：};
-\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\footnotesize{(a)\ }};
+\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\small{(a)\ }};
 \end{scope}
@@ -29,7 +29,7 @@
 \path[<->, thick] (s2.south) edge (t1.north);
 }
-\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\footnotesize{(b)\ }};
+\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\small{(b)\ }};
 \end{scope}
@@ -50,7 +50,7 @@
 \node[anchor=west,fill=red!20] (t2) at ([xshift=1em]t1.east) {\footnotesize{an apple}};
 \path[<->, thick] (s3.south) edge (t2.north);
 }
-\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\footnotesize{(c)\ }};
+\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\small{(c)\ }};
 \end{scope}
@@ -76,6 +76,6 @@
 \node[anchor=west,fill=red!20] (t3) at ([xshift=1em]t2.east) {\footnotesize{on the table}};
 \path[<->, thick] (s1.south) edge (t3.north);
 }
-\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\footnotesize{(d)\ }};
+\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\small{(d)\ }};
 \end{scope}
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter7/Figures/figure-example-of-hypothesis-recombination.tex
+++ b/Chapter7/Figures/figure-example-of-hypothesis-recombination.tex
@@ -5,7 +5,7 @@
 {
 \node [anchor=north,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h0) at (0,0) {\small{null}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl0) at (h0.north west) {\scriptsize{{\color{white} \textbf{0}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\tiny{{\color{white} \textbf{$\funp{P}$=1}}}};
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.2em,yshift=3.5em]h0.east) {\small{an}};
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h3) at ([xshift=2.2em]h2.east) {\small{apple}};
@@ -13,8 +13,8 @@
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl2) at (h2.north west) {\scriptsize{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl3) at (h3.north west) {\scriptsize{{\color{white} \textbf{2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt2) at (h2.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.3}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt2) at (h2.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.3}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h3.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h3.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};
 \draw [->,very thick,ublue] ([xshift=0.1em]pt0.south) -- ([xshift=-0.1em]h2.west);
 \draw [->,very thick,ublue] ([xshift=0.1em]pt2.south) -- ([xshift=-0.1em]h3.west);
@@ -22,7 +22,7 @@
 {
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h1) at ([xshift=7em]h0.east) {\small{an apple}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl1) at (h1.north west) {\scriptsize{{\color{white} \textbf{1-2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt1) at (h1.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt1) at (h1.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};
 \draw [->,very thick,ublue] ([xshift=0.1em]pt0.south) -- ([xshift=-0.1em]h1.west);
 }
 }
@@ -37,10 +37,10 @@
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h6.north west) {\scriptsize{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h8.north west) {\scriptsize{{\color{white} \textbf{2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt4) at (h4.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt4) at (h4.east) {\tiny{{\color{white} \textbf{$\funp{P}$=1}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt5) at (h5.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.3}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt5) at (h5.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.3}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt6) at (h6.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.4}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt6) at (h6.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.4}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt8) at (h8.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt8) at (h8.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.2}}}};
 \draw [->,very thick,ublue] ([xshift=0.1em]pt4.south) -- ([xshift=-0.1em]h5.west);
 \draw [->,very thick,ublue] ([xshift=0.1em]pt4.south) -- ([xshift=-0.1em]h6.west);
@@ -48,7 +48,7 @@
 {
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h7) at ([xshift=2.2em]h5.east) {\small{is not}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt7) at (h7.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt7) at (h7.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.2}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h7.north west) {\scriptsize{{\color{white} \textbf{2}}}};
 \draw [->,very thick,ublue] ([xshift=0.1em]pt5.south) -- ([xshift=-0.1em]h7.west);
 }
@@ -56,7 +56,7 @@
 \node[anchor=north] (l1) at ([xshift=5.5em,yshift=-1em]h0.south) {\scriptsize{原假设}};
 \node[anchor=north] (l2) at ([xshift=5.5em,yshift=-1em]h4.south) {\scriptsize{原假设}};
-\node[anchor=north] (part1) at ([xshift=16em,yshift=-2em]h0.south){\scriptsize{（a）译文相同时的假设重组}};
+\node[anchor=north] (part1) at ([xshift=16em,yshift=-3em]h0.south){\small{（a）译文相同时的假设重组}};
 \end{scope}
@@ -68,7 +68,7 @@
 {
 \node [anchor=north,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h0) at (0,0) {\small{null}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl0) at (h0.north west) {\scriptsize{{\color{white} \textbf{0}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\tiny{{\color{white} \textbf{$\funp{P}$=1}}}};
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.2em,yshift=3.5em]h0.east) {\small{an}};
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h3) at ([xshift=2.2em]h2.east) {\small{apple}};
@@ -76,8 +76,8 @@
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl2) at (h2.north west) {\scriptsize{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl3) at (h3.north west) {\scriptsize{{\color{white} \textbf{2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt2) at (h2.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.3}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt2) at (h2.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.3}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h3.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h3.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};
 \draw [->,very thick,ublue] ([xshift=0.1em]pt0.south) -- ([xshift=-0.1em]h2.west);
 \draw [->,very thick,ublue] ([xshift=0.1em]pt2.south) -- ([xshift=-0.1em]h3.west);
@@ -97,10 +97,10 @@
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h6.north west) {\scriptsize{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h8.north west) {\scriptsize{{\color{white} \textbf{2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt4) at (h4.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt4) at (h4.east) {\tiny{{\color{white} \textbf{$\funp{P}$=1}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt5) at (h5.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.3}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt5) at (h5.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.3}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt6) at (h6.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.4}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt6) at (h6.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.4}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt8) at (h8.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt8) at (h8.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.2}}}};
 \draw [->,very thick,ublue] ([xshift=0.1em]pt4.south) -- ([xshift=-0.1em]h5.west);
 \draw [->,very thick,ublue] ([xshift=0.1em]pt4.south) -- ([xshift=-0.1em]h6.west);
@@ -115,11 +115,11 @@
 {
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em,opacity=0.6] (h1) at ([xshift=7em]h0.east) {\small{an apple}};
 \node [anchor=north west,inner sep=1.0pt,fill=black,opacity=0.6] (hl1) at (h1.north west) {\scriptsize{{\color{white} \textbf{1-2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.6] (pt1) at (h1.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.6] (pt1) at (h1.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};
 }
 {
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em,opacity=0.6] (h7) at ([xshift=2.2em]h5.east) {\small{is not}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.6] (pt7) at (h7.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.6] (pt7) at (h7.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.2}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black,opacity=0.6] (hl5) at (h7.north west) {\scriptsize{{\color{white} \textbf{2}}}};
 }
 }
@@ -131,7 +131,7 @@
 \node[anchor=north] (l1) at ([xshift=7.5em,yshift=-1em]h0.south) {\scriptsize{重组假设}};
 \node[anchor=north] (l2) at ([xshift=7.5em,yshift=-1em]h4.south) {\scriptsize{重组假设}};
-\node[anchor=north] (part2) at ([xshift=0em,yshift=-14em]h0.south){\scriptsize{（b）译文不同时的假设重组}};
+\node[anchor=north] (part2) at ([xshift=0em,yshift=-14em]h0.south){\small{（b）译文不同时的假设重组}};
 \end{scope}

--- a/Chapter7/Figures/figure-example-of-stack-decode.tex
+++ b/Chapter7/Figures/figure-example-of-stack-decode.tex
@@ -6,7 +6,7 @@
 {
 \node [anchor=north,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h0) at (0,0) {\scriptsize{null}};
 \node [anchor=north west,inner sep=1.5pt,fill=black] (hl0) at (h0.north west) {\scriptsize{{\color{white} \textbf{0}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=1}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\tiny{{\color{white} \textbf{$\funp{P}$=1}}}};
 }
 {
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h13) at ([xshift=2.1em,yshift=6em]h0.east) {\scriptsize{there is}};
@@ -17,8 +17,8 @@
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl3) at (h13.north west) {\scriptsize{{\color{white} \textbf{3}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt1) at (h1.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=.2}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt1) at (h1.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h13.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h13.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.1em]h1.east) {\scriptsize{have}};
 \node [anchor=west,inner sep=2pt,minimum height=2em,minimum width=3em] (h22) at ([xshift=2.1em]h12.east) {\small{\textbf{...}}};
@@ -32,10 +32,10 @@
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl3) at (h3.north west) {\scriptsize{{\color{white} \textbf{2}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl33) at (h33.north west) {\scriptsize{{\color{white} \textbf{4-5}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt2) at (h2.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt2) at (h2.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt23) at (h23.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt23) at (h23.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h3.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h3.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt33) at (h33.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt33) at (h33.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};
 }
 \node [anchor=north] (l0) at ([xshift=0.2em,yshift=-0.7em]h0.south) {\small{\textbf{未译词}}};
 \node [anchor=north] (l1) at ([xshift=0.3em,yshift=-0.7em]h1.south) {\small{\textbf{已译}1\textbf{词}}};

--- a/Chapter7/Figures/figure-judge-type-of-reorder-method.tex
+++ b/Chapter7/Figures/figure-judge-type-of-reorder-method.tex
@@ -105,8 +105,8 @@
 \node[anchor=north] (m1) at ([xshift=0.6em,yshift=0.1em]b05.east) {M};
 }
-\node[anchor=north] (l1) at ([xshift=1.8em,yshift=-0.5em]a10.south) {\scriptsize{基于词}};
+\node[anchor=north] (l1) at ([xshift=1.8em,yshift=-0.5em]a10.south) {\small{基于词}};
-\node[anchor=north] (l2) at ([xshift=2.2em,yshift=-0.5em]b10.south) {\scriptsize{基于短语}};
+\node[anchor=north] (l2) at ([xshift=2.2em,yshift=-0.5em]b10.south) {\small{基于短语}};
 \end{scope}

--- a/Chapter7/Figures/figure-search-space-representation-of-feature-weight.tex
+++ b/Chapter7/Figures/figure-search-space-representation-of-feature-weight.tex
@@ -27,7 +27,7 @@
 \node[anchor=north] (label3) at ([xshift=0em,yshift=-2.5em]label2.north) {取值};	
 }
-\node[anchor=north] (l1) at ([xshift=0em,yshift=-2.5em]x3.south) {\footnotesize{(a)搜索空间}};
+\node[anchor=north] (l1) at ([xshift=0em,yshift=-2.5em]x3.south) {\small{(a)搜索空间}};
 \end{scope}
 \begin{scope}[scale=0.55,xshift=3.2in] 
@@ -68,7 +68,7 @@
 \node[anchor=north] (e4) at ([xshift=0,yshift=-0.2em]e3.south) {$w_M = 1.00$};
 }
-\node[anchor=north] (l1) at ([xshift=0em,yshift=-2.5em]x3.south) {\footnotesize{(b)一条搜索路径}};
+\node[anchor=north] (l1) at ([xshift=0em,yshift=-2.5em]x3.south) {\small{(b)一条搜索路径}};
 \end{scope}
 \begin{scope}[scale=0.55,xshift=6.8in] 
@@ -119,6 +119,6 @@
 \node[anchor=north] (label2) at ([xshift=0em,yshift=-2.5em]label1.north) {种组合};
 }
-\node[anchor=north] (l1) at ([xshift=0em,yshift=-2.5em]x3.south) {\footnotesize{(c)多条搜索路径}};
+\node[anchor=north] (l1) at ([xshift=0em,yshift=-2.5em]x3.south) {\small{(c)多条搜索路径}};
 \end{scope}
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter7/Figures/figure-translation-hypothesis-extension.tex
+++ b/Chapter7/Figures/figure-translation-hypothesis-extension.tex
@@ -5,24 +5,24 @@
 \begin{scope}
 {
 \node [anchor=north,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3.5em] (h0) at (0,0) {\small{null}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl0) at (h0.north west) {\scriptsize{{\color{white} \textbf{0}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl0) at (h0.north west) {\tiny{{\color{white} \textbf{0}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};
+\node [anchor=north,inner sep=1pt,minimum width=3.5em,fill=black] (pt0) at (h0.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};
 }
 {
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3.5em] (h1) at ([xshift=3em]h0.east) {\small{on}};
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3.5em] (h2) at ([xshift=3em,yshift=3em]h0.east) {\small{table}};
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3.5em] (h3) at ([xshift=3em,yshift=-3em]h0.east) {\small{there is}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl1) at (h1.north west) {\scriptsize{{\color{white} \textbf{2}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl1) at (h1.north west) {\tiny{{\color{white} \textbf{2}}}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl2) at (h2.north west) {\scriptsize{{\color{white} \textbf{1}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl2) at (h2.north west) {\tiny{{\color{white} \textbf{1}}}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl3) at (h3.north west) {\scriptsize{{\color{white} \textbf{3}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl3) at (h3.north west) {\tiny{{\color{white} \textbf{3}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt1) at (h1.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
+\node [anchor=north,inner sep=1pt,minimum width=3.5em,fill=black] (pt1) at (h1.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=0.2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt2) at (h2.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.3}}}};
+\node [anchor=north,inner sep=1pt,minimum width=3.5em,fill=black] (pt2) at (h2.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=0.3}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h3.east) {\footnotesize{{\color{white} \textbf{P=.5}}}};
+\node [anchor=north,inner sep=1pt,minimum width=3.5em,fill=black] (pt3) at (h3.south) {\footnotesize{{\color{white} \textbf{P=0.5}}}};
-\draw [->,very thick,ublue] ([xshift=0.1em]pt0.south) -- ([xshift=-0.1em]h1.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h0.east) -- ([xshift=-0.1em]h1.west);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt0.south) -- ([xshift=-0.1em]h2.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h0.east) -- ([xshift=-0.1em]h2.west);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt0.south) -- ([xshift=-0.1em]h3.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h0.east) -- ([xshift=-0.1em]h3.west);
 }
 {
@@ -32,40 +32,40 @@
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=4em] (h7) at ([xshift=3em,yshift=1.2em]h5.east) {\small{on the table}};
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=4.6em] (h8) at ([xshift=3em,yshift=-2em]h5.east) {\small{\ \;apple}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl4) at (h4.north west) {\scriptsize{{\color{white} \textbf{4}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl4) at (h4.north west) {\tiny{{\color{white} \textbf{4}}}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl5) at (h5.north west) {\scriptsize{{\color{white} \textbf{4-5}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl5) at (h5.north west) {\tiny{{\color{white} \textbf{4-5}}}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl6) at (h6.north west) {\scriptsize{{\color{white} \textbf{1}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl6) at (h6.north west) {\tiny{{\color{white} \textbf{1}}}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl7) at (h7.north west) {\scriptsize{{\color{white} \textbf{1-2}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl7) at (h7.north west) {\tiny{{\color{white} \textbf{1-2}}}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl8) at (h8.north west) {\scriptsize{{\color{white} \textbf{5}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl8) at (h8.north west) {\tiny{{\color{white} \textbf{5}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt4) at (h4.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.1}}}};
+\node [anchor=north,inner sep=1pt,minimum width=3.5em,fill=black] (pt4) at (h4.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=0.1}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt5) at (h5.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.4}}}};
+\node [anchor=north,inner sep=1pt,minimum width=3.5em,fill=black] (pt5) at (h5.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=0.4}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt6) at (h6.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.3}}}};
+\node [anchor=north,inner sep=1pt,minimum width=3.5em,fill=black] (pt6) at (h6.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=0.3}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt7) at (h7.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.4}}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.6em,fill=black] (pt7) at (h7.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=0.4}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt8) at (h8.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.6em,fill=black] (pt8) at (h8.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=0.2}}}};
-\draw [->,very thick,ublue] ([xshift=0.1em]pt1.south) -- ([xshift=1em,yshift=0.7em]pt1.south);
+\draw [->,very thick,ublue] ([xshift=0.1em]h6.east) -- ([xshift=1em,yshift=0.7em]h6.east);
+\draw [->,very thick,ublue] ([xshift=0.1em]h6.east) -- ([xshift=1em,yshift=-0.7em]h6.east);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt2.south) -- ([xshift=1em,yshift=-0.7em]pt2.south);
+\draw [->,very thick,ublue] ([xshift=0.1em]h2.east) -- ([xshift=1em,yshift=0.7em]h2.east);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt2.south) -- ([xshift=1em,yshift=0.7em]pt2.south);
+\draw [->,very thick,ublue] ([xshift=0.1em]h2.east) -- ([xshift=1em,yshift=-0.7em]h2.east);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt6.south) -- ([xshift=1em,yshift=-0.7em]pt6.south);
+\draw [->,very thick,ublue] ([xshift=0.1em]h1.east) -- ([xshift=1em,yshift=0.7em]h1.east);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt6.south) -- ([xshift=1em,yshift=0.7em]pt6.south);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt3.south) -- ([xshift=-0.1em]h4.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h3.east) -- ([xshift=-0.1em]h4.west);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt3.south) -- ([xshift=-0.1em]h5.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h3.east) -- ([xshift=-0.1em]h5.west);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt3.south) -- ([xshift=-0.1em]h6.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h3.east) -- ([xshift=-0.1em]h6.west);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt5.south) -- ([xshift=-0.1em]h7.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h5.east) -- ([xshift=-0.1em]h7.west);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt5.south) -- ([xshift=1em,yshift=-0.7em]pt5.south);
+\draw [->,very thick,ublue] ([xshift=0.1em]h5.east) -- ([xshift=1em,yshift=-0.7em]h5.east);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt4.south) -- ([xshift=-0.1em]h8.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h4.east) -- ([xshift=-0.1em]h8.west);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt4.south) -- ([xshift=1em,yshift=-0.7em]pt4.south);
+\draw [->,very thick,ublue] ([xshift=0.1em]h4.east) -- ([xshift=1em,yshift=-0.7em]h4.east);
 }
 {
-\draw [->,ultra thick,red,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.5em]h0.west) -- ([xshift=0.7em,yshift=-0.5em]h0.east) -- ([xshift=-0.2em,yshift=-0.5em]h3.west) -- ([xshift=0.8em,yshift=-0.5em]h3.east) -- ([xshift=-0.2em,yshift=-0.5em]h5.west) -- ([xshift=0.8em,yshift=-0.5em]h5.east) -- ([xshift=-0.2em,yshift=-0.5em]h7.west) -- ([xshift=1.5em,yshift=-0.5em]h7.east);
+\draw [->,ultra thick,red,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.6em]h0.west) -- ([xshift=0.0em,yshift=-0.6em]h0.east) -- ([xshift=-0.2em,yshift=-0.6em]h3.west) -- ([xshift=0.0em,yshift=-0.6em]h3.east) -- ([xshift=-0.2em,yshift=-0.6em]h5.west) -- ([xshift=0.0em,yshift=-0.6em]h5.east) -- ([xshift=-0.2em,yshift=-0.6em]h7.west) -- ([xshift=1.5em,yshift=-0.6em]h7.east);
-\node [anchor=north west] (wtranslabel) at ([yshift=-3em]h0.south west) {\small{翻译路径：}};
+\node [anchor=north west] (wtranslabel) at ([yshift=-4.3em]h0.south west) {\small{翻译路径：}};
 \draw [->,ultra thick,red,line width=1.5pt,opacity=0.7] (wtranslabel.east) -- ([xshift=1.5em]wtranslabel.east);
 }
 \end{scope}

--- a/Chapter7/Figures/figure-word-and-phrase-translation-regard-as-path.tex
+++ b/Chapter7/Figures/figure-word-and-phrase-translation-regard-as-path.tex
@@ -19,59 +19,59 @@
 \draw [->,very thick,ublue] (s5.south) -- ([yshift=-0.7em]s5.south);
 {\small
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t11) at ([yshift=-1em]s1.south) {I};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=2.5em] (t11) at ([yshift=-1em]s1.south) {I};
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t12) at ([yshift=-0.2em]t11.south) {me};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=2.5em] (t12) at ([yshift=-0.8em]t11.south) {me};
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t13) at ([yshift=-0.2em]t12.south) {I'm};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=2.5em] (t13) at ([yshift=-0.8em]t12.south) {I'm};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl11) at (t11.north west) {\tiny{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl12) at (t12.north west) {\tiny{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl13) at (t13.north west) {\tiny{{\color{white} \textbf{1}}}};
 {
-\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=6.55em] (t14) at ([yshift=-0.2em]t13.south west) {I'm};
+\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=6.55em] (t14) at ([yshift=-0.8em]t13.south west) {I'm \quad \quad};
-\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=6.55em] (t15) at ([yshift=-0.2em]t14.south west) {I};
+\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=6.55em] (t15) at ([yshift=-0.8em]t14.south west) {I \quad \quad};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl14) at (t14.north west) {\tiny{{\color{white} \textbf{1-2}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl15) at (t15.north west) {\tiny{{\color{white} \textbf{1-2}}}};
 }
-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t21) at ([yshift=-1em]s2.south) {to};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=2.5em] (t21) at ([yshift=-1em]s2.south) {to};
-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t22) at ([yshift=-0.2em]t21.south) {with};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=2.5em] (t22) at ([yshift=-0.8em]t21.south) {with};
-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t23) at ([yshift=-0.2em]t22.south) {for};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=2.5em] (t23) at ([yshift=-0.8em]t22.south) {for};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl21) at (t21.north west) {\tiny{{\color{white} \textbf{2}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl22) at (t22.north west) {\tiny{{\color{white} \textbf{2}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl23) at (t23.north west) {\tiny{{\color{white} \textbf{2}}}};
 {
-\node [anchor=north west,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=6.55em] (t24) at ([yshift=-0.2em,xshift=-2.6em]t15.south east) {for you};
+\node [anchor=north west,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=6.55em] (t24) at ([yshift=-0.8em,xshift=-2.6em]t15.south east) {for you};
-\node [anchor=north west,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=6.55em] (t25) at ([yshift=-0.2em]t24.south west) {with you};
+\node [anchor=north west,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=6.55em] (t25) at ([yshift=-0.8em]t24.south west) {with you};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl24) at (t24.north west) {\tiny{{\color{white} \textbf{2-3}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl25) at (t25.north west) {\tiny{{\color{white} \textbf{2-3}}}};
 }
-\node [anchor=north,inner sep=2pt,fill=blue!20,minimum height=1.5em,minimum width=2.5em] (t31) at ([yshift=-1em]s3.south) {you};
+\node [anchor=north,inner sep=2pt,fill=blue!20,minimum height=1.6em,minimum width=2.5em] (t31) at ([yshift=-1em]s3.south) {you};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl31) at (t31.north west) {\tiny{{\color{white} \textbf{3}}}};
 {
-\node [anchor=west,inner sep=2pt,fill=blue!20,minimum height=1.5em,minimum width=13.35em] (t32) at ([xshift=1.4em]t14.east) {you are satisfied};
+\node [anchor=west,inner sep=2pt,fill=blue!20,minimum height=1.6em,minimum width=13.35em] (t32) at ([xshift=1.4em]t14.east) {\quad \; you are satisfied};
-\node [anchor=north west,inner sep=2pt,fill=blue!20,minimum height=1.5em,minimum width=7.45em] (t33) at ([yshift=-0.2em]t32.south west) {$\phi$};
+\node [anchor=north west,inner sep=2pt,fill=blue!20,minimum height=1.6em,minimum width=7.45em] (t33) at ([yshift=-0.8em]t32.south west) {$\phi$ \quad \; \ };
 \node [anchor=north west,inner sep=1pt,fill=black] (tl32) at (t32.north west) {\tiny{{\color{white} \textbf{3-5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl33) at (t33.north west) {\tiny{{\color{white} \textbf{3-4}}}};
 }
-\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.5em,minimum width=3em] (t41) at ([yshift=-1em]s4.south) {$\phi$};
+\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.6em,minimum width=3em] (t41) at ([yshift=-1em]s4.south) {$\phi$};
-\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.5em,minimum width=3em] (t42) at ([yshift=-0.2em]t41.south) {show};
+\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.6em,minimum width=3em] (t42) at ([yshift=-0.8em]t41.south) {show};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl41) at (t41.north west) {\tiny{{\color{white} \textbf{4}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl42) at (t42.north west) {\tiny{{\color{white} \textbf{4}}}};
 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=9.00em] (t43) at ([xshift=1.75em]t24.east) {satisfied};
+\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=9.00em] (t43) at ([xshift=1.75em]t24.east) {satisfied};
-\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=9.00em] (t44) at ([yshift=-0.2em]t43.south west) {satisfactory};
+\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=9.00em] (t44) at ([yshift=-0.8em]t43.south west) {satisfactory};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl43) at (t43.north west) {\tiny{{\color{white} \textbf{4-5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl44) at (t44.north west) {\tiny{{\color{white} \textbf{4-5}}}};
 }
-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t51) at ([yshift=-1em]s5.south) {satisfy};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.6em,minimum width=4.5em] (t51) at ([yshift=-1em]s5.south) {satisfy};
-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t52) at ([yshift=-0.2em]t51.south) {satisfied};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.6em,minimum width=4.5em] (t52) at ([yshift=-0.8em]t51.south) {satisfied};
-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t53) at ([yshift=-0.2em]t52.south) {satisfies};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.6em,minimum width=4.5em] (t53) at ([yshift=-0.8em]t52.south) {satisfies};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl51) at (t51.north west) {\tiny{{\color{white} \textbf{5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl52) at (t52.north west) {\tiny{{\color{white} \textbf{5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl53) at (t53.north west) {\tiny{{\color{white} \textbf{5}}}};
@@ -80,38 +80,38 @@
 {\tiny
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt11) at (t11.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt11) at (t11.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt12) at (t12.east) {{\color{white} \textbf{$\funp{P}$=.2}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt12) at (t12.south) {{\color{white} \textbf{$\funp{P}$=0.2}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt13) at (t13.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt13) at (t13.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt14) at (t14.east) {{\color{white} \textbf{$\funp{P}$=.1}}};
+\node [anchor=north,inner sep=1pt,minimum width=10.9em,fill=black] (pt14) at (t14.south) {{\color{white} \textbf{$\funp{P}$=0.1}\quad \quad}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt15) at (t15.east) {{\color{white} \textbf{$\funp{P}$=.2}}};
+\node [anchor=north,inner sep=1pt,minimum width=10.9em,fill=black] (pt15) at (t15.south) {{\color{white} \textbf{$\funp{P}$=0.2}\quad \quad}};
 }
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt21) at (t21.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt21) at (t21.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt22) at (t22.east) {{\color{white} \textbf{$\funp{P}$=.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt22) at (t22.south) {{\color{white} \textbf{$\funp{P}$=0.3}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt23) at (t23.east) {{\color{white} \textbf{$\funp{P}$=.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt23) at (t23.south) {{\color{white} \textbf{$\funp{P}$=0.3}}};
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt24) at (t24.east) {{\color{white} \textbf{$\funp{P}$=.2}}};
+\node [anchor=north,inner sep=1pt,minimum width=10.9em,fill=black] (pt24) at (t24.south) {{\color{white} \textbf{$\funp{P}$=0.2}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt25) at (t25.east) {{\color{white} \textbf{$\funp{P}$=.1}}};
+\node [anchor=north,inner sep=1pt,minimum width=10.9em,fill=black] (pt25) at (t25.south) {{\color{white} \textbf{$\funp{P}$=0.1}}};
 }
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt31) at (t31.east) {{\color{white} \textbf{$\funp{P}$=1}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt31) at (t31.south) {{\color{white} \textbf{$\funp{P}$=1}}};
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt33) at (t32.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=22.2em,fill=black] (pt33) at (t32.south) {{\color{white} \textbf{\quad $\funp{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt33) at (t33.east) {{\color{white} \textbf{$\funp{P}$=.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=12.4em,fill=black] (pt33) at (t33.south) {{\color{white} \textbf{$\funp{P}$=0.3}\quad \quad \quad \;}};
 }
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt41) at (t41.east) {{\color{white} \textbf{$\funp{P}$=.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=5em,fill=black] (pt41) at (t41.south) {{\color{white} \textbf{$\funp{P}$=0.5}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt42) at (t42.east) {{\color{white} \textbf{$\funp{P}$=.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=5em,fill=black] (pt42) at (t42.south) {{\color{white} \textbf{$\funp{P}$=0.5}}};
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt43) at (t43.east) {{\color{white} \textbf{$\funp{P}$=.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=15em,fill=black] (pt43) at (t43.south) {{\color{white} \textbf{$\funp{P}$=0.3}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt44) at (t44.east) {{\color{white} \textbf{$\funp{P}$=.2}}};
+\node [anchor=north,inner sep=1pt,minimum width=15em,fill=black] (pt44) at (t44.south) {{\color{white} \textbf{$\funp{P}$=0.2}}};
 }
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt51) at (t51.east) {{\color{white} \textbf{$\funp{P}$=.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt51) at (t51.south) {{\color{white} \textbf{$\funp{P}$=0.5}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt52) at (t52.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt52) at (t52.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt53) at (t53.east) {{\color{white} \textbf{$\funp{P}$=.1}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt53) at (t53.south) {{\color{white} \textbf{$\funp{P}$=0.1}}};
 }
@@ -129,13 +129,13 @@
 \draw[decorate,thick,decoration={brace,amplitude=5pt,mirror}] ([yshift=0em,xshift=-0.5em]t11.north west) -- ([xshift=-0.5em]t13.south west) node [pos=0.5,left,xshift=1.0em,yshift=0.0em,text width=5em,align=left] (label2) {\footnotesize{\textbf{单词翻译}}};
 {
-\draw [->,ultra thick,red,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.5em]t13.west) -- ([xshift=0.8em,yshift=-0.5em]t13.east) -- ([xshift=-0.2em,yshift=-0.5em]t22.west) -- ([xshift=0.8em,yshift=-0.5em]t22.east) -- ([xshift=-0.2em,yshift=-0.5em]t31.west) -- ([xshift=0.8em,yshift=-0.5em]t31.east) -- ([xshift=-0.2em,yshift=-0.5em]t41.west) -- ([xshift=0.8em,yshift=-0.5em]t41.east) -- ([xshift=-0.2em,yshift=-0.5em]t52.west) -- ([xshift=1.2em,yshift=-0.5em]t52.east);
+\draw [->,ultra thick,red,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.62em]t13.west) -- ([xshift=0.8em,yshift=-0.62em]t13.east) -- ([xshift=-0.2em,yshift=-0.62em]t22.west) -- ([xshift=0.8em,yshift=-0.62em]t22.east) -- ([xshift=-0.2em,yshift=-0.62em]t31.west) -- ([xshift=0.8em,yshift=-0.62em]t31.east) -- ([xshift=-0.2em,yshift=-0.62em]t41.west) -- ([xshift=0.8em,yshift=-0.62em]t41.east) -- ([xshift=-0.2em,yshift=-0.62em]t52.west) -- ([xshift=1.2em,yshift=-0.62em]t52.east);
 }
 {
 \draw [->,ultra thick,ublue,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.5em]t15.west) -- ([xshift=0.8em,yshift=-0.5em]t15.east) -- ([xshift=-0.2em,yshift=-0.5em]t32.west) -- ([xshift=1.2em,yshift=-0.5em]t32.east);
-\draw [->,ultra thick,ublue,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.4em]t13.west) -- ([xshift=0.8em,yshift=-0.4em]t13.east) -- ([xshift=-0.2em,yshift=-0.4em]t25.west) -- ([xshift=0.8em,yshift=-0.4em]t25.east) -- ([xshift=-0.2em,yshift=-0.4em]t41.west) -- ([xshift=0.8em,yshift=-0.4em]t41.east) -- ([xshift=-0.2em,yshift=-0.4em]t52.west) -- ([xshift=1.2em,yshift=-0.4em]t52.east);
+\draw [->,ultra thick,ublue,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.42em]t13.west) -- ([xshift=0.8em,yshift=-0.42em]t13.east) -- ([xshift=-0.2em,yshift=-0.5em]t25.west) -- ([xshift=0.8em,yshift=-0.5em]t25.east) -- ([xshift=-0.2em,yshift=-0.42em]t41.west) -- ([xshift=0.8em,yshift=-0.42em]t41.east) -- ([xshift=-0.2em,yshift=-0.42em]t52.west) -- ([xshift=1.2em,yshift=-0.42em]t52.east);
 }
 {
@@ -143,12 +143,12 @@
 }
 {
-\node [anchor=north west] (wtranslabel) at ([yshift=-4em]t15.south west) {\scriptsize{翻译路径（仅包含单词）}};
+\node [anchor=north west] (wtranslabel) at ([yshift=-5em]t15.south west) {\scriptsize{翻译路径（仅包含单词）}};
 \draw [->,ultra thick,red,line width=1.5pt,opacity=0.7] ([xshift=0.2em]wtranslabel.east) -- ([xshift=1.2em]wtranslabel.east);
 }
 {
-\node [anchor=north west] (ptranslabel) at ([yshift=-5.5em]t15.south west) {\scriptsize{翻译路径（含有短语）}};
+\node [anchor=north west] (ptranslabel) at ([yshift=-6.5em]t15.south west) {\scriptsize{翻译路径（含有短语）}};
 \draw [->,ultra thick,ublue,line width=1.5pt,opacity=0.7] ([xshift=0.95em]ptranslabel.east) -- ([xshift=1.95em]ptranslabel.east);
 }

--- a/Chapter7/Figures/figure-word-translation-regard-as-path.tex
+++ b/Chapter7/Figures/figure-word-translation-regard-as-path.tex
@@ -20,15 +20,15 @@
 {\small
 \node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t11) at ([yshift=-1em]s1.south) {I};
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t12) at ([yshift=-0.2em]t11.south) {me};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t12) at ([yshift=-0.8em]t11.south) {me};
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t13) at ([yshift=-0.2em]t12.south) {I'm};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t13) at ([yshift=-0.8em]t12.south) {I'm};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl11) at (t11.north west) {\tiny{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl12) at (t12.north west) {\tiny{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl13) at (t13.north west) {\tiny{{\color{white} \textbf{1}}}};
 \node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t21) at ([yshift=-1em]s2.south) {to};
-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t22) at ([yshift=-0.2em]t21.south) {with};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t22) at ([yshift=-0.8em]t21.south) {with};
-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t23) at ([yshift=-0.2em]t22.south) {for};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t23) at ([yshift=-0.8em]t22.south) {for};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl21) at (t21.north west) {\tiny{{\color{white} \textbf{2}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl22) at (t22.north west) {\tiny{{\color{white} \textbf{2}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl23) at (t23.north west) {\tiny{{\color{white} \textbf{2}}}};
@@ -37,13 +37,13 @@
 \node [anchor=north west,inner sep=1pt,fill=black] (tl31) at (t31.north west) {\tiny{{\color{white} \textbf{3}}}};
 \node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.5em,minimum width=3em] (t41) at ([yshift=-1em]s4.south) {$\phi$};
-\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.5em,minimum width=3em] (t42) at ([yshift=-0.2em]t41.south) {show};
+\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.5em,minimum width=3em] (t42) at ([yshift=-0.8em]t41.south) {show};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl41) at (t41.north west) {\tiny{{\color{white} \textbf{4}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl42) at (t42.north west) {\tiny{{\color{white} \textbf{4}}}};
 \node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t51) at ([yshift=-1em]s5.south) {satisfy};
-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t52) at ([yshift=-0.2em]t51.south) {satisfied};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t52) at ([yshift=-0.8em]t51.south) {satisfied};
-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t53) at ([yshift=-0.2em]t52.south) {satisfies};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t53) at ([yshift=-0.8em]t52.south) {satisfies};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl51) at (t51.north west) {\tiny{{\color{white} \textbf{5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl52) at (t52.north west) {\tiny{{\color{white} \textbf{5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl53) at (t53.north west) {\tiny{{\color{white} \textbf{5}}}};
@@ -52,22 +52,22 @@
 {\tiny
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt11) at (t11.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt11) at (t11.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt12) at (t12.east) {{\color{white} \textbf{$\funp{P}$=.2}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt12) at (t12.south) {{\color{white} \textbf{$\funp{P}$=0.2}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt13) at (t13.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt13) at (t13.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt21) at (t21.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt21) at (t21.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt22) at (t22.east) {{\color{white} \textbf{$\funp{P}$=.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt22) at (t22.south) {{\color{white} \textbf{$\funp{P}$=0.3}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt23) at (t23.east) {{\color{white} \textbf{$\funp{P}$=.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt23) at (t23.south) {{\color{white} \textbf{$\funp{P}$=0.3}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt31) at (t31.east) {{\color{white} \textbf{$\funp{P}$=1}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt31) at (t31.south) {{\color{white} \textbf{$\funp{P}$=1}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt41) at (t41.east) {{\color{white} \textbf{$\funp{P}$=.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=5em,fill=black] (pt41) at (t41.south) {{\color{white} \textbf{$\funp{P}$=0.5}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt42) at (t42.east) {{\color{white} \textbf{$\funp{P}$=.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=5em,fill=black] (pt42) at (t42.south) {{\color{white} \textbf{$\funp{P}$=0.5}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt51) at (t51.east) {{\color{white} \textbf{$\funp{P}$=.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt51) at (t51.south) {{\color{white} \textbf{$\funp{P}$=0.5}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt52) at (t52.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt52) at (t52.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt53) at (t53.east) {{\color{white} \textbf{$\funp{P}$=.1}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt53) at (t53.south) {{\color{white} \textbf{$\funp{P}$=0.1}}};
 }

--- a/Chapter7/chapter7.tex
+++ b/Chapter7/chapter7.tex
@@ -206,7 +206,7 @@ p_4 &=& \text{问题}\nonumber
 \parinterval 图\ref{fig:7-10}给出了一个由三个双语短语$\{(\bar{s}_{\bar{a}_1},\bar{t}_1),(\bar{s}_{\bar{a}_2},\bar{t}_2),(\bar{s}_{\bar{a}_3},\bar{t}_3)\}$ 构成的汉英互译句对，其中短语对齐信息为$\bar{a}_1 = 1$，$\bar{a}_2 = 2$，$\bar{a}_3 = 3$。这里，可以把这三个短语对的组合看作是翻译推导，形式化表示为如下公式：
 \begin{eqnarray}
-d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \circ {(\bar{s}_{\bar{a}_3},\bar{t}_3)}
+d & = & {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \circ {(\bar{s}_{\bar{a}_3},\bar{t}_3)}
 \label{eq:7-1}
 \end{eqnarray}
@@ -245,7 +245,7 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c
 \parinterval 对于统计机器翻译，其目的是找到输入句子的可能性最大的译文：
 \begin{eqnarray}
-\hat{\seq{t}} = \argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})
+\hat{\seq{t}} & = & \argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})
 \label{eq:7-2}
 \end{eqnarray}
@@ -261,7 +261,7 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c
 \parinterval 基于短语的翻译模型假设$\seq{s}$到$\seq{t}$的翻译可以用翻译推导进行描述，这些翻译推导都是由双语短语组成。于是，两个句子之间的映射就可以被看作是一个个短语的映射。显然短语翻译的建模要比整个句子翻译的建模简单得多。从模型上看，可以把翻译推导$d$当作是从$\seq{s}$到$\seq{t}$翻译的一种隐含结构。这种结构定义了对问题的一种描述，即翻译由一系列短语组成。根据这个假设，可以把句子的翻译概率定义为：
 \begin{eqnarray}
-\funp{P}(\seq{t}|\seq{s}) = \sum_{d} \funp{P}(d,\seq{t}|\seq{s})
+\funp{P}(\seq{t}|\seq{s}) & = & \sum_{d} \funp{P}(d,\seq{t}|\seq{s})
 \label{eq:7-3}
 \end{eqnarray}
@@ -282,25 +282,25 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c
 \parinterval 另一种常用的方法是直接用$\funp{P}(d,\seq{t}|\seq{s})$的最大值代表整个翻译推导的概率和。这种方法假设翻译概率是非常尖锐的，“最好”的推导会占有概率的主要部分。它被形式化为：
 \begin{eqnarray}
-\funp{P}(\seq{t}|\seq{s}) \approx \max \funp{P}(d,\seq{t}|\seq{s})
+\funp{P}(\seq{t}|\seq{s}) & \approx & \max \funp{P}(d,\seq{t}|\seq{s})
 \label{eq:7-6}
 \end{eqnarray}
 \parinterval 于是，翻译的目标可以被重新定义：
 \begin{eqnarray}
-\hat{\seq{t}} = \arg\max_{\seq{t}} (\max \funp{P}(d,\seq{t}|\seq{s}))
+\hat{\seq{t}} & = & \arg\max_{\seq{t}} (\max \funp{P}(d,\seq{t}|\seq{s}))
 \label{eq:7-7}
 \end{eqnarray}
 \parinterval 值得注意的是，翻译推导中蕴含着译文的信息，因此每个翻译推导都与一个译文对应。因此可以把公式\eqref{eq:7-7}所描述的问题重新定义为：
 \begin{eqnarray}
-\hat{d} = \arg\max_{d} \funp{P}(d,\seq{t}|\seq{s})
+\hat{d} & = & \arg\max_{d} \funp{P}(d,\seq{t}|\seq{s})
 \label{eq:7-8}
 \end{eqnarray}
 \parinterval 也就是，给定一个输入句子$\seq{s}$，找到从它出发的最优翻译推导$\hat{d}$，把这个翻译推导所对应的目标语词串看作最优的译文。假设函数$t(\cdot)$可以返回一个推导的目标语词串，则最优译文也可以被看作是：
 \begin{eqnarray}
-\hat{\seq{t}} = t(\hat{d})
+\hat{\seq{t}} & = & t(\hat{d})
 \label{eq:7-9}
 \end{eqnarray}
@@ -474,7 +474,7 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c
 \parinterval 抽取双语短语之后，需要对每个双语短语的质量进行评价。这样，在使用这些双语短语时，可以更有效地估计整个句子翻译的好坏。在统计机器翻译中，一般用双语短语出现的可能性大小来度量双语短语的好坏。这里，使用相对频次估计对短语的翻译条件概率进行计算，公式如下：
 \begin{eqnarray}
-\funp{P}(\bar{t}|\bar{s}) = \frac{c(\bar{s},\bar{t})}{c(\bar{s})}
+\funp{P}(\bar{t}|\bar{s}) & = & \frac{c(\bar{s},\bar{t})}{c(\bar{s})}
 \label{eq:7-13}
 \end{eqnarray}
@@ -482,7 +482,7 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c
 \parinterval 当遇到低频短语时，短语翻译概率的估计可能会不准确。例如，短语$\bar{s}$和$\bar{t}$在语料中只出现了一次，且在一个句子中共现，那么$\bar{s}$到$\bar{t}$的翻译概率为$\funp{P}(\bar{t}|\bar{s})=1$，这显然是不合理的，因为$\bar{s}$和$\bar{t}$的出现完全可能是偶然事件。既然直接度量双语短语的好坏会面临数据稀疏问题，一个自然的想法就是把短语拆解成单词，利用双语短语中单词翻译的好坏间接度量双语短语的好坏。为了达到这个目的，可以使用{\small\bfnew{词汇化翻译概率}}\index{词汇化翻译概率}（Lexical Translation Probability）\index{Lexical Translation Probability}。前面借助词对齐信息完成了双语短语的抽取，因此，词对齐信息本身就包含了短语内部单词之间的对应关系。因此同样可以借助词对齐来计算词汇翻译概率，公式如下：
 \begin{eqnarray}
-\funp{P}_{\textrm{lex}}(\bar{t}|\bar{s}) = \prod_{j=1}^{|\bar{s}|} \frac{1}{|\{j|a(j,i) = 1\}|} \sum_{\forall(j,i):a(j,i) = 1} \sigma (t_i|s_j)
+\funp{P}_{\textrm{lex}}(\bar{t}|\bar{s}) & = & \prod_{j=1}^{|\bar{s}|} \frac{1}{|\{j|a(j,i) = 1\}|} \sum_{\forall(j,i):a(j,i) = 1} \sigma (t_i|s_j)
 \label{eq:7-14}
 \end{eqnarray}
@@ -541,7 +541,7 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c
 \parinterval 基于距离的调序方法的核心思想就是度量当前翻译结果与顺序翻译之间的差距。对于译文中的第$i$个短语，令$\rm{start}_i$表示它所对应的源语言短语中第一个词所在的位置，$\rm{end}_i$表示它所对应的源语言短语中最后一个词所在的位置。于是，这个短语（相对于前一个短语）的调序距离为：
 \begin{eqnarray}
-dr = {\rm{start}}_i-{\rm{end}}_{i-1}-1
+dr & = & {\rm{start}}_i-{\rm{end}}_{i-1}-1
 \label{eq:7-15}
 \end{eqnarray}
@@ -579,7 +579,7 @@ dr = {\rm{start}}_i-{\rm{end}}_{i-1}-1
 \parinterval 对于每种调序类型，都可以定义一个调序概率，如下：
 \begin{eqnarray}
-\funp{P}(\seq{o}|\seq{s},\seq{t},\seq{a}) = \prod_{i=1}^{K} \funp{P}(o_i| \bar{s}_{a_i}, \bar{t}_i, a_{i-1}, a_i)
+\funp{P}(\seq{o}|\seq{s},\seq{t},\seq{a}) & = & \prod_{i=1}^{K} \funp{P}(o_i| \bar{s}_{a_i}, \bar{t}_i, a_{i-1}, a_i)
 \label{eq:7-16}
 \end{eqnarray}
@@ -654,13 +654,13 @@ dr = {\rm{start}}_i-{\rm{end}}_{i-1}-1
 \parinterval 这里介绍一种更加高效的特征权重调优方法$\ \dash \ ${\small\bfnew{最小错误率训练}}\index{最小错误率训练}（Minimum Error Rate Training\index{Minimum Error Rate Training}，MERT）。最小错误率训练是统计机器翻译发展中代表性工作，也是机器翻译领域原创的重要技术方法之一\upcite{DBLP:conf/acl/Och03}。最小错误率训练假设：翻译结果相对于标准答案的错误是可度量的，进而可以通过降低错误数量的方式来找到最优的特征权重。假设有样本集合$S = \{(s_1,\seq{r}_1),...,(s_N,\seq{r}_N)\}$，$s_i$为样本中第$i$个源语言句子，$\seq{r}_i$为相应的参考译文。注意，$\seq{r}_i$ 可以包含多个参考译文。$S$通常被称为{\small\bfnew{调优集合}}\index{调优集合}（Tuning Set）\index{Tuning Set}。对于$S$中的每个源语句子$s_i$，机器翻译模型会解码出$n$-best推导$\hat{\seq{d}}_{i} = \{\hat{d}_{ij}\}$，其中$\hat{d}_{ij}$表示对于源语言句子$s_i$得到的第$j$个最好的推导。$\{\hat{d}_{ij}\}$可以被定义如下：
 \begin{eqnarray}
-\{\hat{d}_{ij}\} = \arg\max_{\{d_{ij}\}} \sum_{i=1}^{M} \lambda_i \cdot h_i (d,\seq{t},\seq{s})
+\{\hat{d}_{ij}\} & = & \arg\max_{\{d_{ij}\}} \sum_{i=1}^{M} \lambda_i \cdot h_i (d,\seq{t},\seq{s})
 \label{eq:7-17}
 \end{eqnarray}
 \parinterval 对于每个样本都可以得到$n$-best推导集合，整个数据集上的推导集合被记为$\hat{\seq{D}} = \{\hat{\seq{d}}_{1},...,\hat{\seq{d}}_{s}\}$。进一步，令所有样本的参考译文集合为$\seq{R} = \{\seq{r}_1,...,\seq{r}_N\}$。最小错误率训练的目标就是降低$\hat{\seq{D}}$相对于$\seq{R}$的错误。也就是，通过调整不同特征的权重$\lambda = \{ \lambda_i \}$，让错误率最小，形式化描述为：
 \begin{eqnarray}
-\hat{\lambda} = \arg\min_{\lambda} \textrm{Error}(\hat{\seq{D}},\seq{R})
+\hat{\lambda} & = & \arg\min_{\lambda} \textrm{Error}(\hat{\seq{D}},\seq{R})
 \label{eq:7-18}
 \end{eqnarray}
 %公式--------------------------------------------------------------------

--- a/Chapter8/Figures/figure-different-representations-of-syntax-tree.tex
+++ b/Chapter8/Figures/figure-different-representations-of-syntax-tree.tex
@@ -16,6 +16,6 @@
 \node [anchor=north west] (cap1) at (-1.5em,-1in) {{(a) 树状表示}};
 \node [anchor=west] (cap2) at ([xshift=0.5in]cap1.east) {{(b) 序列表示（缩进）}};
-\node [anchor=west] (cap3) at ([xshift=0.5in]cap2.east) {{(c) 序列表示}};
+\node [anchor=west] (cap3) at ([xshift=0.3in]cap2.east) {{(c) 序列表示}};
 }
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter8/Figures/figure-example-of-cky-algorithm-execution.tex
+++ b/Chapter8/Figures/figure-example-of-cky-algorithm-execution.tex
@@ -35,7 +35,7 @@
 \node [anchor=north] (l3) at ([yshift=-1em]cell53.south) {\tiny{$l$=3}};
 \node [anchor=north] (l4) at ([yshift=-1em]cell54.south) {\tiny{$l$=4}};
 \node [anchor=north] (l5) at ([yshift=-1em]cell55.south) {\tiny{$l$=5}};
-\node [anchor=north] (caption1) at ([xshift=0.0em,yshift=0.0em]l5.south) {(a)};
+\node [anchor=north] (caption1) at ([xshift=0.0em,yshift=0.0em]l5.south) {\small{(a)}};
 \node [anchor=center] (y1) at ([xshift=-2.1em,yshift=2em]cell11.center) {\tiny{\blue 0}};
 \node [anchor=center] (y2) at ([xshift=-2.1em,yshift=2em]cell21.center) {\tiny{\blue 1}};
@@ -88,7 +88,7 @@
 \node [anchor=north] (l3) at ([yshift=-1em]cell53.south) {\tiny{$l$=3}};
 \node [anchor=north] (l4) at ([yshift=-1em]cell54.south) {\tiny{$l$=4}};
 \node [anchor=north] (l5) at ([yshift=-1em]cell55.south) {\tiny{$l$=5}};
-\node [anchor=north] (caption2) at ([xshift=0.0em,yshift=0.0em]l5.south) {(b)};
+\node [anchor=north] (caption2) at ([xshift=0.0em,yshift=0.0em]l5.south) {\small{(b)}};
 \node [anchor=center] (y1) at ([xshift=-2.1em,yshift=2em]cell11.center) {\tiny{\blue 0}};
 \node [anchor=center] (y2) at ([xshift=-2.1em,yshift=2em]cell21.center) {\tiny{\blue 1}};
@@ -170,7 +170,7 @@
 \node [anchor=north] (l3) at ([yshift=-1em]cell53.south) {\tiny{$l$=3}};
 \node [anchor=north] (l4) at ([yshift=-1em]cell54.south) {\tiny{$l$=4}};
 \node [anchor=north] (l5) at ([yshift=-1em]cell55.south) {\tiny{$l$=5}};
-\node [anchor=north] (caption3) at ([xshift=0.0em,yshift=0.0em]l5.south) {(c)};
+\node [anchor=north] (caption3) at ([xshift=0.0em,yshift=0.0em]l5.south) {\small{(c)}};
 \node [anchor=center] (y1) at ([xshift=-2.1em,yshift=2em]cell11.center) {\tiny{\blue 0}};
 \node [anchor=center] (y2) at ([xshift=-2.1em,yshift=2em]cell21.center) {\tiny{\blue 1}};
@@ -267,7 +267,7 @@
 \node [anchor=north] (l3) at ([yshift=-1em]cell53.south) {\tiny{$l$=3}};
 \node [anchor=north] (l4) at ([yshift=-1em]cell54.south) {\tiny{$l$=4}};
 \node [anchor=north] (l5) at ([yshift=-1em]cell55.south) {\tiny{$l$=5}};
-\node [anchor=north] (caption4) at ([xshift=0.0em,yshift=0.0em]l5.south) {(d)};
+\node [anchor=north] (caption4) at ([xshift=0.0em,yshift=0.0em]l5.south) {\small{(d)}};
 \node [anchor=center] (y1) at ([xshift=-2.1em,yshift=2em]cell11.center) {\tiny{\blue 0}};
 \node [anchor=center] (y2) at ([xshift=-2.1em,yshift=2em]cell21.center) {\tiny{\blue 1}};

--- a/Chapter8/Figures/figure-execution-of-cube-pruning.tex
+++ b/Chapter8/Figures/figure-execution-of-cube-pruning.tex
@@ -40,7 +40,7 @@
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=-1.0em,yshift=-0.7em]alig4.south west);
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=0.8em,yshift=1.0em]alig13.north east);
-\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\scriptsize{(a)}};
+\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\small{(a)}};
 \end{scope}
 %图2
@@ -87,7 +87,7 @@
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=-1.0em,yshift=-0.7em]alig4.south west);
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=0.8em,yshift=1.0em]alig13.north east);
-\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\scriptsize{(b)}};
+\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\small{(b)}};
 \end{scope}
 %图3
@@ -137,7 +137,7 @@
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=-1.0em,yshift=-0.7em]alig4.south west);
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=0.8em,yshift=1.0em]alig13.north east);
-\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\scriptsize{(c)}};
+\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\small{(c)}};
 \end{scope}
@@ -194,7 +194,7 @@
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=-1.0em,yshift=-0.7em]alig4.south west);
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=0.8em,yshift=1.0em]alig13.north east);
-\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\scriptsize{(d)}};
+\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\small{(d)}};
 \end{scope}

--- a/Chapter8/Figures/figure-structure-of-chart.tex
+++ b/Chapter8/Figures/figure-structure-of-chart.tex
@@ -4,7 +4,7 @@
 \begin{tikzpicture}
 \begin{scope}
-\node [anchor=south west,draw,fill=ugreen!20,minimum width=2.8em,minimum height=2.8em,inner sep=1pt] (cell11) at (0,0) {\scriptsize{cell[1,2]}};
+\node [anchor=south west,draw,fill=green!20,minimum width=2.8em,minimum height=2.8em,inner sep=1pt] (cell11) at (0,0) {\scriptsize{cell[1,2]}};
 \node [anchor=south west,draw,fill=red!20,minimum width=2.8em,minimum height=2.8em,inner sep=1pt] (cell12) at (cell11.south east) {\scriptsize{cell[0,2]}};
 \node [anchor=south west,draw,fill=orange!30,minimum width=2.8em,minimum height=2.8em,inner sep=1pt] (cell21) at (cell11.north west) {\scriptsize{cell[0,1]}};
 \node [anchor=south west,draw,fill=gray!20,minimum width=2.8em,minimum height=2.8em,inner sep=1pt] (cell22) at (cell21.south east) {\scriptsize{N/A}};
@@ -12,7 +12,7 @@
 \draw [->,thick] ([xshift=-1em,yshift=1em]cell21.north west)--([xshift=1em,yshift=1em]cell22.north east);
 \node [anchor=north west,fill=orange!30,draw,drop shadow,align=left,minimum width=4em] (cell11label) at ([xshift=4em,yshift=1em]cell22.north east) {\footnotesize{VV[0,1]}};
-\node [anchor=north west,fill=ugreen!20,draw,drop shadow,align=left,minimum width=4em] (cell12label) at ([yshift=-1em]cell11label.south west) {\footnotesize{NN[1,2]}\\\footnotesize{NP[1,2]}};
+\node [anchor=north west,fill=green!20,draw,drop shadow,align=left,minimum width=4em] (cell12label) at ([yshift=-1em]cell11label.south west) {\footnotesize{NN[1,2]}\\\footnotesize{NP[1,2]}};
 \node [anchor=north west,fill=red!20,draw,drop shadow,align=left,minimum width=4em] (cell21label) at ([yshift=-1em]cell12label.south west) {\footnotesize{VP[0,2]}\\\footnotesize{NP[0,2]}};
 \draw [->,very thick,dotted] ([yshift=0.3em]cell11label.west) .. controls +(west:2em)  and +(north:1.5em) .. ([xshift=1em,yshift=-0.5em]cell21.north);

--- a/Chapter8/chapter8.tex
+++ b/Chapter8/chapter8.tex
@@ -266,7 +266,7 @@ r_4:\quad \funp{X}\ &\to\ &\langle \ \text{了},\quad \textrm{have}\ \rangle \no
 \noindent 其中，每使用一次规则就会同步替换源语言和目标语言符号串中的一个非终结符，替换结果用红色表示。通常，可以把上面这个过程称作翻译推导，记为：
 \begin{eqnarray}
-d = {r_1} \circ {r_2} \circ {r_3} \circ {r_4}
+d & = & {r_1} \circ {r_2} \circ {r_3} \circ {r_4}
 \label{eq:8-1}
 \end{eqnarray}
@@ -402,19 +402,19 @@ y&=&\beta_0 y_{\pi_1} \beta_1 y_{\pi_2} ... \beta_{m-1} y_{\pi_m} \beta_m
 \parinterval 这些特征可以被具体描述为：
 \begin{eqnarray}
-h_i (d,\seq{t},\seq{s})=\sum_{r \in d}h_i (r)
+h_i (d,\seq{t},\seq{s}) & = & \sum_{r \in d}h_i (r)
 \label{eq:8-4}
 \end{eqnarray}
 \parinterval 公式\eqref{eq:8-4}中，$r$表示推导$d$中的一条规则，$h_i (r)$表示规则$r$上的第$i$个特征。可以看出，推导$d$的特征值就是所有包含在$d$中规则的特征值的和。进一步，可以定义
 \begin{eqnarray}
-\textrm{rscore}(d,\seq{t},\seq{s})=\sum_{i=1}^7 \lambda_i \cdot h_i (d,\seq{t},\seq{s})
+\textrm{rscore}(d,\seq{t},\seq{s}) & = & \sum_{i=1}^7 \lambda_i \cdot h_i (d,\seq{t},\seq{s})
 \label{eq:8-5}
 \end{eqnarray}
 \parinterval 最终，模型得分被定义为：
 \begin{eqnarray}
-\textrm{score}(d,\seq{t},\seq{s})=\textrm{rscore}(d,\seq{t},\seq{s})+ \lambda_8 \textrm{log}⁡(\funp{P}_{\textrm{lm}}(\seq{t}))+\lambda_9 \mid \seq{t} \mid
+\textrm{score}(d,\seq{t},\seq{s}) & = & \textrm{rscore}(d,\seq{t},\seq{s})+ \lambda_8 \textrm{log}⁡(\funp{P}_{\textrm{lm}}(\seq{t}))+\lambda_9 \mid \seq{t} \mid
 \label{eq:8-6}
 \end{eqnarray}
@@ -438,14 +438,14 @@ h_i (d,\seq{t},\seq{s})=\sum_{r \in d}h_i (r)
 \parinterval 层次短语模型解码的目标是找到模型得分最高的推导，即：
 \begin{eqnarray}
-\hat{d} = \argmax_{d}\ \textrm{score}(d,\seq{t},\seq{s})
+\hat{d} & = & \argmax_{d}\ \textrm{score}(d,\seq{t},\seq{s})
 \label{eq:8-7}
 \end{eqnarray}
 \noindent 这里，$\hat{d}$的目标语部分即最佳译文$\hat{\seq{t}}$。令函数$t(\cdot)$返回翻译推导的目标语词串，于是有：
 \begin{eqnarray}
-\hat{\seq{t}}=t(\hat{d})
+\hat{\seq{t}} & = & t(\hat{d})
 \label{eq:8-8}
 \end{eqnarray}
@@ -1408,15 +1408,15 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \parinterval 从句法分析的角度看，超图最大程度地复用了局部的分析结果，使得分析可以“结构化”。比如，有两个推导：
 \begin{eqnarray}
-d_1 = {r_1} \circ {r_2} \circ {r_3} \circ {r_4} \label{eqa4.30}\\
+d_1 & = & {r_1} \circ {r_2} \circ {r_3} \circ {r_4} \label{eqa4.30}\\
-d_2 = {r_1} \circ {r_2} \circ {r_3} \circ {r_5}
+d_2 & = & {r_1} \circ {r_2} \circ {r_3} \circ {r_5}
 \label{eq:8-10}
 \end{eqnarray}
 \noindent 其中，$r_1 - r_5$分别表示不同的规则。${r_1} \circ {r_2} \circ {r_3}$是两个推导的公共部分。在超图表示中，${r_1} \circ {r_2} \circ {r_3}$可以对应一个子图，显然这个子图也是一个推导，记为${d'}= {r_1} \circ {r_2} \circ {r_3}$。这样，$d_1$和$d_2$不需要重复记录${r_1} \circ {r_2} \circ {r_3}$，重新写作：
 \begin{eqnarray}
-d_1 = {d'} \circ {r_4} \label{eqa4.32}\\
+d_1 & = & {d'} \circ {r_4} \label{eqa4.32}\\
-d_1 = {d'} \circ {r_5}
+d_1 & = & {d'} \circ {r_5}
 \label{eq:8-12}
 \end{eqnarray}
@@ -1458,7 +1458,7 @@ d_1 = {d'} \circ {r_5}
 \parinterval 解码的目标是找到得分score($d$)最高的推导$d$。这个过程通常被描述为：
 \begin{eqnarray}
-\hat{d} = \argmax_d\ \textrm{score} (d,\seq{s},\seq{t})
+\hat{d} & = & \argmax_d\ \textrm{score} (d,\seq{s},\seq{t})
 \label{eq:8-13}
 \end{eqnarray}

--- a/Chapter9/chapter9.tex
+++ b/Chapter9/chapter9.tex
@@ -69,13 +69,13 @@
 \parinterval 虽然第一代神经网络受到了打击，但是在20世纪80年代，第二代人工神经网络开始萌发新的生机。在这个发展阶段，生物属性已经不再是神经网络的唯一灵感来源，在{\small\bfnew{连接主义}}\index{连接主义}（Connectionism）\index{Connectionism}和分布式表示两种思潮的影响下，神经网络方法再次走入了人们的视线。
 \vspace{0.3em}
-\parinterval （1）符号主义与连接主义
+\parinterval 1）符号主义与连接主义
 \vspace{0.3em}
 \parinterval 人工智能领域始终存在着符号主义和连接主义之争。早期的人工智能研究在认知学中被称为{\small\bfnew{符号主义}}\index{符号主义}（Symbolicism）\index{Symbolicism}，符号主义认为人工智能源于数理逻辑，希望将世界万物的所有运转方式归纳成像文法一样符合逻辑规律的推导过程。符号主义的支持者们坚信基于物理符号系统（即符号操作系统）假设和有限合理性原理，就能通过逻辑推理来模拟智能。但被他们忽略的一点是，模拟智能的推理过程需要大量的先验知识支持，哪怕是在现代，生物学界也很难准确解释大脑中神经元的工作原理，因此也很难用符号系统刻画人脑逻辑。另一方面，连接主义则侧重于利用人工神经网络中神经元的连接去探索并模拟输入与输出之间存在的某种关系，这个过程不需要任何先验知识，其核心思想是“大量简单的计算单元连接到一起可以实现智能行为”，这种思想也推动了反向传播等多种神经网络方法的应用，并发展了包括长短时记忆模型在内的经典建模方法。2019年3月27日，ACM 正式宣布将图灵奖授予 Yoshua Bengio, Geoffrey Hinton 和 Yann LeCun，以表彰他们提出的概念和工作使得深度学习神经网络有了重大突破，这三位获奖人均是人工智能连接主义学派的主要代表，从这件事中也可以看出连接主义对当代人工智能和深度学习的巨大影响。
 \vspace{0.3em}
-\parinterval （2）分布式表示
+\parinterval 2）分布式表示
 \vspace{0.3em}
 \parinterval 分布式表示的主要思想是“一个复杂系统的任何部分的输入都应该是多个特征共同表示的结果”，这种思想在自然语言处理领域的影响尤其深刻，它改变了刻画语言世界的角度，将语言文字从离散空间映射到多维连续空间。例如，在现实世界中，“张三”这个代号就代表着一个人。如果想要知道这个人亲属都有谁，因为有“如果A和B姓氏相同且在同一个家谱中，那么A和B是本家”这个先验知识在，在知道代号“张三”的情况下，可以得知“张三”的亲属是谁。但是如果不依靠这个先验知识，就无法得知“张三”的亲属是谁。但在分布式表示中，可以用一个实数向量，如$ (0.1,0.3,0.4) $来表示“张三”这个人，这个人的所有特征信息都包含在这个实数向量中，通过在向量空间中的一些操作（如计算距离等），哪怕没有任何先验知识的存在，也完全可以找到这个人的所有亲属。在自然语言处理中，一个单词也用一个实数向量（词向量或词嵌入）表示，通过这种方式将语义空间重新刻画，将这个离散空间转化成了一个连续空间，这时单词就不再是一个简单的词条，而是由成百上千个特征共同描述出来的，其中每个特征分别代表这个词的某个“ 方面”。
@@ -516,7 +516,7 @@ l_p({\mathbi{x}}) & = & {\Vert{\mathbi{x}}\Vert}_p \nonumber \\
 \vspace{0.5em}
 \parinterval 感知机是人工神经元的一种实例，在上世纪50-60年代被提出后，对神经网络研究产生了深远的影响。感知机模型如图\ref {fig:9-5}所示，其输入是一个$n$维二值向量$ {\mathbi{x}}=(x_1,x_2,\dots,x_n) $，其中$ x_i=0 $或$ 1 $。权重${\mathbi{w}}=(w_1,w_2,\dots,w_n) $，每个输入变量对应一个权重$ w_i $。偏置$ b $是一个实数变量（$ -\sigma $）。输出也是一个二值结果，即$ y=0 $或$ 1 $。$ y $值的判定由输入的加权和是否大于（或小于）一个阈值$ \sigma $决定（公式\eqref{eq:9-19}）：
 \begin{eqnarray}
-y=\begin{cases} 0 & \sum_{i}{x_i\cdot w_i}-\sigma <0\\1 & \sum_{i}{x_i\cdot w_i}-\sigma \geqslant 0\end{cases}
+y&=&\begin{cases} 0 & \sum_{i}{x_i\cdot w_i}-\sigma <0\\1 & \sum_{i}{x_i\cdot w_i}-\sigma \geqslant 0\end{cases}
 \label{eq:9-19}
 \end{eqnarray}
@@ -675,7 +675,7 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
 \parinterval 举个简单的例子，预报天气时，往往需要预测温度、湿度和风力，这就意味着如果使用单层神经网络进行预测，需要设置3个神经元。如图\ref{fig:9-10}所示，此时权重矩阵如公式\eqref{eq:9-105}所示：
 \begin{eqnarray}
-{\mathbi{W}}=\begin{pmatrix} w_{11} & w_{12} & w_{13}\\ w_{21} & w_{22} & w_{23}\end{pmatrix}
+{\mathbi{W}}&=&\begin{pmatrix} w_{11} & w_{12} & w_{13}\\ w_{21} & w_{22} & w_{23}\end{pmatrix}
 \label{eq:9-105}
 \end{eqnarray}
@@ -710,7 +710,7 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
 \item 从几何角度看，公式中的${\mathbi{x}}\cdot {\mathbi{W}}+{\mathbi{b}}$将${\mathbi{x}}$右乘${\mathbi{W}}$相当于对$ {\mathbi{x}} $进行旋转变换。例如，对三个点$ (0,0) $，$ (0,1) $，$ (1,0) $及其围成的矩形区域右乘公式\eqref{eq:9-106}所示矩阵：
    \begin{eqnarray}
-    {\mathbi{W}}=\begin{pmatrix} 1 & 0 & 0\\ 0 & -1 & 0\\ 0 & 0 & 1\end{pmatrix}
+    {\mathbi{W}}&=&\begin{pmatrix} 1 & 0 & 0\\ 0 & -1 & 0\\ 0 & 0 & 1\end{pmatrix}
    \label{eq:9-106}
    \end{eqnarray}
@@ -945,7 +945,7 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
 \parinterval 将矩阵乘法扩展到高阶张量中：一个张量${\mathbi{x}}$若要与矩阵$ {\mathbi{W}}$做矩阵乘法，则$ {\mathbi{x}} $的最后一维度需要与${\mathbi{W}}$的行数大小相等，即：若张量${\mathbi{x}} $的形状为$ \cdot \times n $，${\mathbi{W}} $须为$ n\times \cdot $的矩阵。公式\eqref{eq:9-25}是一个例子:
 \begin{eqnarray}
-{\mathbi{x}}(1:4,1:4,{\red{1:4}})\times {{\mathbi{W}}({\red{1:4}},1:2)}={\mathbi{s}}(1:4,1:4,1:2)
+{\mathbi{x}}(1:4,1:4,{\red{1:4}})\;\;\times\;\; {{\mathbi{W}}({\red{1:4}},1:2)}&=&{\mathbi{s}}(1:4,1:4,1:2)
 \label{eq:9-25}
 \end{eqnarray}
@@ -984,7 +984,7 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
 \vspace{0.5em}
 \item 除了单位加之外，张量之间也可以使用减法操作、乘法操作。此外也可以对张量作激活操作，这里将其称作为函数的{\small\bfnew{向量化}}\index{向量化}（Vectorization）\index{Vectorization}。例如，对向量（1阶张量）作ReLU激活，公式\eqref{eq:9-26}为ReLU激活函数：
 \begin{eqnarray}
-f(x)=\begin{cases} 0 & x\le 0 \\x & x>0\end{cases}
+f(x)&=&\begin{cases} 0 & x\le 0 \\x & x>0\end{cases}
 \label{eq:9-26}
 \end{eqnarray}
 \vspace{-0.5em}
@@ -1215,7 +1215,7 @@ y&=&{\textrm{Sigmoid}}({\textrm{Tanh}}({\mathbi{x}}\cdot {\mathbi{W}}^{[1]}+{\ma
 %----------------------------------------------------------------------------------------
 \vspace{0.5em}
-\noindent {\small\sffamily\bfseries{（1）批量梯度下降\index{批量梯度下降}（Batch Gradient Descent）\index{Batch Gradient Descent}}}
+\noindent {\small\sffamily\bfseries{1）批量梯度下降\index{批量梯度下降}（Batch Gradient Descent）\index{Batch Gradient Descent}}}
 \vspace{0.5em}
 \parinterval 批量梯度下降是梯度下降方法中最原始的形式，这种梯度下降方法在每一次迭代时使用所有的样本进行参数更新。参数优化的目标函数如公式\eqref{eq:9-30}所示：
@@ -1233,7 +1233,7 @@ J({\bm \theta})&=&\frac{1}{n}\sum_{i=1}^{n}{L({\mathbi{x}}_i,\widetilde{\mathbi{
 %----------------------------------------------------------------------------------------
 \vspace{0.5em}
-\noindent {\small\sffamily\bfseries{（2）随机梯度下降\index{随机梯度下降}（Stochastic Gradient Descent）\index{Stochastic Gradient Descent}}}
+\noindent {\small\sffamily\bfseries{2）随机梯度下降\index{随机梯度下降}（Stochastic Gradient Descent）\index{Stochastic Gradient Descent}}}
 \vspace{0.5em}
 \parinterval 随机梯度下降（简称SGD）不同于批量梯度下降，每次迭代只使用一个样本对参数进行更新。SGD的目标函数如公式\eqref{eq:9-31}所示
@@ -1251,7 +1251,7 @@ J({\bm \theta})&=&L({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i;{\bm \theta})
 %----------------------------------------------------------------------------------------
 \vspace{0.5em}
-\noindent {\small\sffamily\bfseries{（3）小批量梯度下降\index{小批量梯度下降}（Mini-batch Gradient Descent）\index{Mini-batch Gradient Descent}}}
+\noindent {\small\sffamily\bfseries{3）小批量梯度下降\index{小批量梯度下降}（Mini-batch Gradient Descent）\index{Mini-batch Gradient Descent}}}
 \vspace{0.5em}
 \parinterval 为了综合批量梯度下降和随机梯度下降的优缺点，在实际应用中一般采用这两个算法的折中\ \dash \ 小批量梯度下降。其思想是：每次迭代计算一小部分训练数据的损失函数，并对参数进行更新。这一小部分数据被称为一个批次（mini-batch或者batch）。小批量梯度下降的参数优化的目标函数如公式\eqref{eq:9-32}所示：
@@ -1275,7 +1275,7 @@ J({\bm \theta})&=&\frac{1}{m}\sum_{i=j}^{j+m-1}{L({\mathbi{x}}_i,\widetilde{\mat
 %----------------------------------------------------------------------------------------
 \vspace{0.5em}
-\noindent {\small\sffamily\bfseries{（1）数值微分\index{数值微分}（Numerical Differentiation）\index{Numerical Differentiation}}}
+\noindent {\small\sffamily\bfseries{1）数值微分\index{数值微分}（Numerical Differentiation）\index{Numerical Differentiation}}}
 \vspace{0.5em}
 \parinterval 数学上，梯度的求解其实就是求函数偏导的问题。导数是用极限来定义的，如公式\eqref{eq:9-33}所示：
@@ -1298,7 +1298,7 @@ J({\bm \theta})&=&\frac{1}{m}\sum_{i=j}^{j+m-1}{L({\mathbi{x}}_i,\widetilde{\mat
 %
 %----------------------------------------------------------------------------------------
-\noindent {\small\sffamily\bfseries{（2）符号微分\index{符号微分}（Symbolic Differentiation）\index{Symbolic Differentiation}}}
+\noindent {\small\sffamily\bfseries{2）符号微分\index{符号微分}（Symbolic Differentiation）\index{Symbolic Differentiation}}}
 \vspace{0.5em}
 \parinterval 顾名思义，符号微分就是通过建立符号表达式求解微分的方法：借助符号表达式和求导公式，推导出目标函数关于自变量的微分表达式，最后再带入具体数值得到微分结果。例如，对于表达式$ L({\bm \theta})={\mathbi{x}}\cdot {\bm \theta}+2{\bm \theta}^2 $，可以手动推导出微分表达式$ \frac{\partial L({\bm \theta})}{\partial {\bm \theta}}=\mathbi{x}+4{\bm \theta}  $，最后将具体数值$ \mathbi{x} = {(\begin{array}{cc} 2 & -3\end{array})} $和$ {\bm \theta} = {(\begin{array}{cc} -1 & 1\end{array})} $带入后，得到微分结果$\frac{\partial L({\bm \theta})}{\partial {\bm \theta}}= {(\begin{array}{cc} 2 & -3\end{array})}+4{(\begin{array}{cc} -1 & 1\end{array})}= {(\begin{array}{cc} -2 & 1\end{array})}$。
@@ -1334,7 +1334,7 @@ $+2x^2+x+1)$ & \ \ $(x^4+2x^3+2x^2+x+1)$ & $+6x+1$ \\
 %----------------------------------------------------------------------------------------
 \vspace{0.5em}
-\noindent {\small\sffamily\bfseries{（3）自动微分\index{自动微分}（Automatic Differentiation）\index{Automatic Differentiation}}}
+\noindent {\small\sffamily\bfseries{3）自动微分\index{自动微分}（Automatic Differentiation）\index{Automatic Differentiation}}}
 \vspace{0.5em}
 \parinterval  自动微分是一种介于数值微分和符号微分的方法：将符号微分应用于最基本的算子，如常数、幂函数、指数函数、对数函数、三角函数等，然后代入数值，保留中间结果，最后再应用于整个函数。通过这种方式，将复杂的微分变成了简单的步骤，这些步骤完全自动化，而且容易进行存储和计算。
@@ -1344,7 +1344,7 @@ $+2x^2+x+1)$ & \ \ $(x^4+2x^3+2x^2+x+1)$ & $+6x+1$ \\
 \parinterval  自动微分可以用一种{\small\sffamily\bfseries{反向模式}}\index{反向模式}（Reverse Mode\index{Reverse Mode}/Backward Mode\index{Backward Mode}）即反向传播思想进行描述\upcite{baydin2017automatic}。令${\mathbi{h}}_i$是神经网络的计算图中第$i$个节点的输出。反向模式的自动微分是要计算：
 \begin{eqnarray}
-\bar{{\mathbi{h}}_i} = \frac{\partial L}{\partial {\mathbi{h}}_i} \label{eq:reverse-mode-v}
+\bar{{\mathbi{h}}_i} &=& \frac{\partial L}{\partial {\mathbi{h}}_i} \label{eq:reverse-mode-v}
 \end{eqnarray}
 \noindent 这里，$\bar{{\mathbi{h}}_i}$表示损失函数$L$相对于${\mathbi{h}}_i$的梯度信息，它会被保存在节点$i$处。为了计算$\bar{{\mathbi{h}}_i}$，需要从网络的输出反向计算每一个节点处的梯度。具体实现时，这个过程由一个包括前向计算和反向计算的两阶段方法实现。
@@ -1404,7 +1404,7 @@ $+2x^2+x+1)$ & \ \ $(x^4+2x^3+2x^2+x+1)$ & $+6x+1$ \\
 %----------------------------------------------------------------------------------------
 \vspace{0.5em}
-\noindent {\small\sffamily\bfseries{（1）Momentum \index{Momentum}}}
+\noindent {\small\sffamily\bfseries{1）Momentum \index{Momentum}}}
 \vspace{0.5em}
 \parinterval  Momentum梯度下降算法的参数更新方式如公式\eqref{eq:9-34}和\eqref{eq:9-35}所示\footnote{在梯度下降算法的几种改进方法的公式中，其更新对象是某个具体参数而非参数矩阵，因此不再使用加粗样式。}：
@@ -1434,7 +1434,7 @@ v_t&=&\beta v_{t-1}+(1-\beta)\frac{\partial J}{\partial \theta_t}
 %----------------------------------------------------------------------------------------
 \vspace{0.5em}
-\noindent {\small\sffamily\bfseries{（2）AdaGrad \index{AdaGrad}}}
+\noindent {\small\sffamily\bfseries{2）AdaGrad \index{AdaGrad}}}
 \vspace{0.5em}
 \parinterval  在神经网络的学习中，学习率的设置很重要。学习率过小， 会导致学习花费过多时间；反过来，学习率过大，则会导致学习发散，甚至造成模型的“跑偏”。在深度学习实现过程中，有一种被称为学习率{\small\bfnew{衰减}}\index{衰减}（Decay）\index{Decay}的方法，即最初设置较大的学习率，随着学习的进行，使学习率逐渐减小，这种方法相当于将“全体”参数的学习率值一起降低。AdaGrad梯度下降算法进一步发展了这个思想\upcite{duchi2011adaptive}。
@@ -1454,7 +1454,7 @@ z_t&=&z_{t-1}+\frac{\partial J}{\partial {\theta}_t} \cdot \frac{\partial J}{\pa
 %----------------------------------------------------------------------------------------
 \vspace{0.5em}
-\noindent {\small\sffamily\bfseries{（3）RMSProp \index{RMSProp}}}
+\noindent {\small\sffamily\bfseries{3）RMSProp \index{RMSProp}}}
 \vspace{0.5em}
 \parinterval  RMSProp算法是一种自适应学习率的方法\upcite{tieleman2012rmsprop}，它是对AdaGrad算法的一种改进，可以避免AdaGrad算法中学习率不断单调下降以至于过早衰减的缺点。
@@ -1476,7 +1476,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
 %----------------------------------------------------------------------------------------
 \vspace{0.5em}
-\noindent {\small\sffamily\bfseries{（4）Adam \index{Adam} }}
+\noindent {\small\sffamily\bfseries{4）Adam \index{Adam} }}
 \vspace{0.5em}
 \parinterval  Adam梯度下降算法是在RMSProp算法的基础上进行改进的，可以将其看成是带有动量项的RMSProp算法\upcite{kingma2014adam}。该算法在自然语言处理领域非常流行。Adam 算法的参数更新方式如公式\eqref{eq:9-40}\eqref{eq:9-41}\eqref{eq:9-42}所示：
@@ -1637,16 +1637,16 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
 \item  $ h_i^k $：第$ k $层第$ i $个神经元的输出；
 \vspace{0.5em}
 \item  $ {\mathbi{h}}^k $：第$ k $层的输出。若第$ k $层有$ n $个神经元，则：
-       \begin{equation}
+       \begin{eqnarray}
-       {\mathbi{h}}^k=(h_1^k,h_2^k,\dots,h_n^k)
+       {\mathbi{h}}^k&=&(h_1^k,h_2^k,\dots,h_n^k)
-       \end{equation}
+       \end{eqnarray}
 \vspace{0.5em}
 \item  $ w_{j,i}^k $：第$ k-1 $层神经元$ j $与第$ k $层神经元$ i $的连接权重；
 \vspace{0.5em}
 \item  $ {\mathbi{W}}^k $：第$ k-1 $层与第$ k $层的连接权重。若第$ k-1 $层有$ m $个神经元，第$ k $层有$ n $个神经元，则：
-       \begin{equation}
+       \begin{eqnarray}
-       {\mathbi{W}}^k = \begin{pmatrix} w_{1,1}^k & w_{1,2}^k & \dots & w_{1,n}^k\\w_{2,1}^k & \dots & \dots & \dots\\ \dots & \dots & \dots & \dots\\w_{m,1}^k & \dots & \dots & w_{m,n}^k\end{pmatrix}
+       {\mathbi{W}}^k &=& \begin{pmatrix} w_{1,1}^k & w_{1,2}^k & \dots & w_{1,n}^k\\w_{2,1}^k & \dots & \dots & \dots\\ \dots & \dots & \dots & \dots\\w_{m,1}^k & \dots & \dots & w_{m,n}^k\end{pmatrix}
-       \end{equation}
+       \end{eqnarray}
 \vspace{0.5em}
 \item  $ {\mathbi{h}}^K $：整个网络的输出；
 \vspace{0.5em}
@@ -1942,7 +1942,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
 \parinterval Softmax($\cdot$)的作用是根据输入的$|V|$维向量（即${\mathbi{h}}_0{\mathbi{U}}$），得到一个$|V|$维的分布。令${\bm \tau}$表示Softmax($\cdot$)的输入向量，Softmax函数可以被定义为公式\eqref{eq:9-120}：
 \begin{eqnarray}
-\textrm{Softmax}(\tau_i)=\frac{\textrm{exp}(\tau_i)}  {\sum_{i'=1}^{|V|} \textrm{exp}(\tau_{i'})}
+\textrm{Softmax}(\tau_i)&=&\frac{\textrm{exp}(\tau_i)}  {\sum_{i'=1}^{|V|} \textrm{exp}(\tau_{i'})}
 \label{eq:9-120}
 \end{eqnarray}

--- a/bibliography.bib
+++ b/bibliography.bib