Merge branch 'caorunzhe' into 'master'

Caorunzhe: see merge request !790

Commit 682f6c21 authored Jan 06, 2021 by 曹润柘
--- a/Chapter10/chapter10.tex
+++ b/Chapter10/chapter10.tex
@@ -25,7 +25,7 @@
 \parinterval {\small\sffamily\bfseries{神经机器翻译}} \index{神经机器翻译}（Neural Machine Translation）\index{Neural Machine Translation}是机器翻译的前沿方法。近几年，随着深度学习技术的发展和在各领域中的深入应用，基于端到端表示学习的方法正在改变着我们处理自然语言的方式，神经机器翻译在这种趋势下应运而生。一方面，神经机器翻译仍然延续着统计建模和基于数据驱动的思想，因此在基本问题的定义上与前人的研究是一致的；另一方面，神经机器翻译脱离了统计机器翻译中对隐含翻译结构的假设，同时使用分布式表示来对文字序列进行建模，这使得它可以从一个全新的视角看待翻译问题。现在，神经机器翻译已经成为了机器翻译研究及应用的热点，译文质量得到了巨大的提升。
-\parinterval 本章将介绍神经机器翻译中的一种基础模型\ \dash \ 基于循环神经网络的模型。该模型是神经机器翻译中最早被成功应用的模型之一。基于这个模型框架，研究者进行了大量的探索和改进工作，包括使用LSTM等循环单元结构、引入注意力机制等。这些内容都会在本章进行讨论。
+\parinterval 本章将介绍神经机器翻译中的一种基础模型\ \dash \ 基于循环神经网络的模型。该模型是神经机器翻译中最早被成功应用的模型之一。基于这个模型框架，研究人员进行了大量的探索和改进工作，包括使用LSTM等循环单元结构、引入注意力机制等。这些内容都会在本章进行讨论。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION  10.1
@@ -35,7 +35,7 @@
 \parinterval 纵观机器翻译的发展历程，神经机器翻译诞生较晚。无论是早期的基于规则的方法，还是逐渐发展起来的基于实例的方法，再或是上世纪末的统计方法，每次机器翻译框架级的创新都需要很长时间的酝酿，而技术走向成熟甚至需要更长的时间。但是，神经机器翻译的出现和后来的发展速度多少有些“出人意料”。神经机器翻译的概念出现在2013-2014年间，当时机器翻译领域的主流方法仍然是统计机器翻译。虽然那个时期深度学习已经在图像、语音等领域取得令人瞩目的效果，但是对于自然语言处理来说深度学习仍然不是主流。
-\parinterval 不过，有人也意识到了神经机器翻译在表示学习等方面的优势。这一时期，很多研究团队对包括机器翻译在内的序列到序列问题进行了广泛而深入的研究，注意力机制等新的方法不断被推出。这使得神经机器翻译系统在翻译品质上逐渐体现出优势，甚至超越了当时的统计机器翻译系统。正当大家在讨论神经机器翻译是否能取代统计机器翻译成为下一代机器翻译范式的时候，一些互联网企业推出了以神经机器翻译技术为内核的在线机器翻译服务，在很多场景下的翻译品质显著超越了当时最好的统计机器翻译系统。这也引发了学术界和产业界对神经机器翻译的讨论。随着关注度的不断升高，神经机器翻译的研究吸引了更多的科研机构和企业的投入，神经机器翻译系统的翻译品质得到进一步提升。
+\parinterval 不过，研究人员也意识到了神经机器翻译在表示学习等方面的优势。这一时期，很多研究团队对包括机器翻译在内的序列到序列问题进行了广泛而深入的研究，注意力机制等新的方法不断被推出。这使得神经机器翻译系统在翻译品质上逐渐体现出优势，甚至超越了当时的统计机器翻译系统。正当大家在讨论神经机器翻译是否能取代统计机器翻译成为下一代机器翻译范式的时候，一些互联网企业推出了以神经机器翻译技术为内核的在线机器翻译服务，在很多场景下的翻译品质显著超越了当时最好的统计机器翻译系统。这也引发了学术界和产业界对神经机器翻译的讨论。随着关注度的不断升高，神经机器翻译的研究吸引了更多的科研机构和企业的投入，神经机器翻译系统的翻译品质得到进一步提升。
 \parinterval 在短短5-6年间，神经机器翻译从一个新生的概念已经成长为机器翻译领域的最前沿技术之一，在各种机器翻译评测和应用中呈全面替代统计机器翻译之势。比如，从近几年WMT、CCMT等评测的结果来看，神经机器翻译已经处于绝对的统治地位，在不同语种和领域的翻译任务中，成为各参赛系统的标配。此外，从ACL等自然语言处理顶级会议的发表论文看，神经机器翻译在论文数量上呈明显的增长趋势，这也体现了学术界对该方法的热情。至今，国内外的很多机构都推出了自己研发的神经机器翻译系统，整个研究和产业生态欣欣向荣。图\ref{fig:10-1}展示了包含神经机器翻译在内的机器翻译发展简史。
@@ -48,11 +48,11 @@
 \end{figure}
 %----------------------------------------------
-\parinterval 神经机器翻译的迅速崛起确实让所有人都有些措手不及，甚至有一种一觉醒来天翻地覆的感觉。也有人评价，神经机器翻译的出现给整个机器翻译领域带来了前所未有的发展机遇。不过，客观地看，机器翻译达到今天这样的状态也是一种历史必然，其中有几方面原因：
+\parinterval 神经机器翻译的迅速崛起确实让所有研究人员都有些措手不及，甚至有一种一觉醒来天翻地覆的感觉。也有研究人员评价，神经机器翻译的出现给整个机器翻译领域带来了前所未有的发展机遇。不过，客观地看，机器翻译达到今天这样的状态也是一种历史必然，其中有几方面原因：
 \begin{itemize}
 \vspace{0.3em}
-\item 自上世纪末所发展起来的基于数据驱动的方法为神经机器翻译提供了很好的基础。本质上，神经机器翻译仍然是一种基于统计建模的数据驱动的方法，因此无论是对问题的基本建模方式，还是训练统计模型所使用到的带标注数据，都可以复用机器翻译领域以前的研究成果。特别是机器翻译长期的发展已经积累了大量的双语、单语数据，这些数据在统计机器翻译时代就发挥了很大作用。随着时间的推移，数据规模和质量又得到进一步提升，包括一些评测基准、任务设置都已经非常完备，研究者可以直接在数据条件全部具备的情况下开展神经机器翻译的研究工作，这些都节省了大量的时间成本。从这个角度说，神经机器翻译是站在巨人的肩膀上才发展起来的。
+\item 自上世纪末所发展起来的基于数据驱动的方法为神经机器翻译提供了很好的基础。本质上，神经机器翻译仍然是一种基于统计建模的数据驱动的方法，因此无论是对问题的基本建模方式，还是训练统计模型所使用到的带标注数据，都可以复用机器翻译领域以前的研究成果。特别是机器翻译长期的发展已经积累了大量的双语、单语数据，这些数据在统计机器翻译时代就发挥了很大作用。随着时间的推移，数据规模和质量又得到进一步提升，包括一些评测基准、任务设置都已经非常完备，研究人员可以直接在数据条件全部具备的情况下开展神经机器翻译的研究工作，这些都节省了大量的时间成本。从这个角度说，神经机器翻译是站在巨人的肩膀上才发展起来的。
 \vspace{0.3em}
 \item 深度学习经过长时间的酝酿终于爆发，为机器翻译等自然语言处理任务提供了新的思路和技术手段。神经机器翻译的不断壮大伴随着深度学习技术的发展。在深度学习的视角下，语言文字可以被表示成抽象的实数向量，这种文字的表示结果可以被自动学习，为机器翻译建模提供了更大的灵活性。相对于神经机器翻译，深度学习的发展更加曲折。虽然深度学习经过了漫长的起伏过程，但是神经机器翻译恰好出现在深度学习逐渐走向成熟的阶段。反过来说，受到深度学习及相关技术空前发展的影响，自然语言处理的范式也发生了变化，神经机器翻译的出现只是这种趋势下的一种必然。
 \vspace{0.3em}
@@ -68,9 +68,9 @@
 %    NEW SUB-SECTION 10.1.1
 %----------------------------------------------------------------------------------------
 \subsection{神经机器翻译的起源}
-\parinterval 从广义上讲，神经机器翻译是一种基于人工神经网络的方法，它把翻译过程描述为可以用人工神经网络表示的函数，所有的训练和推断都在这些函数上进行。由于神经机器翻译中的神经网络可以用连续可微函数表示，因此这类方法也可以用基于梯度的方法进行优化，相关技术非常成熟。更为重要的是，在神经网络的设计中，研究者引入了分布式表示的概念，这也是近些年自然语言处理领域的重要成果之一。传统统计机器翻译仍然把词序列看作离散空间里的由多个特征函数描述的点，类似于$n$-gram语言模型，这类模型对数据稀疏问题非常敏感。此外，人工设计特征也在一定程度上限制了模型对问题的表示能力。神经机器翻译把文字序列表示为实数向量，一方面避免了特征工程繁重的工作，另一方面使得系统可以对文字序列的“表示”进行学习。可以说，神经机器翻译的成功很大程度上源自“ 表示学习”这种自然语言处理的新范式的出现。在表示学习的基础上，注意力机制、深度神经网络等技术都被应用于神经机器翻译，使其得以进一步发展。
+\parinterval 从广义上讲，神经机器翻译是一种基于人工神经网络的方法，它把翻译过程描述为可以用人工神经网络表示的函数，所有的训练和推断都在这些函数上进行。由于神经机器翻译中的神经网络可以用连续可微函数表示，因此这类方法也可以用基于梯度的方法进行优化，相关技术非常成熟。更为重要的是，在神经网络的设计中，研究人员引入了分布式表示的概念，这也是近些年自然语言处理领域的重要成果之一。传统统计机器翻译仍然把词序列看作离散空间里的由多个特征函数描述的点，类似于$n$-gram语言模型，这类模型对数据稀疏问题非常敏感。此外，人工设计特征也在一定程度上限制了模型对问题的表示能力。神经机器翻译把文字序列表示为实数向量，一方面避免了特征工程繁重的工作，另一方面使得系统可以对文字序列的“表示”进行学习。可以说，神经机器翻译的成功很大程度上源自“ 表示学习”这种自然语言处理的新范式的出现。在表示学习的基础上，注意力机制、深度神经网络等技术都被应用于神经机器翻译，使其得以进一步发展。
-\parinterval 虽然神经机器翻译中大量地使用了人工神经网络方法，但是它并不是最早在机器翻译中使用人工神经网络的框架。实际上，人工神经网络在机器翻译中应用的历史要远早于现在的神经机器翻译。 在统计机器翻译时代，也有很多研究者利用人工神经网络进行机器翻译系统模块的构建\upcite{devlin-etal-2014-fast,Schwenk_continuousspace}，比如，研究人员成功地在统计机器翻译系统中使用了基于神经网络的联合表示模型，取得了很好的效果\upcite{devlin-etal-2014-fast}。
+\parinterval 虽然神经机器翻译中大量地使用了人工神经网络方法，但是它并不是最早在机器翻译中使用人工神经网络的框架。实际上，人工神经网络在机器翻译中应用的历史要远早于现在的神经机器翻译。 在统计机器翻译时代，也有很多研究人员利用人工神经网络进行机器翻译系统模块的构建\upcite{devlin-etal-2014-fast,Schwenk_continuousspace}，比如，研究人员成功地在统计机器翻译系统中使用了基于神经网络的联合表示模型，取得了很好的效果\upcite{devlin-etal-2014-fast}。
 \parinterval 不过，以上这些工作大多都是在系统的局部模块中使用人工神经网络和深度学习方法。与之不同的是，神经机器翻译是用人工神经网络完成整个翻译过程的建模，这样做的一个好处是，整个系统可以进行端到端学习，无需引入对任何翻译的隐含结构假设。这种利用端到端学习对机器翻译进行神经网络建模的方式也就成为了现在大家所熟知的神经机器翻译。这里简单列出部分代表性的工作：
@@ -82,7 +82,7 @@
 \vspace{0.3em}
 \item 同年Dzmitry Bahdanau等人首次将{\small\bfnew{注意力机制}}\index{注意力机制}（Attention Mechanism\index{Attention Mechanism}）应用到机器翻译领域，在机器翻译任务上对翻译和局部翻译单元之间的对应关系同时建模\upcite{bahdanau2014neural}。Bahdanau等人工作的意义在于，使用了更加有效的模型来表示源语言的信息，同时使用注意力机制对两种语言不同部分之间的相互联系进行建模。这种方法可以有效地处理长句子的翻译，而且注意力的中间结果具有一定的可解释性\footnote{比如，目标语言和源语言句子不同单词之间的注意力强度能够在一定程度上反应单词之间的互译程度。} 。然而相比于前人的神经机器翻译模型，注意力模型也引入了额外的成本，计算量较大。
 \vspace{0.3em}
-\item 2016年谷歌公司发布了基于多层循环神经网络方法的GNMT系统。该系统集成了当时的神经机器翻译技术，并进行了诸多的改进。它的性能显著优于基于短语的机器翻译系统\upcite{Wu2016GooglesNM}，引起了研究者的广泛关注。在之后不到一年的时间里，脸书公司采用卷积神经网络（CNN）研发了新的神经机器翻译系统\upcite{DBLP:journals/corr/GehringAGYD17}，实现了比基于循环神经网络（RNN）系统更高的翻译水平，并大幅提升翻译速度。
+\item 2016年谷歌公司发布了基于多层循环神经网络方法的GNMT系统。该系统集成了当时的神经机器翻译技术，并进行了诸多的改进。它的性能显著优于基于短语的机器翻译系统\upcite{Wu2016GooglesNM}，引起了研究人员的广泛关注。在之后不到一年的时间里，脸书公司采用卷积神经网络（CNN）研发了新的神经机器翻译系统\upcite{DBLP:journals/corr/GehringAGYD17}，实现了比基于循环神经网络（RNN）系统更高的翻译水平，并大幅提升翻译速度。
 \vspace{0.3em}
 \item 2017年，Ashish Vaswani等人提出了新的翻译模型Transformer。其完全摒弃了循环神经网络和卷积神经网络，仅仅通过多头注意力机制和前馈神经网络，不需要使用序列对齐的循环框架就展示出强大的性能，并且巧妙地解决了翻译中长距离依赖问题\upcite{vaswani2017attention}。Transformer是第一个完全基于注意力机制搭建的模型，不仅训练速度更快，在翻译任务上也获得了更好的结果，一跃成为目前最主流的神经机器翻译框架。
 \vspace{0.3em}
@@ -141,7 +141,7 @@
 \end{figure}
 %----------------------------------------------
-\parinterval  神经机器翻译在其他评价指标上的表现也全面超越统计机器翻译。比如，在IWSLT 2015英语-德语任务中，研究者搭建了四个较为先进的机器翻译系统\upcite{Bentivogli2016NeuralVP}：
+\parinterval  神经机器翻译在其他评价指标上的表现也全面超越统计机器翻译。比如，在IWSLT 2015英语-德语任务中，研究人员搭建了四个较为先进的机器翻译系统\upcite{Bentivogli2016NeuralVP}：
 \begin{itemize}
 \vspace{0.3em}
@@ -253,7 +253,7 @@ NMT                     & 21.7          & 18.7           & -13.7      \\
 \vspace{0.5em}
 \end{itemize}
-\parinterval  当然，神经机器翻译也并不完美，很多问题有待解决。首先，神经机器翻译需要大规模浮点运算的支持，模型的推断速度较低。为了获得优质的翻译结果，往往需要大量GPU设备的支持，计算资源成本很高；其次，由于缺乏人类的先验知识对翻译过程的指导，神经机器翻译的运行过程缺乏可解释性，系统的可干预性也较差；此外，虽然脱离了繁重的特征工程，神经机器翻译仍然需要人工设计网络结构，在模型的各种超参数的设置、训练策略的选择等方面，仍然需要大量的人工参与。这也导致很多实验结果不容易复现。显然，完全不依赖人工的机器翻译还很遥远。不过，随着研究者的不断攻关，很多问题也得到了解决。
+\parinterval  当然，神经机器翻译也并不完美，很多问题有待解决。首先，神经机器翻译需要大规模浮点运算的支持，模型的推断速度较低。为了获得优质的翻译结果，往往需要大量GPU设备的支持，计算资源成本很高；其次，由于缺乏人类的先验知识对翻译过程的指导，神经机器翻译的运行过程缺乏可解释性，系统的可干预性也较差；此外，虽然脱离了繁重的特征工程，神经机器翻译仍然需要人工设计网络结构，在模型的各种超参数的设置、训练策略的选择等方面，仍然需要大量的人工参与。这也导致很多实验结果不容易复现。显然，完全不依赖人工的机器翻译还很遥远。不过，随着研究人员的不断攻关，很多问题也得到了解决。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION  10.2
@@ -272,7 +272,7 @@ NMT                     & 21.7          & 18.7           & -13.7      \\
 \parinterval  编码器-解码器框架是一种典型的基于“表示”的模型。编码器的作用是将输入的文字序列通过某种转换变为一种新的“表示”形式，这种“表示”包含了输入序列的所有信息。之后，解码器把这种“表示”重新转换为输出的文字序列。这其中的一个核心问题是表示学习，即：如何定义对输入文字序列的表示形式，并自动学习这种表示，同时应用它生成输出序列。一般来说，不同的表示学习方法可以对应不同的机器翻译模型，比如，在最初的神经机器翻译模型中，源语言句子都被表示为一个独立的向量，这时表示结果是静态的；而在注意力机制中，源语言句子的表示是动态的，也就是翻译目标语言的每个单词时都会使用不同的表示结果。
-\parinterval  图\ref{fig:10-5}是一个应用编码器-解码器结构来解决机器翻译问题的简单实例。给定一个中文句子“我/对/你/感到/满意”，编码器会将这句话编码成一个实数向量$(0.2, -1, 6, \\ 5, 0.7, -2)$，这个向量就是源语言句子的“表示”结果。虽然有些不可思议，但是神经机器翻译模型把这个向量等同于输入序列。向量中的数字并没有实际的意义，然而解码器却能从中提取到源语言句子中所包含的信息。也有研究者把向量的每一个维度看作是一个“特征”，这样源语言句子就被表示成多个“特征”的联合，而且这些特征可以被自动学习。有了这样的源语言句子的“表示”，解码器可以把这个实数向量作为输入，然后逐词生成目标语言句子“I am satisfied with you”。
+\parinterval  图\ref{fig:10-5}是一个应用编码器-解码器结构来解决机器翻译问题的简单实例。给定一个中文句子“我/对/你/感到/满意”，编码器会将这句话编码成一个实数向量$(0.2, -1, 6, \\ 5, 0.7, -2)$，这个向量就是源语言句子的“表示”结果。虽然有些不可思议，但是神经机器翻译模型把这个向量等同于输入序列。向量中的数字并没有实际的意义，然而解码器却能从中提取到源语言句子中所包含的信息。也有研究人员把向量的每一个维度看作是一个“特征”，这样源语言句子就被表示成多个“特征”的联合，而且这些特征可以被自动学习。有了这样的源语言句子的“表示”，解码器可以把这个实数向量作为输入，然后逐词生成目标语言句子“I am satisfied with you”。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -415,7 +415,7 @@ NMT                     & 21.7          & 18.7           & -13.7      \\
 \parinterval 显然，根据上下文中提到的“没/吃饭”、“很/饿”，最佳的答案是“吃饭”或者“吃东西”。也就是，对序列中某个位置的答案进行预测时需要记忆当前时刻之前的序列信息，因此，循环神经网络应运而生。实际上循环神经网络有着极为广泛的应用，例如语音识别、语言建模以及即将要介绍的神经机器翻译。
-\parinterval {\chapternine}已经对循环神经网络的基本知识进行过介绍，这里再回顾一下。简单来说，循环神经网络由循环单元组成。对于序列中的任意时刻，都有一个循环单元与之对应，它会融合当前时刻的输入和上一时刻循环单元的输出，生成当前时刻的输出。这样每个时刻的信息都会被传递到下一时刻，这也间接达到了记录历史信息的目的。比如，对于序列$\seq{x}=\{x_1, x_2,..., x_m\}$，循环神经网络会按顺序输出一个序列$\seq{h}=\{ \mathbi{h}_1, \mathbi{h}_2,..., \mathbi{h}_m \}$，其中$\mathbi{h}_i$表示$i$时刻循环神经网络的输出（通常为一个向量）。
+\parinterval {\chapternine}已经对循环神经网络的基本知识进行过介绍，这里再回顾一下。简单来说，循环神经网络由循环单元组成。对于序列中的任意时刻，都有一个循环单元与之对应，它会融合当前时刻的输入和上一时刻循环单元的输出，生成当前时刻的输出。这样每个时刻的信息都会被传递到下一时刻，这也间接达到了记录历史信息的目的。比如，对于序列$\seq{x}=\{x_1,..., x_m\}$，循环神经网络会按顺序输出一个序列$\seq{h}=\{ \mathbi{h}_1,..., \mathbi{h}_m \}$，其中$\mathbi{h}_i$表示$i$时刻循环神经网络的输出（通常为一个向量）。
 \parinterval 图\ref{fig:10-8}展示了一个循环神经网络处理序列问题的实例。当前时刻循环单元的输入由上一个时刻的输出和当前时刻的输入组成，因此也可以理解为，网络当前时刻计算得到的输出是由之前的序列共同决定的，即网络在不断地传递信息的过程中记忆了历史信息。以最后一个时刻的循环单元为例，它在对“开始”这个单词的信息进行处理时，参考了之前所有词（“<sos>\ 让\ 我们”）的信息。
@@ -445,14 +445,14 @@ NMT                     & 21.7          & 18.7           & -13.7      \\
 \label{eq:10-1}
 \end{eqnarray}
-\noindent 这里，用$\seq{{x}}=\{ x_1,x_2,..., x_m \}$表示输入的源语言单词序列，$\seq{{y}}=\{ y_1,y_2,..., y_n \}$ 表示生成的目标语言单词序列。由于神经机器翻译在生成译文时采用的是自左向右逐词生成的方式，并在翻译每个单词时考虑已经生成的翻译结果，因此对$ \funp{P} (\seq{{y}} | \seq{{x}})$的求解可以转换为下式：
+\noindent 这里，用$\seq{{x}}=\{ x_1,..., x_m \}$表示输入的源语言单词序列，$\seq{{y}}=\{ y_1,..., y_n \}$ 表示生成的目标语言单词序列。由于神经机器翻译在生成译文时采用的是自左向右逐词生成的方式，并在翻译每个单词时考虑已经生成的翻译结果，因此对$ \funp{P} (\seq{{y}} | \seq{{x}})$的求解可以转换为下式：
 \begin{eqnarray}
 \funp{P} (\seq{{y}} | \seq{{x}}) &=& \prod_{j=1}^{n} \funp{P} ( y_j | \seq{{y}}_{<j }, \seq{{x}}  )
 \label{eq:10-2}
 \end{eqnarray}
 \vspace{-0.5em}
-\noindent 其中，$ \seq{{y}}_{<j }$表示目标语言第$j$个位置之前已经生成的译文单词序列。$ \funp{P} ( y_j | \seq{{y}}_{<j }, \seq{{x}})$可以被解释为：根据源语言句子$\seq{{x}} $和已生成的目标语言译文片段$\seq{{y}}_{<j }=\{ y_1, y_2,..., y_{j-1} \}$,生成第$j$个目标语言单词$y_j$的概率。
+\noindent 其中，$ \seq{{y}}_{<j }$表示目标语言第$j$个位置之前已经生成的译文单词序列。$ \funp{P} ( y_j | \seq{{y}}_{<j }, \seq{{x}})$可以被解释为：根据源语言句子$\seq{{x}} $和已生成的目标语言译文片段$\seq{{y}}_{<j }=\{ y_1,..., y_{j-1} \}$,生成第$j$个目标语言单词$y_j$的概率。
 \parinterval 求解$\funp{P}(y_j | \seq{{y}}_{<j},\seq{{x}})$有三个关键问题（图\ref{fig:10-10}）：
@@ -490,7 +490,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 %----------------------------------------------
 \parinterval 输入层（词嵌入）和输出层（Softmax）的内容已在{\chapternine}进行了介绍，因此这里的核心内容是设计循环神经网络结构，即设计循环单元的结构。至今，研究人员已经提出了很多优秀的循环单元结构。其中循环神经网络（RNN）
-是最原始的循环单元结构。在RNN中，对于序列$\seq{{x}}=\{ \mathbi{x}_1, \mathbi{x}_2,...,\mathbi{x}_m \}$，每个时刻$t$都对应一个循环单元，它的输出是一个向量$\mathbi{h}_t$，可以被描述为：
+是最原始的循环单元结构。在RNN中，对于序列$\seq{{x}}=\{ \mathbi{x}_1,...,\mathbi{x}_m \}$，每个时刻$t$都对应一个循环单元，它的输出是一个向量$\mathbi{h}_t$，可以被描述为：
 \begin{eqnarray}
 \mathbi{h}_t &=& f(\mathbi{x}_t \mathbi{U}+\mathbi{h}_{t-1} \mathbi{W}+\mathbi{b})
 \label{eq:10-5}
@@ -507,7 +507,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \subsection{长短时记忆网络}
 \label{sec:lstm-cell}
-\parinterval RNN结构使得当前时刻循环单元的状态包含了之前时间步的状态信息。但是这种对历史信息的记忆并不是无损的，随着序列变长，RNN的记忆信息的损失越来越严重。在很多长序列处理任务中（如长文本生成）都观测到了类似现象。对于这个问题，研究者们提出了{\small\bfnew{长短时记忆}}\index{长短时记忆}（Long Short-term Memory）\index{Long Short-term Memory}模型，也就是常说的LSTM模型\upcite{HochreiterLong}。
+\parinterval RNN结构使得当前时刻循环单元的状态包含了之前时间步的状态信息。但是这种对历史信息的记忆并不是无损的，随着序列变长，RNN的记忆信息的损失越来越严重。在很多长序列处理任务中（如长文本生成）都观测到了类似现象。对于这个问题，研究人员提出了{\small\bfnew{长短时记忆}}\index{长短时记忆}（Long Short-term Memory）\index{Long Short-term Memory}模型，也就是常说的LSTM模型\upcite{HochreiterLong}。
 \parinterval LSTM模型是RNN模型的一种改进。相比RNN仅传递前一时刻的状态$\mathbi{h}_{t-1}$，LSTM会同时传递两部分信息：状态信息$\mathbi{h}_{t-1}$和记忆信息$\mathbi{c}_{t-1}$。这里，$\mathbi{c}_{t-1}$是新引入的变量，它也是循环单元的一部分，用于显性地记录需要记录的历史内容，$\mathbi{h}_{t-1}$和$\mathbi{c}_{t-1}$在循环单元中会相互作用。LSTM通过“门”单元来动态地选择遗忘多少以前的信息和记忆多少当前的信息。LSTM中所使用的门单元结构如图\ref{fig:10-11}所示，包括遗忘门，输入门和输出门。图中$\sigma$代表Sigmoid函数，它将函数输入映射为0-1范围内的实数，用来充当门控信号。
@@ -719,13 +719,13 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \parinterval 神经机器翻译中，注意力机制的核心是：针对不同目标语言单词生成不同的上下文向量。这里，可以将注意力机制看做是一种对接收到的信息的加权处理。对于更重要的信息赋予更高的权重即更高的关注度，对于贡献度较低的信息分配较低的权重，弱化其对结果的影响。这样，$\mathbi{C}_j$可以包含更多对当前目标语言位置有贡献的源语言片段的信息。
-\parinterval 根据这种思想，上下文向量$\mathbi{C}_j$被定义为对不同时间步编码器输出的状态序列$\{ \mathbi{h}_1, \mathbi{h}_2,...,\mathbi{h}_m \}$进行加权求和，如下式：
+\parinterval 根据这种思想，上下文向量$\mathbi{C}_j$被定义为对不同时间步编码器输出的状态序列$\{ \mathbi{h}_1,...,\mathbi{h}_m \}$进行加权求和，如下式：
 \begin{eqnarray}
 \mathbi{C}_j&=&\sum_{i} \alpha_{i,j} \mathbi{h}_i
 \label{eq:10-16}
 \end{eqnarray}
-\noindent 其中，$\alpha_{i,j}$是{\small\sffamily\bfseries{注意力权重}}\index{注意力权重}（Attention Weight）\index{Attention Weight}，它表示目标语言第$j$个位置与源语言第$i$个位置之间的相关性大小。这里，将每个时间步编码器的输出$\mathbi{h}_i$ 看作源语言位置$i$的表示结果。进行翻译时，解码器可以根据当前的位置$j$，通过控制不同$\mathbi{h}_i$的权重得到$\mathbi{C}_j$，使得对目标语言位置$j$贡献大的$\mathbi{h}_i$对$\mathbi{C}_j$的影响增大。也就是说，$\mathbi{C}_j$实际上就是\{${\mathbi{h}_1, \mathbi{h}_2,...,\mathbi{h}_m}$\}的一种组合，只不过不同的$\mathbi{h}_i$会根据对目标端的贡献给予不同的权重。图\ref{fig:10-19}展示了上下文向量$\mathbi{C}_j$的计算过程。
+\noindent 其中，$\alpha_{i,j}$是{\small\sffamily\bfseries{注意力权重}}\index{注意力权重}（Attention Weight）\index{Attention Weight}，它表示目标语言第$j$个位置与源语言第$i$个位置之间的相关性大小。这里，将每个时间步编码器的输出$\mathbi{h}_i$ 看作源语言位置$i$的表示结果。进行翻译时，解码器可以根据当前的位置$j$，通过控制不同$\mathbi{h}_i$的权重得到$\mathbi{C}_j$，使得对目标语言位置$j$贡献大的$\mathbi{h}_i$对$\mathbi{C}_j$的影响增大。也就是说，$\mathbi{C}_j$实际上就是\{${\mathbi{h}_1,...,\mathbi{h}_m}$\}的一种组合，只不过不同的$\mathbi{h}_i$会根据对目标端的贡献给予不同的权重。图\ref{fig:10-19}展示了上下文向量$\mathbi{C}_j$的计算过程。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -876,7 +876,7 @@ a (\mathbi{s},\mathbi{h}) &=&  \left\{ \begin{array}{ll}
 \subsection{实例 - GNMT}
 \vspace{0.5em}
-\parinterval 循环神经网络在机器翻译中有很多成功的应用，比如：RNNSearch\upcite{bahdanau2014neural}、Nematus\upcite{DBLP:journals/corr/SennrichFCBHHJL17}等系统就被很多研究者作为实验系统。在众多基于循环神经网络的系统中，GNMT系统是非常成功的一个\upcite{Wu2016GooglesNM}。GNMT是谷歌2016年发布的神经机器翻译系统。
+\parinterval 循环神经网络在机器翻译中有很多成功的应用，比如：RNNSearch\upcite{bahdanau2014neural}、Nematus\upcite{DBLP:journals/corr/SennrichFCBHHJL17}等系统就被很多研究人员作为实验系统。在众多基于循环神经网络的系统中，GNMT系统是非常成功的一个\upcite{Wu2016GooglesNM}。GNMT是谷歌2016年发布的神经机器翻译系统。
 \parinterval GNMT使用了编码器-解码器结构，构建了一个8层的深度网络，每层网络均由LSTM组成，且在编码器-解码器之间使用了多层注意力连接。其结构如图\ref{fig:10-24}，编码器只有最下面2层为双向LSTM。GNMT在束搜索中也加入了长度惩罚和覆盖度因子来确保输出高质量的翻译结果。
 \vspace{0.5em}
@@ -945,7 +945,7 @@ L_{\textrm{ce}}(\mathbi{y},\hat{\mathbi{y}}) &=& - \sum_{k=1}^{|V|} \mathbi{y}[k
 \label{eq:10-25}
 \end{eqnarray}
-\noindent 其中$\mathbi{y}[k]$ 和$\hat{\mathbi{y}}[k]$分别表示向量$\mathbi{y}$和$\hat{\mathbi{y}}$的第$k$维，$|V|$表示输出向量的维度（等于词表大小）。假设有$n$个训练样本，模型输出的概率分布为$\mathbi{Y} = \{ \mathbi{y}_1,\mathbi{y}_2,..., \mathbi{y}_n \}$，标准答案的分布$\widehat{\mathbi{Y}}=\{ \hat{\mathbi{y}}_1, \hat{\mathbi{y}}_2,...,\hat{\mathbi{y}}_n \}$。这个训练样本集合上的损失函数可以被定义为：
+\noindent 其中$\mathbi{y}[k]$ 和$\hat{\mathbi{y}}[k]$分别表示向量$\mathbi{y}$和$\hat{\mathbi{y}}$的第$k$维，$|V|$表示输出向量的维度（等于词表大小）。假设有$n$个训练样本，模型输出的概率分布为$\mathbi{Y} = \{ \mathbi{y}_1,..., \mathbi{y}_n \}$，标准答案的分布$\widehat{\mathbi{Y}}=\{ \hat{\mathbi{y}}_1,...,\hat{\mathbi{y}}_n \}$。这个训练样本集合上的损失函数可以被定义为：
 \begin{eqnarray}
 L(\mathbi{Y},\widehat{\mathbi{Y}}) &=& \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\hat{\mathbi{y}}_j)
 \label{eq:10-26}
@@ -1187,7 +1187,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) &=& \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j
 \subsubsection{2. 束搜索}
 \vspace{0.5em}
-\parinterval 束搜索是一种启发式图搜索算法。相比于全搜索，它可以减少搜索所占用的空间和时间，在每一步扩展的时候，剪掉一些质量比较差的结点，保留下一些质量较高的结点。具体到机器翻译任务，对于每一个目标语言位置，束搜索选择了概率最大的前$k$个单词进行扩展（其中$k$叫做束宽度，或简称为束宽）。如图\ref{fig:10-31}所示，假设\{$y_1, y_2,..., y_n$\}表示生成的目标语言序列，且$k=3$，则束搜索的具体过程为：在预测第一个位置时，可以通过模型得到$y_1$的概率分布，选取概率最大的前3个单词作为候选结果（假设分别为“have”, “has”, “it”）。在预测第二个位置的单词时，模型针对已经得到的三个候选结果（“have”, “has”, “it”）计算第二个单词的概率分布。因为$y_2$对应$|V|$种可能，总共可以得到$3 \times |V|$种结果。然后从中选取使序列概率$\funp{P}(y_2,y_1| \seq{{x}})$最大的前三个$y_2$作为新的输出结果，这样便得到了前两个位置的top-3译文。在预测其他位置时也是如此，不断重复此过程直到推断结束。可以看到，束搜索的搜索空间大小与束宽度有关，也就是：束宽度越大，搜索空间越大，更有可能搜索到质量更高的译文，但同时搜索会更慢。束宽度等于3，意味着每次只考虑三个最有可能的结果，贪婪搜索实际上便是束宽度为1的情况。在神经机器翻译系统实现中，一般束宽度设置在4～8之间。
+\parinterval 束搜索是一种启发式图搜索算法。相比于全搜索，它可以减少搜索所占用的空间和时间，在每一步扩展的时候，剪掉一些质量比较差的结点，保留下一些质量较高的结点。具体到机器翻译任务，对于每一个目标语言位置，束搜索选择了概率最大的前$k$个单词进行扩展（其中$k$叫做束宽度，或简称为束宽）。如图\ref{fig:10-31}所示，假设\{$y_1,..., y_n$\}表示生成的目标语言序列，且$k=3$，则束搜索的具体过程为：在预测第一个位置时，可以通过模型得到$y_1$的概率分布，选取概率最大的前3个单词作为候选结果（假设分别为“have”, “has”, “it”）。在预测第二个位置的单词时，模型针对已经得到的三个候选结果（“have”, “has”, “it”）计算第二个单词的概率分布。因为$y_2$对应$|V|$种可能，总共可以得到$3 \times |V|$种结果。然后从中选取使序列概率$\funp{P}(y_2,y_1| \seq{{x}})$最大的前三个$y_2$作为新的输出结果，这样便得到了前两个位置的top-3译文。在预测其他位置时也是如此，不断重复此过程直到推断结束。可以看到，束搜索的搜索空间大小与束宽度有关，也就是：束宽度越大，搜索空间越大，更有可能搜索到质量更高的译文，但同时搜索会更慢。束宽度等于3，意味着每次只考虑三个最有可能的结果，贪婪搜索实际上便是束宽度为1的情况。在神经机器翻译系统实现中，一般束宽度设置在4～8之间。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -1252,7 +1252,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) &=& \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j
 \vspace{0.5em}
 \item 循环神经网络有很多变种结构。比如，除了RNN、LSTM、GRU，还有其他改进的循环单元结构，如LRN\upcite{DBLP:journals/corr/abs-1905-13324}、SRU\upcite{Lei2017TrainingRA}、ATR\upcite{Zhang2018SimplifyingNM}。
 \vspace{0.5em}
-\item 注意力机制的使用是机器翻译乃至整个自然语言处理近几年获得成功的重要因素之一\upcite{bahdanau2014neural,DBLP:journals/corr/LuongPM15}。早期，有研究者尝试将注意力机制和统计机器翻译的词对齐进行统一\upcite{WangNeural,He2016ImprovedNM,li-etal-2019-word}。最近，也有大量的研究工作对注意力机制进行改进，比如，使用自注意力机制构建翻译模型等\upcite{vaswani2017attention}。而对注意力模型的改进也成为了自然语言处理中的热点问题之一。在{\chapterfifteen}会对机器翻译中不同注意力模型进行进一步讨论。
+\item 注意力机制的使用是机器翻译乃至整个自然语言处理近几年获得成功的重要因素之一\upcite{bahdanau2014neural,DBLP:journals/corr/LuongPM15}。早期，有研究人员尝试将注意力机制和统计机器翻译的词对齐进行统一\upcite{WangNeural,He2016ImprovedNM,li-etal-2019-word}。最近，也有大量的研究工作对注意力机制进行改进，比如，使用自注意力机制构建翻译模型等\upcite{vaswani2017attention}。而对注意力模型的改进也成为了自然语言处理中的热点问题之一。在{\chapterfifteen}会对机器翻译中不同注意力模型进行进一步讨论。
 \vspace{0.5em}
 \item 一般来说，神经机器翻译的计算过程是没有人工干预的，翻译流程也无法用人类的知识直接进行解释，因此一个有趣的方向是在神经机器翻译中引入先验知识，使得机器翻译的行为更“像”人。比如，可以使用句法树来引入人类的语言学知识\upcite{Yang2017TowardsBH,Wang2019TreeTI}，基于句法的神经机器翻译也包含大量的树结构的神经网络建模\upcite{DBLP:journals/corr/abs-1809-01854,DBLP:journals/corr/abs-1808-09374}。此外，也可以把用户定义的词典或者翻译记忆加入到翻译过程中\upcite{DBLP:journals/corr/ZhangZ16c,zhang-etal-2017-prior,duan-etal-2020-bilingual,cao-xiong-2018-encoding}，使得用户的约束可以直接反映到机器翻译的结果上来。先验知识的种类还有很多，包括词对齐\upcite{li-etal-2019-word,DBLP:conf/emnlp/MiWI16,DBLP:conf/coling/LiuUFS16}、 篇章信息\upcite{Werlen2018DocumentLevelNM,DBLP:journals/corr/abs-1805-10163,DBLP:conf/acl/LiLWJXZLL20} 等等，都是神经机器翻译中能够使用的信息。
 \end{itemize}
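
Review note on the beam search paragraph changed in the Chapter10 hunk above (illustrative only, not part of this commit). A minimal sketch of the procedure that paragraph describes, assuming a hypothetical toy function `next_log_probs` in place of a trained model's distribution P(y_j | y_<j, x). The vocabulary and probabilities are invented for illustration, and the length penalty mentioned for GNMT is left out.

```python
import math

# Hypothetical toy "model": log P(next word | prefix) over a tiny vocabulary.
# A real NMT system would compute this with the decoder and Softmax.
def next_log_probs(prefix):
    if not prefix:
        probs = {"have": 0.4, "has": 0.3, "it": 0.2, "<eos>": 0.1}
    elif prefix[-1] == "have":
        probs = {"have": 0.1, "has": 0.1, "it": 0.5, "<eos>": 0.3}
    else:
        probs = {"have": 0.1, "has": 0.1, "it": 0.1, "<eos>": 0.7}
    return {w: math.log(p) for w, p in probs.items()}


def beam_search(step_fn, beam_width=3, max_len=5, eos="<eos>"):
    """Keep the beam_width highest-scoring prefixes at every position."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            # Expand every kept prefix by all |V| words: k * |V| candidates.
            for word, logp in step_fn(prefix).items():
                candidates.append((prefix + (word,), score + logp))
        # Prune: keep only the top-k candidates by sequence log-probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])


best, best_score = beam_search(next_log_probs, beam_width=3)
greedy, greedy_score = beam_search(next_log_probs, beam_width=1)
```

With this toy distribution, a beam width of 3 returns `("has", "<eos>")` with probability 0.21, while a beam width of 1 (greedy search) commits to "have" first and ends with `("have", "it", "<eos>")` at probability 0.14. This is the trade-off the paragraph describes: a wider beam explores more of the search space at a higher cost.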

--- a/Chapter11/chapter11.tex
+++ b/Chapter11/chapter11.tex
@@ -266,9 +266,9 @@
 \subsection{位置编码}
 \label{sec:11.2.1}
-\parinterval 与基于循环神经网络的翻译模型类似，基于卷积神经网络的翻译模型同样用词嵌入序列来表示输入序列，记为$\seq{w}=\{\mathbi{w}_1,\mathbi{w}_2,...,\mathbi{w}_m\}$。序列$\seq{w}$ 是维度大小为$m \times d$的矩阵，第$i$个单词$\mathbi{w}_i$是维度为$d$的向量，其中$m$为序列长度，$d$为词嵌入向量维度。和循环神经网络不同的是，基于卷积神经网络的模型需要对每个输入单词位置进行表示。这是由于，在卷积神经网络中，受限于卷积核的大小，单层的卷积神经网络只能捕捉序列局部的相对位置信息。虽然多层的卷积神经网络可以扩大感受野，但是对全局的位置表示并不充分。而相较于基于卷积神经网络的模型，基于循环神经网络的模型按时间步对输入的序列进行建模，这样间接的对位置信息进行了建模。而词序又是自然语言处理任务中重要信息，因此这里需要单独考虑。
+\parinterval 与基于循环神经网络的翻译模型类似，基于卷积神经网络的翻译模型同样用词嵌入序列来表示输入序列，记为$\seq{w}=\{\mathbi{w}_1,...,\mathbi{w}_m\}$。序列$\seq{w}$ 是维度大小为$m \times d$的矩阵，第$i$个单词$\mathbi{w}_i$是维度为$d$的向量，其中$m$为序列长度，$d$为词嵌入向量维度。和循环神经网络不同的是，基于卷积神经网络的模型需要对每个输入单词位置进行表示。这是由于，在卷积神经网络中，受限于卷积核的大小，单层的卷积神经网络只能捕捉序列局部的相对位置信息。虽然多层的卷积神经网络可以扩大感受野，但是对全局的位置表示并不充分。而相较于基于卷积神经网络的模型，基于循环神经网络的模型按时间步对输入的序列进行建模，这样间接的对位置信息进行了建模。而词序又是自然语言处理任务中重要信息，因此这里需要单独考虑。
-\parinterval 为了更好地引入序列的词序信息，该模型引入了位置编码$\seq{p}=\{\mathbi{p}_1,\mathbi{p}_2,...,\mathbi{p}_m\}$，其中$\mathbi{p}_i$的维度大小为$d$，一般和词嵌入维度相等，其中具体数值作为网络可学习的参数。简单来说，$\mathbi{p}_i$是一个可学习的参数向量，对应位置$i$的编码。这种编码的作用就是对位置信息进行表示，不同序列中的相同位置都对应一个唯一的位置编码向量。之后将词嵌入矩阵和位置编码进行相加，得到模型的输入序列$\seq{e}=\{\mathbi{w}_1+\mathbi{p}_1,\mathbi{w}_2+\mathbi{p}_2,...,\mathbi{w}_m+\mathbi{p}_m\}$。 也有研究人员发现卷积神经网络本身具备一定的编码位置信息的能力\upcite{Islam2020HowMP}，而这里额外的位置编码模块可以被看作是对卷积神经网络位置编码能力的一种补充。
+\parinterval 为了更好地引入序列的词序信息，该模型引入了位置编码$\seq{p}=\{\mathbi{p}_1,...,\mathbi{p}_m\}$，其中$\mathbi{p}_i$的维度大小为$d$，一般和词嵌入维度相等，其中具体数值作为网络可学习的参数。简单来说，$\mathbi{p}_i$是一个可学习的参数向量，对应位置$i$的编码。这种编码的作用就是对位置信息进行表示，不同序列中的相同位置都对应一个唯一的位置编码向量。之后将词嵌入矩阵和位置编码进行相加，得到模型的输入序列$\seq{e}=\{\mathbi{w}_1+\mathbi{p}_1,...,\mathbi{w}_m+\mathbi{p}_m\}$。 也有研究人员发现卷积神经网络本身具备一定的编码位置信息的能力\upcite{Islam2020HowMP}，而这里额外的位置编码模块可以被看作是对卷积神经网络位置编码能力的一种补充。
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -461,7 +461,7 @@
 \subsection{深度可分离卷积}
 \label{sec:11.3.1}
-\parinterval 根据前面的介绍，可以看到卷积神经网络容易用于局部检测和处理位置不变的特征。对于特定的表达，比如地点、情绪等，使用卷积神经网络能达到不错的识别效果，因此它常被用在文本分类中\upcite{Kalchbrenner2014ACN,Kim2014ConvolutionalNN,DBLP:conf/naacl/Johnson015,DBLP:conf/acl/JohnsonZ17}。不过机器翻译所面临的情况更复杂，除了局部句子片段信息，我们还希望模型能够捕获句子结构、语义等信息。虽然单层卷积神经网络在文本分类中已经取得了很好的效果\upcite{Kim2014ConvolutionalNN}，但是神经机器翻译等任务仍然需要有效的卷积神经网络。随着深度可分离卷积在机器翻译中的探索\upcite{Kaiser2018DepthwiseSC}，更高效的网络结构被设计出来，获得了比ConvS2S模型更好的性能。
+\parinterval 根据前面的介绍，可以看到卷积神经网络容易用于局部检测和处理位置不变的特征。对于特定的表达，比如地点、情绪等，使用卷积神经网络能达到不错的识别效果，因此它常被用在文本分类中\upcite{Kalchbrenner2014ACN,Kim2014ConvolutionalNN,DBLP:conf/naacl/Johnson015,DBLP:conf/acl/JohnsonZ17}。不过机器翻译所面临的情况更复杂，除了局部句子片段信息，研究人员还希望模型能够捕获句子结构、语义等信息。虽然单层卷积神经网络在文本分类中已经取得了很好的效果\upcite{Kim2014ConvolutionalNN}，但是神经机器翻译等任务仍然需要有效的卷积神经网络。随着深度可分离卷积在机器翻译中的探索\upcite{Kaiser2018DepthwiseSC}，更高效的网络结构被设计出来，获得了比ConvS2S模型更好的性能。
 %----------------------------------------------
 % 图17.
@@ -475,7 +475,7 @@
 \parinterval 深度可分离卷积由深度卷积和逐点卷积两部分结合而成\upcite{sifre2014rigid}。图\ref{fig:11-17}对比了标准卷积、深度卷积和逐点卷积，为了方便显示，图中只画出了部分连接。
-\parinterval 给定输入序列表示$\seq{x} = \{ \mathbi{x}_1,\mathbi{x}_2,...,\mathbi{x}_m \}$，其中$m$为序列长度，$\mathbi{x}_i \in \mathbb{R}^{O} $ ，$O$ 即输入序列的通道数。为了获得与输入序列长度相同的卷积输出结果，首先需要进行填充。为了方便描述，这里在输入序列尾部填充 $K-1$ 个元素（$K$为卷积核窗口的长度），其对应的卷积结果为$\seq{z} = \{ \mathbi{z}_1,\mathbi{z}_2,...,\mathbi{z}_m \}$。
+\parinterval 给定输入序列表示$\seq{x} = \{ \mathbi{x}_1,...,\mathbi{x}_m \}$，其中$m$为序列长度，$\mathbi{x}_i \in \mathbb{R}^{O} $ ，$O$ 即输入序列的通道数。为了获得与输入序列长度相同的卷积输出结果，首先需要进行填充。为了方便描述，这里在输入序列尾部填充 $K-1$ 个元素（$K$为卷积核窗口的长度），其对应的卷积结果为$\seq{z} = \{ \mathbi{z}_1,...,\mathbi{z}_m \}$。
 在标准卷积中，若使用N表示卷积核的个数，也就是标准卷积输出序列的通道数，那么对于第$i$个位置的第$n$个通道$ \mathbi{z}_{i,n}^\textrm{\,std}$，其标准卷积具体计算如下：
 \begin{eqnarray}
 \mathbi{z}_{i,n}^\textrm{\,std} &=& \sum_{o=1}^{O} \sum_{k=0}^{K-1} \mathbi{W}_{k,o,n}^\textrm{\,std} \mathbi{x}_{i+k,o}
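
Review note on the standard convolution formula shown as context in the Chapter11 hunk above (illustrative only, not part of this commit). A minimal sketch of that sum, assuming nested Python lists for the input and kernel, 0-based indices instead of the 1-based indices of the text, and zero vectors as the K-1 padding elements. The text only says that K-1 elements are padded at the tail, so the zero padding is an assumption.

```python
def standard_conv(x, W):
    """x: m x O input (list of lists); W: K x O x N kernel. Returns m x N output."""
    m, num_in = len(x), len(x[0])
    K, num_out = len(W), len(W[0][0])
    # Pad K-1 zero vectors at the tail so the output keeps length m (assumption: zeros).
    padded = x + [[0.0] * num_in for _ in range(K - 1)]
    z = [[0.0] * num_out for _ in range(m)]
    for i in range(m):
        for n in range(num_out):
            # z[i][n] = sum over input channels o and window offsets k
            #           of W[k][o][n] * x[i + k][o]
            z[i][n] = sum(
                W[k][o][n] * padded[i + k][o]
                for o in range(num_in)
                for k in range(K)
            )
    return z


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # m = 3 positions, O = 2 channels
W = [[[1.0], [0.0]], [[0.0], [1.0]]]      # K = 2, O = 2, N = 1
z = standard_conv(x, W)                   # z == [[5.0], [9.0], [5.0]]
```

The kernel here picks channel 0 of the current position and channel 1 of the next one, so each output is `x[i][0] + x[i+1][1]`; the last position sees the padded zero vector.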

--- a/Chapter12/chapter12.tex
+++ b/Chapter12/chapter12.tex
@@ -319,7 +319,7 @@
 \subsection{多头注意力机制}
-\parinterval Transformer中使用的另一项重要技术是{\small\sffamily\bfseries{多头注意力机制}}\index{多头注意力机制}（Multi-head Attention）\index{Multi-head Attention}。“多头”可以理解成将原来的$\mathbi{Q}$、$\mathbi{K}$、$\mathbi{V}$按照隐层维度平均切分成多份。假设切分$h$份，那么最终会得到$\mathbi{Q} = \{ \mathbi{Q}_1, \mathbi{Q}_2,...,\mathbi{Q}_h \}$，$\mathbi{K}=\{ \mathbi{K}_1,\mathbi{K}_2,...,\mathbi{K}_h \}$，$\mathbi{V}=\{ \mathbi{V}_1, \mathbi{V}_2,...,\mathbi{V}_h \}$。多头注意力就是用每一个切分得到的$\mathbi{Q}$，$\mathbi{K}$，$\mathbi{V}$独立的进行注意力计算，即第$i$个头的注意力计算结果$\mathbi{head}_i = \textrm{Attention}(\mathbi{Q}_i,\mathbi{K}_i, \mathbi{V}_i)$。
+\parinterval Transformer中使用的另一项重要技术是{\small\sffamily\bfseries{多头注意力机制}}\index{多头注意力机制}（Multi-head Attention）\index{Multi-head Attention}。“多头”可以理解成将原来的$\mathbi{Q}$、$\mathbi{K}$、$\mathbi{V}$按照隐层维度平均切分成多份。假设切分$h$份，那么最终会得到$\mathbi{Q} = \{ \mathbi{Q}_1,...,\mathbi{Q}_h \}$，$\mathbi{K}=\{ \mathbi{K}_1,...,\mathbi{K}_h \}$，$\mathbi{V}=\{ \mathbi{V}_1,...,\mathbi{V}_h \}$。多头注意力就是用每一个切分得到的$\mathbi{Q}$，$\mathbi{K}$，$\mathbi{V}$独立的进行注意力计算，即第$i$个头的注意力计算结果$\mathbi{head}_i = \textrm{Attention}(\mathbi{Q}_i,\mathbi{K}_i, \mathbi{V}_i)$。
 \parinterval 下面根据图\ref{fig:12-12}详细介绍多头注意力的计算过程：
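
Review note on the multi-head attention paragraph changed in the Chapter12 hunk above (illustrative only, not part of this commit). A minimal sketch of splitting Q, K and V into h equal parts along the hidden dimension and running attention independently on each part. It assumes the standard scaled dot-product form of Attention, omits the learned linear projections of the full Transformer, and concatenates the heads at the end. The helper names `split_heads` and `multi_head` are hypothetical.

```python
import math


def softmax(row):
    mx = max(row)
    exps = [math.exp(v - mx) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


def split_heads(M, h):
    """Split each row of M into h equal slices along the hidden dimension."""
    d = len(M[0])
    assert d % h == 0, "hidden size must be divisible by the number of heads"
    size = d // h
    return [[row[i * size:(i + 1) * size] for row in M] for i in range(h)]


def multi_head(Q, K, V, h):
    # head_i = Attention(Q_i, K_i, V_i), computed independently per slice.
    heads = [attention(q, k, v)
             for q, k, v in zip(split_heads(Q, h), split_heads(K, h), split_heads(V, h))]
    # Concatenate the h head outputs back along the hidden dimension.
    return [sum((head[t] for head in heads), []) for t in range(len(Q))]


Q = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
out = multi_head(Q, Q, Q, h=2)  # 2 positions, hidden size 4, 2 heads of size 2
```

Each head only sees its own slice of the hidden dimension, so the output keeps the input shape (here 2 x 4) while the attention weights are computed separately per head.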

--- a/Chapter13/chapter13.tex
+++ b/Chapter13/chapter13.tex
@@ -942,10 +942,10 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\tilde{\seq{y}} | \seq{x})
 \begin{itemize}
 \vspace{0.5em}
-\item 对抗样本除了用于提高模型的健壮性之外，还有很多其他的应用场景，比如评估模型。通过构建由对抗样本构造的数据集，可以验证模型对于不同类型噪声健壮性\upcite{DBLP:conf/emnlp/MichelN18}。但是在生成对抗样本时常常要注意或考虑很多问题，比如扰动是否足够细微，在人类难以察觉的同时做到欺骗模型的目的，对抗样本在不同的模型结构或数据集上是否具有足够的泛化能力。生成的方法是否足够高效等等。（{\color{red}参考文献是不是有些少？加个2-3篇？} ）
+\item 对抗样本除了用于提高模型的健壮性之外，还有很多其他的应用场景，比如评估模型。通过构建由对抗样本构造的数据集，可以验证模型对于不同类型噪声健壮性\upcite{DBLP:conf/emnlp/MichelN18}。但是在生成对抗样本时常常要注意或考虑很多问题，比如扰动是否足够细微\upcite{DBLP:conf/cvpr/Moosavi-Dezfooli16,DBLP:conf/cvpr/NguyenYC15}，在人类难以察觉的同时做到欺骗模型的目的，对抗样本在不同的模型结构或数据集上是否具有足够的泛化能力\upcite{DBLP:conf/iclr/LiuCLS17,DBLP:journals/tnn/YuanHZL19}。生成的方法是否足够高效等等\upcite{DBLP:conf/emnlp/JiaL17,DBLP:conf/infocom/YuanHL020}。
 \vspace{0.5em}
-\item 在机器翻译中，强化学习的应用还有很多，比如，MIXER算法用混合策略梯度和极大似然估计的目标函数来更新模型{\red Sequence Level Training with Recurrent Neural Networks}，DAgger{\red A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning}以及DAD{\red Improving Multi-step Prediction of Learned Time Series Models}等算法在训练过程之中逐渐让模型适应推断阶段的模式。此外，强化学习的效果目前还相当不稳定，研究人员提出了大量的方法来进行改善，比如降低方差{\red An Actor-Critic Algorithm for Sequence Prediction;Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback}、使用单语语料{\red Improving Neural Machine Translation Models with Monolingual Data;A Study of Reinforcement Learning for Neural Machine Translation}等等。由于强化学习能从反馈的奖励中学习的特性，有不少研究探究如何在交互式场景中使用强化学习来提升系统性能。典型的例子就是对话系统，人类的反馈可以被用来训练系统，例如small-talk{\red A Deep Reinforcement Learning Chatbot}以及面向任务的对话{\red 	Continuously Learning Neural Dialogue Management}。
+\item 在机器翻译中，强化学习的应用还有很多，比如，MIXER算法用混合策略梯度和极大似然估计的目标函数来更新模型\upcite{Ranzato2016SequenceLT}，DAgger\upcite{DBLP:journals/jmlr/RossGB11}以及DAD\upcite{DBLP:conf/aaai/VenkatramanHB15}等算法在训练过程之中逐渐让模型适应推断阶段的模式。此外，强化学习的效果目前还相当不稳定，研究人员提出了大量的方法来进行改善，比如降低方差\upcite{DBLP:conf/iclr/BahdanauBXGLPCB17,DBLP:conf/emnlp/NguyenDB17}、使用单语语料\upcite{Sennrich2016ImprovingNM,DBLP:conf/emnlp/WuTQLL18}等等。由于强化学习能从反馈的奖励中学习的特性，有不少研究探究如何在交互式场景中使用强化学习来提升系统性能。典型的例子就是对话系统，人类的反馈可以被用来训练系统，例如small-talk\upcite{DBLP:journals/corr/abs-1709-02349}以及面向任务的对话\upcite{DBLP:journals/corr/SuGMRUVWY16a}。
 \vspace{0.5em}
 \item 从广义上说，大多数课程学习方法都是遵循由易到难的原则，然而在实践过程中人们逐渐赋予了课程学习更多的内涵，课程学习的含义早已超越了最原始的定义。一方面，课程学习可以与许多任务相结合，此时，评估准则并不一定总是样本的困难度，这取决于具体的任务。另一方面，在一些任务或数据中，由易到难并不总是有效，有时困难优先反而会取得更好的效果\upcite{DBLP:conf/medprai/SurendranathJ18,zhang2018empirical}，实际上这和我们的直觉不太符合，一种合理的解释是课程学习更适合标签噪声、离群值较多或者是目标任务困难的场景，能提高模型的健壮性和收敛速度，而困难优先的策略则更适合数据集干净的场景\upcite{DBLP:conf/nips/ChangLM17}。

--- a/Chapter15/Figures/figure-encoder-of-bidirectional-tree-structure.png
+++ b/Chapter15/Figures/figure-encoder-of-bidirectional-tree-structure.png
--- a/Chapter15/Figures/figure-encoder-tree-structure-modeling.png
+++ b/Chapter15/Figures/figure-encoder-tree-structure-modeling.png
--- a/Chapter15/Figures/figure-evolution-and-change-of-ml-methods.jpg
+++ b/Chapter15/Figures/figure-evolution-and-change-of-ml-methods.jpg
--- a/Chapter15/Figures/figure-evolution-and-change-of-ml-methods.tex
+++ b/Chapter15/Figures/figure-evolution-and-change-of-ml-methods.tex
 \begin{tikzpicture}
-\node[rounded corners=4pt, minimum width=10.4em, minimum height=7em,fill=yellow!15!gray!15] (box1) at (0em,0em){};
+\tikzstyle{opnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=4em,rounded corners=5pt,fill=teal!17]
-\node[anchor=west,rounded corners=4pt, minimum width=10.4em, minimum height=7em,fill=yellow!15!gray!15] (box2) at ([xshift=2.8em]box1.east){};
+\tikzstyle{cnode}=[circle,draw,minimum size=1.2em]
-\node[anchor=west,rounded corners=4pt, minimum width=10.4em, minimum height=7em,fill=yellow!15!gray!15] (box3) at ([xshift=2.8em]box2.east){};
+\tikzstyle{mnode}=[rectangle,inner sep=0mm,minimum height=5em,minimum width=11em,rounded corners=5pt,fill=yellow!20]
+\tikzstyle{wnode}=[inner sep=0mm,minimum height=1.5em]
-\draw[densely dotted,line width=1.2pt] ([xshift=0.8em]box1.90) -- ([xshift=0.8em]box1.-90);
-\draw[densely dotted,line width=1.2pt] ([xshift=0.8em]box2.90) -- ([xshift=0.8em]box2.-90);
+\begin{pgfonlayer}{background}
-\draw[densely dotted,line width=1.2pt] ([xshift=0.8em]box3.90) -- ([xshift=0.8em]box3.-90);
+\node[anchor=west,mnode] (m1) at (0em,0em){};
+\node[anchor=west,mnode] (m2) at ([xshift=1em,yshift=0em]m1.east){};
-\node[anchor=west,draw,rounded corners=2pt, minimum width=5em, minimum height=3em,font=\scriptsize,align=center,inner sep=1pt,fill=yellow!10] (n1) at ([xshift=0.5em]box1.west){机器学习算法：\\决策树、支持 \\ 向量机$\cdots$};
+\node[anchor=west,mnode] (m3) at ([xshift=1em,yshift=0em]m2.east){};
-\node[anchor=west,draw,rounded corners=2pt, minimum width=5em, minimum height=3em,font=\scriptsize,align=center,inner sep=1pt,fill=yellow!10] (n2) at ([xshift=0.5em]box2.west){神经网络：\\RNN、CNN、 \\ Transformer$\cdots$};
-\node[anchor=west,draw,rounded corners=2pt, minimum width=5em, minimum height=3em,font=\scriptsize,align=center,inner sep=1pt,fill=ugreen!10] (n3) at ([xshift=0.5em]box3.west){神经网络：\\RNN、CNN、 \\ Transformer$\cdots$};
+\node[anchor=north west,rectangle,inner sep=0mm,minimum height=2.6em,minimum width=3.5em,rounded corners=5pt,fill=blue!20] (ml1) at ([xshift=0em,yshift=-0.5em]m1.south west){};
+\node[anchor=west,rectangle,inner sep=0mm,minimum height=2.6em,minimum width=3.5em,rounded corners=5pt,fill=ugreen!20] (ml2) at ([xshift=0.25em,yshift=0em]ml1.east){};
-\foreach \x/\c in {1/yellow,2/ugreen,3/ugreen}{
+\node[anchor=north east,rectangle,inner sep=0mm,minimum height=2.6em,minimum width=3.5em,rounded corners=5pt,fill=red!20] (ml3) at ([xshift=0em,yshift=-0.5em]m1.south east){};
-\node[anchor=north,font=\scriptsize,inner ysep=0.1em] (output_\x)at ([xshift=-2.2em,yshift=-0.5em]box\x.north){输出};
-\node[anchor=north,inner ysep=0.1em] at ([xshift=3em,yshift=-0.5em]box\x.north){\scriptsize\bfnew{执行步骤}};
+\node[anchor=north west,rectangle,inner sep=0mm,minimum height=2.6em,minimum width=5.25em,rounded corners=5pt,fill=blue!20] (mc1) at ([xshift=0em,yshift=-0.5em]m2.south west){};
-\node[anchor=south,font=\scriptsize,inner ysep=0.1em,fill=\c!10,rounded corners=2pt] at ([xshift=-2.2em,yshift=0.5em]box\x.south)(input_\x){输入};
+\node[anchor=north east,rectangle,inner sep=0mm,minimum height=2.6em,minimum width=5.25em,rounded corners=5pt,fill=red!20] (mc2) at ([xshift=0em,yshift=-0.5em]m2.south east){};
-\draw[->,thick] (input_\x.90) -- (n\x.-90);
-\draw[->,thick] (n\x.90) -- (output_\x.-90);
+\node[anchor=north,rectangle,inner sep=0mm,minimum height=2.6em,minimum width=11em,rounded corners=5pt,fill=blue!20] (mr1) at ([xshift=0em,yshift=-0.5em]m3.south){};
+\end{pgfonlayer}
+{\scriptsize
+\node[anchor=south,opnode] (op1) at ([xshift=0em,yshift=1em]m1.north){输出};
+\node[anchor=south,opnode] (op2) at ([xshift=0em,yshift=1em]m2.north){输出};
+\node[anchor=south,opnode] (op3) at ([xshift=0em,yshift=1em]m3.north){输出};
+\node[anchor=north west,wnode,font=\footnotesize,align=left] (w1) at ([xshift=0.3em,yshift=-0.3em]m1.north west){传统机器\\学习};
+\node[anchor=north west,wnode,font=\footnotesize] (w2) at ([xshift=0.3em,yshift=-0.3em]m2.north west){深度学习};
+\node[anchor=north west,wnode,align=left] (w3) at ([xshift=0.3em,yshift=-0.3em]m3.north west){深度学习和网\\络结构搜索};
+{%subfigure-left
+\node[anchor=north,wnode,font=\footnotesize] (wl1) at ([xshift=0em,yshift=0em]ml1.north){Training data};
+\node[anchor=north,wnode,font=\footnotesize] (wl2) at ([xshift=0em,yshift=0em]ml2.north){Features};
+\node[anchor=north,wnode,font=\footnotesize] (wl3) at ([xshift=0em,yshift=0em]ml3.north){Model structure};
+\node[anchor=south,wnode,font=\tiny] (wl4) at ([xshift=0em,yshift=0em]ml1.south){Manual/automatic collection};
+\node[anchor=south,wnode] (wl5) at ([xshift=0em,yshift=0em]ml2.south){Manual design};
+\node[anchor=south,wnode] (wl6) at ([xshift=0em,yshift=0em]ml3.south){Manual design};
+\node[anchor=south,cnode,fill=white] (cl1) at ([xshift=-4em,yshift=1.5em]m1.south){};
+\node[anchor=north,cnode,fill=white] (cl2) at ([xshift=0em,yshift=-1em]m1.north){};
+\node[anchor=south west,wnode,align=left,font=\tiny] (wl7) at ([xshift=0.5em,yshift=0em]cl1.east){Use {\color{ugreen!60}features} to extract\\information from the {\color{blue!60}data}};
+\node[anchor=west,wnode,align=right,font=\tiny] (wl8) at ([xshift=0.5em,yshift=0em]cl2.east){Use the extracted\\information to train the\\{\color{red!50}model} parameters};
+\draw [-,thick,dotted] ([xshift=0em,yshift=0em]ml1.west) -- ([xshift=0em,yshift=0em]ml1.east);
+\draw [-,thick,dotted] ([xshift=0em,yshift=0em]ml2.west) -- ([xshift=0em,yshift=0em]ml2.east);
+\draw [-,thick,dotted] ([xshift=0em,yshift=0em]ml3.west) -- ([xshift=0em,yshift=0em]ml3.east);
+\draw[->,thick] ([xshift=-1.5em,yshift=-0em]ml1.north)..controls +(north:3em) and +(west:0em)..([xshift=-0em,yshift=-0em]cl1.west) ;
+\draw[->,thick] ([xshift=0em,yshift=-0em]ml2.north)..controls +(north:3em) and +(west:0em)..([xshift=-0em,yshift=-0em]cl1.east) ;
+\draw[->,thick] ([xshift=0em,yshift=-0em]cl1.north)..controls +(north:2em) and +(west:0em)..([xshift=-0em,yshift=-0em]cl2.west) ;
+\draw[->,thick] ([xshift=0em,yshift=-0em]ml3.north)..controls +(north:6em) and +(west:0em)..([xshift=-0em,yshift=-0em]cl2.east) ;
+\draw [->,thick] ([xshift=0em,yshift=0em]cl2.north) -- ([xshift=0em,yshift=0em]op1.south);
+}
+{%subfigure-center
+\node[anchor=north,wnode,font=\footnotesize] (wc1) at ([xshift=0em,yshift=0em]mc1.north){Training data};
+\node[anchor=north,wnode,font=\footnotesize] (wc2) at ([xshift=0em,yshift=0em]mc2.north){Model structure};
+\node[anchor=south,wnode] (wc3) at ([xshift=0em,yshift=0em]mc1.south){Manual/automatic collection};
+\node[anchor=south,wnode] (wc4) at ([xshift=0em,yshift=0em]mc2.south){Manual design};
+\node[anchor=south,cnode,fill=white] (cc1) at ([xshift=-4em,yshift=1.5em]m2.south){};
+\node[anchor=north,cnode,fill=white] (cc2) at ([xshift=0em,yshift=-1em]m2.north){};
+\node[anchor=south west,wnode,align=left,font=\tiny] (wc5) at ([xshift=0.5em,yshift=0em]cc1.east){Use the {\color{red!60}model} to extract\\information from the {\color{blue!60}data}};
+\node[anchor=west,wnode,align=right,font=\tiny] (wc6) at ([xshift=0.5em,yshift=0em]cc2.east){Use the extracted\\information to train the\\{\color{red!60}model} parameters};
+\draw [-,thick,dotted] ([xshift=0em,yshift=0em]mc1.west) -- ([xshift=0em,yshift=0em]mc1.east);
+\draw [-,thick,dotted] ([xshift=0em,yshift=0em]mc2.west) -- ([xshift=0em,yshift=0em]mc2.east);
+\draw[->,thick] ([xshift=-2em,yshift=-0em]mc1.north)..controls +(north:3em) and +(west:0em)..([xshift=-0em,yshift=-0em]cc1.west) ;
+\draw[->,thick] ([xshift=0em,yshift=-0em]mc2.north)..controls +(north:2em) and +(west:0em)..([xshift=-0em,yshift=-0em]cc1.east) ;
+\draw[->,thick] ([xshift=0em,yshift=-0em]cc1.north)..controls +(north:2em) and +(west:0em)..([xshift=-0em,yshift=-0em]cc2.west) ;
+\draw[->,thick] ([xshift=0em,yshift=-0em]mc2.north)..controls +(north:6em) and +(west:0em)..([xshift=-0em,yshift=-0em]cc2.east) ;
+\draw [->,thick] ([xshift=0em,yshift=0em]cc2.north) -- ([xshift=0em,yshift=0em]op2.south);
 }
-\node[anchor=east,font=\scriptsize,align=center,inner xsep=0pt] at ([xshift=-0.2em]box1.east){1. Feature extraction;\\2. Model design;\\3. Experimental validation.};
+{%subfigure-right
-\node[anchor=east,font=\scriptsize,align=center,inner xsep=0pt] at ([xshift=-0.2em]box2.east){1. Model design;\\2. Experimental validation.\\ };
+\node[anchor=north,wnode,font=\footnotesize] (wr1) at ([xshift=0em,yshift=0em]mr1.north){Training data};
-\node[anchor=east,font=\scriptsize,align=center,inner xsep=0pt] at ([xshift=-0.2em]box3.east){1. Experimental validation. \\ \\};
+\node[anchor=south,wnode] (wr2) at ([xshift=0em,yshift=0em]mr1.south){Manual/automatic collection};
+\node[anchor=south,cnode,fill=white] (cr1) at ([xshift=-2.5em,yshift=2.8em]m3.south){};
+\node[anchor=north,cnode,fill=white] (cr2) at ([xshift=0em,yshift=-1em]m3.north){};
+\node[anchor=south,cnode,fill=white] (cr3) at ([xshift=-5.8em,yshift=0.7em]m3.south){};
+\node[anchor=north,wnode,align=right,font=\tiny] (wr3) at ([xshift=1em,yshift=-0.5em]cr2.south){Use the {\color{red!60}model}\\to extract\\information from\\the {\color{blue!60}data}};
+\node[anchor=west,wnode,align=right,font=\tiny] (wr4) at ([xshift=0.5em,yshift=0em]cr2.east){Use the extracted\\information to train the\\{\color{red!60}model} parameters};
+\node[anchor=west,wnode,align=left,font=\tiny] (wr5) at ([xshift=0.2em,yshift=0em]cr3.east){Use the {\color{blue!60}data} to search\\the {\color{red!60}model} structure};
+\draw [-,thick,dotted] ([xshift=0em,yshift=0em]mr1.west) -- ([xshift=0em,yshift=0em]mr1.east);
+\draw[->,thick] ([xshift=-5.8em,yshift=0em]mr1.north) -- ([xshift=0em,yshift=0em]cr3.south);
+\draw[->,thick] ([xshift=0em,yshift=-0em]cr3.north)..controls +(north:1.3em) and +(west:0em)..([xshift=-0em,yshift=-0em]cr1.west) ;
+\draw[->,thick] ([xshift=1em,yshift=-0em]mr1.north)..controls +(north:4em) and +(west:0em)..([xshift=-0em,yshift=-0em]cr1.east) ;
+\draw[->,thick] ([xshift=0em,yshift=-0em]cr1.north east)..controls +(north:1em) and +(west:0em)..([xshift=-0em,yshift=-0em]cr2.west) ;
+\draw[->,thick] ([xshift=5.7em,yshift=-0em]mr1.north)..controls +(north:6em) and +(west:0em)..([xshift=-0em,yshift=-0em]cr2.east) ;
+\draw[->,thick] ([xshift=0em,yshift=0em]cr2.north) -- ([xshift=0em,yshift=0em]op3.south);
+}
+}
-\node [draw,thick,anchor=west,single arrow,minimum height=1.6em,single arrow head extend=0.4em] at ([xshift=0.6em]box1.east) {};
-\node [draw,thick,anchor=west,single arrow,minimum height=1.6em,single arrow head extend=0.4em] at ([xshift=0.6em]box2.east) {};
-\node[font=\footnotesize, anchor=north] at ([yshift=-0.1em]box1.south){Traditional machine learning};
-\node[font=\footnotesize, anchor=north] at ([yshift=-0.1em]box2.south){Deep learning};
-\node[font=\footnotesize, anchor=north] at ([yshift=-0.1em]box3.south){Deep learning \& network architecture search};
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/Figures/figure-layer-fusion-method-2d.png
+++ b/Chapter15/Figures/figure-layer-fusion-method-2d.png
--- a/Chapter15/Figures/figure-layer-fusion-method-2d.tex
+++ b/Chapter15/Figures/figure-layer-fusion-method-2d.tex
+\begin{tikzpicture}
+\begin{scope}
+\tikzstyle{lnode}=[rectangle,inner sep=0mm,minimum height=1.5em,minimum width=3.5em,rounded corners=2pt,draw]
+\tikzstyle{snode}=[rectangle,inner sep=0mm,minimum height=1.5em,minimum width=0.8em,rounded corners=2pt,draw]
+\tikzstyle{vlnode}=[rectangle,inner sep=0mm,minimum height=1em,minimum width=5em,rounded corners=2pt,draw]
+\node [anchor=west,lnode] (n1) at (0, 0) {$\mathbi{g}^3$};
+\node [anchor=north west,lnode] (n2) at ([xshift=0em,yshift=-0.5em]n1.south west) {$\mathbi{g}^2$};
+\node [anchor=north west,lnode] (n3) at ([xshift=0em,yshift=-0.5em]n2.south west) {$\mathbi{g}^1$};
+\node [anchor=south] (d1) at ([xshift=0em,yshift=0.2em]n1.north) {1D};
+\node [anchor=west,lnode] (n4) at ([xshift=1.2em,yshift=0em]n1.east) {};
+\node [anchor=west,lnode] (n5) at ([xshift=1.2em,yshift=0em]n2.east) {};
+\node [anchor=west,lnode] (n6) at ([xshift=1.2em,yshift=0em]n3.east) {};
+\node [anchor=south,lnode] (n7) at ([xshift=0em,yshift=1em]n4.north) {$\mathbi{W}_1$};
+\node [anchor=west] (sig) at ([xshift=0em,yshift=0.4em]n5.east) {$\sigma$};
+\node [anchor=west,snode,fill=purple!30] (nc11) at ([xshift=1.2em,yshift=0em]n4.east) {};
+\node [anchor=west,snode,fill=yellow!30] (nc12) at ([xshift=0em,yshift=0em]nc11.east) {};
+\node [anchor=west,snode,fill=red!30] (nc13) at ([xshift=0em,yshift=0em]nc12.east) {};
+\node [anchor=west,snode,fill=blue!30] (nc14) at ([xshift=0em,yshift=0em]nc13.east) {};
+\node [anchor=west,snode,font=\footnotesize,fill=ugreen!30] (nc15) at ([xshift=0em,yshift=0em]nc14.east) {$\mathbi{o}_5^3$};
+\node [anchor=west,snode,fill=purple!30] (nc21) at ([xshift=1.2em,yshift=0em]n5.east) {};
+\node [anchor=west,snode,fill=yellow!30] (nc22) at ([xshift=0em,yshift=0em]nc21.east) {};
+\node [anchor=west,snode,fill=red!30] (nc23) at ([xshift=0em,yshift=0em]nc22.east) {};
+\node [anchor=west,snode,fill=blue!30] (nc24) at ([xshift=0em,yshift=0em]nc23.east) {};
+\node [anchor=west,snode,font=\footnotesize,fill=ugreen!30] (nc25) at ([xshift=0em,yshift=0em]nc24.east) {$\mathbi{o}_5^2$};
+\node [anchor=west,snode,fill=purple!30] (nc31) at ([xshift=1.2em,yshift=0em]n6.east) {};
+\node [anchor=west,snode,fill=yellow!30] (nc32) at ([xshift=0em,yshift=0em]nc31.east) {};
+\node [anchor=west,snode,fill=red!30] (nc33) at ([xshift=0em,yshift=0em]nc32.east) {};
+\node [anchor=west,snode,fill=blue!30] (nc34) at ([xshift=0em,yshift=0em]nc33.east) {};
+\node [anchor=west,snode,font=\footnotesize,fill=ugreen!30] (nc35) at ([xshift=0em,yshift=0em]nc34.east) {$\mathbi{o}_5^1$};
+\node [anchor=south,lnode] (n8) at ([xshift=0em,yshift=1em]nc13.north) {$\mathbi{W}_2$};
+\node [anchor=west,font=\footnotesize] (n9) at ([xshift=0.1em,yshift=0.5em]nc25.east) {Softmax};
+\node [anchor=west,snode,fill=purple!30] (ns11) at ([xshift=3.5em,yshift=0em]nc15.east) {};
+\node [anchor=west,snode,fill=yellow!30] (ns12) at ([xshift=0em,yshift=0em]ns11.east) {};
+\node [anchor=west,snode,fill=red!30] (ns13) at ([xshift=0em,yshift=0em]ns12.east) {};
+\node [anchor=west,snode,fill=blue!30] (ns14) at ([xshift=0em,yshift=0em]ns13.east) {};
+\node [anchor=west,snode,font=\tiny,fill=ugreen!30] (ns15) at ([xshift=0em,yshift=0em]ns14.east) {0.3};
+\node [anchor=west,snode,fill=purple!30] (ns21) at ([xshift=3.5em,yshift=0em]nc25.east) {};
+\node [anchor=west,snode,fill=yellow!30] (ns22) at ([xshift=0em,yshift=0em]ns21.east) {};
+\node [anchor=west,snode,fill=red!30] (ns23) at ([xshift=0em,yshift=0em]ns22.east) {};
+\node [anchor=west,snode,fill=blue!30] (ns24) at ([xshift=0em,yshift=0em]ns23.east) {};
+\node [anchor=west,snode,font=\tiny,fill=ugreen!30] (ns25) at ([xshift=0em,yshift=0em]ns24.east) {0.2};
+\node [anchor=west,snode,fill=purple!30] (ns31) at ([xshift=3.5em,yshift=0em]nc35.east) {};
+\node [anchor=west,snode,fill=yellow!30] (ns32) at ([xshift=0em,yshift=0em]ns31.east) {};
+\node [anchor=west,snode,fill=red!30] (ns33) at ([xshift=0em,yshift=0em]ns32.east) {};
+\node [anchor=west,snode,fill=blue!30] (ns34) at ([xshift=0em,yshift=0em]ns33.east) {};
+\node [anchor=west,snode,font=\tiny,fill=ugreen!30] (ns35) at ([xshift=0em,yshift=0em]ns34.east) {0.5};
+\node [anchor=west,vlnode,fill=purple!30] (ln1) at ([xshift=3.5em,yshift=-1.5em]ns15.east) {};
+\node [anchor=north west,vlnode,fill=yellow!30] (ln2) at ([xshift=-0.4em,yshift=-0.4em]ln1.north west) {};
+\node [anchor=north west,vlnode,fill=red!30] (ln3) at ([xshift=-0.4em,yshift=-0.4em]ln2.north west) {};
+\node [anchor=north west,vlnode,fill=blue!30] (ln4) at ([xshift=-0.4em,yshift=-0.4em]ln3.north west) {};
+\node [anchor=north west,vlnode,fill=ugreen!30] (ln5) at ([xshift=-0.4em,yshift=-0.4em]ln4.north west) {};
+\node [anchor=south] (d2) at ([xshift=0em,yshift=0.2em]ln1.north) {2D};
+\node [anchor=south,vlnode,rotate=-90] (ffn) at ([xshift=2em,yshift=0em]ln3.east) {FFN};
+\node [anchor=west,rectangle,inner sep=0mm,minimum height=3.5em,minimum width=0.8em,rounded corners=2pt,draw] (fn) at ([xshift=1.5em,yshift=0em]ffn.north) {$\mathbi{g}$};
+\draw [->,thick] ([xshift=0em,yshift=0em]n1.east) -- ([xshift=0em,yshift=0em]n4.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n2.east) -- ([xshift=0em,yshift=0em]n5.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n3.east) -- ([xshift=0em,yshift=0em]n6.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.east) -- ([xshift=0em,yshift=0em]nc11.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n5.east) -- ([xshift=0em,yshift=0em]nc21.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n6.east) -- ([xshift=0em,yshift=0em]nc31.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n7.south) -- ([xshift=0em,yshift=0em]n4.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n8.south) -- ([xshift=0em,yshift=0em]nc13.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]nc25.east) -- ([xshift=0em,yshift=0em]ns21.west);
+\draw[->,thick,dotted] ([xshift=0em,yshift=-0em]ns15.east)..controls +(east:1.5em) and +(west:1.5em)..([xshift=-0em,yshift=-0em]ln5.west) ;
+\draw[->,thick,dotted] ([xshift=0em,yshift=-0em]ns25.east)..controls +(east:1em) and +(west:1em)..([xshift=-0em,yshift=-0em]ln5.west) ;
+\draw[->,thick,dotted] ([xshift=0em,yshift=-0em]ns35.east)..controls +(east:1.5em) and +(west:1.5em)..([xshift=-0em,yshift=-0em]ln5.west) ;
+\draw [->,thick] ([xshift=0.8em,yshift=0em]ln3.east) -- ([xshift=0em,yshift=0em]ffn.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]ffn.north) -- ([xshift=0em,yshift=0em]fn.west);
+\draw [decorate,decoration={brace,mirror}] ([xshift=0em]n3.south west) to node [midway,font=\small,align=center,xshift=0em,yshift=-0.8em] {$d$} ([xshift=0em]n3.south east);
+\draw [decorate,decoration={brace,mirror}] ([xshift=0em]n6.south west) to node [midway,font=\small,align=center,xshift=0em,yshift=-0.8em] {$d_a$} ([xshift=0em]n6.south east);
+\draw [decorate,decoration={brace,mirror}] ([xshift=0em]n7.north west) to node [midway,font=\small,align=center,xshift=-0.7em,yshift=-0em] {$d$} ([xshift=0em]n7.south west);
+\draw [decorate,decoration={brace}] ([xshift=0em]n7.north west) to node [midway,font=\small,align=center,xshift=0em,yshift=0.7em] {$d_a$} ([xshift=0em]n7.north east);
+\draw [decorate,decoration={brace,mirror}] ([xshift=0em]n8.north west) to node [midway,font=\small,align=center,xshift=-0.8em,yshift=-0em] {$d_a$} ([xshift=0em]n8.south west);
+\draw [decorate,decoration={brace}] ([xshift=0em]n8.north west) to node [midway,font=\small,align=center,xshift=0em,yshift=0.8em] {$n_{hop}$} ([xshift=0em]n8.north east);
+\draw [decorate,decoration={brace,mirror}] ([xshift=0em]nc31.south west) to node [midway,font=\small,align=center,xshift=0em,yshift=-0.8em] {$n_{hop}$} ([xshift=0em]nc35.south east);
+\draw [decorate,decoration={brace,mirror}] ([xshift=0em]ln5.south west) to node [midway,font=\small,align=center,xshift=0em,yshift=-0.8em] {$d$} ([xshift=0em]ln5.south east);
+\draw [decorate] ([xshift=0em]ln5.south east) to node [midway,font=\footnotesize,align=center,xshift=1em,yshift=-0.5em] {$n_{hop}$} ([xshift=0em]ln1.south east);
+\draw [decorate,decoration={brace,mirror}] ([xshift=0em]fn.south east) to node [midway,font=\small,align=center,xshift=0.7em,yshift=-0em] {$d$} ([xshift=0em]fn.north east);
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/Figures/figure-light-weight-transformer-module.png
+++ b/Chapter15/Figures/figure-light-weight-transformer-module.png
--- a/Chapter15/Figures/figure-light-weight-transformer-module.tex
+++ b/Chapter15/Figures/figure-light-weight-transformer-module.tex
+%%%------------------------------------------------------------------------------------------------------------
+%%% Light-weight Transformer module
+\begin{center}
+\begin{tikzpicture}
+\tikzstyle{manode}=[rectangle,inner sep=0mm,minimum height=4em,minimum width=4em,rounded corners=5pt,thick,draw,fill=blue!20]
+\tikzstyle{ffnnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=6em,rounded corners=5pt,thick,fill=red!20,draw]
+\tikzstyle{ebnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=10em,rounded corners=5pt,thick,fill=green!20,draw]
+\begin{scope}[]
+\node [anchor=west,ffnnode] (f1) at (0, 0){FFN};
+\node [anchor=south,ebnode] (e1) at ([xshift=0em,yshift=1em]f1.north){Embedding};
+\node [anchor=south west,manode] (a1) at ([xshift=0em,yshift=1em]e1.north west){Attention};
+\node [anchor=south east,manode] (c1) at ([xshift=0em,yshift=1em]e1.north east){Conv};
+\node [anchor=south west,ebnode] (e2) at ([xshift=0em,yshift=1em]a1.north west){Embedding};
+\node [anchor=south,draw,circle,inner sep=4pt] (add1) at ([xshift=0em,yshift=0.5em]e2.north){};
+\node [anchor=south,ffnnode] (f2) at ([xshift=0em,yshift=0.5em]add1.north){FFN};
+\draw[->,thick] ([xshift=0em,yshift=0em]f1.north)--([xshift=0em,yshift=0em]e1.south);
+\draw[->,thick] ([xshift=0em,yshift=-1em]a1.south)--([xshift=0em,yshift=0em]a1.south);
+\draw[->,thick] ([xshift=0em,yshift=-1em]c1.south)--([xshift=0em,yshift=0em]c1.south);
+\draw[->,thick] ([xshift=0em,yshift=0em]a1.north)--([xshift=0em,yshift=1em]a1.north);
+\draw[->,thick] ([xshift=0em,yshift=0em]c1.north)--([xshift=0em,yshift=1em]c1.north);
+\draw[-,thick] ([xshift=0em,yshift=0em]e2.north)--([xshift=0em,yshift=0em]add1.south);
+\draw[->,thick] ([xshift=0em,yshift=0em]add1.north)--([xshift=0em,yshift=0em]f2.south);
+\draw[-] ([xshift=0em,yshift=0em]add1.west)--([xshift=-0em,yshift=0em]add1.east);
+\draw[-] ([xshift=0em,yshift=0em]add1.south)--([xshift=-0em,yshift=-0em]add1.north);
+\draw[->,thick,rectangle,rounded corners=5pt] ([xshift=0em,yshift=0.5em]f1.north)--([xshift=-6em,yshift=0.5em]f1.north)--([xshift=-5.45em,yshift=0em]add1.west)--([xshift=0em,yshift=0em]add1.west);
+\end{scope}
+\end{tikzpicture}
+\end{center}
\ No newline at end of file
--- a/Chapter15/Figures/figure-linear-layer-aggregation-network.png
+++ b/Chapter15/Figures/figure-linear-layer-aggregation-network.png
--- a/Chapter15/Figures/figure-multi-branch-attention-model.png
+++ b/Chapter15/Figures/figure-multi-branch-attention-model.png
--- a/Chapter15/Figures/figure-multi-branch-attention-model.tex
+++ b/Chapter15/Figures/figure-multi-branch-attention-model.tex
+%%%------------------------------------------------------------------------------------------------------------
+%%% Multi-branch attention model
+\begin{center}
+\begin{tikzpicture}
+\tikzstyle{manode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=10em,rounded corners=5pt,thick,draw,fill=teal!20]
+\tikzstyle{ffnnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=3em,rounded corners=5pt,thick,fill=red!20,draw]
+\tikzstyle{lnnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=2.5em,rounded corners=5pt,thick,fill=green!20,draw]
+\begin{scope}[]
+\node [anchor=east,circle,fill=black,inner sep = 2pt] (n1) at (-0, 0) {};
+\node [anchor=west,draw,circle,inner sep=5pt] (n2) at ([xshift=13em,yshift=0em]n1.east){};
+\node [anchor=west,lnnode] (n3) at ([xshift=1.5em,yshift=0em]n2.east){LN};
+\node [anchor=west,circle,fill=black,inner sep=2pt] (n4) at ([xshift=1.5em,yshift=0em]n3.east){};
+\node [anchor=west,draw,circle,inner sep=5pt] (n5) at ([xshift=5em,yshift=0em]n4.east){};
+\node [anchor=west,lnnode] (n6) at ([xshift=1.5em,yshift=0em]n5.east){LN};
+\node [anchor=west,manode] (a1) at ([xshift=1.5em,yshift=2em]n1.east){Multi-Head Attention};
+\node [anchor=south] (a2) at ([xshift=0em,yshift=0.2em]a1.north){$\cdots$};
+\node [anchor=south,manode] (a3) at ([xshift=0em,yshift=0.2em]a2.north){Multi-Head Attention};
+\node [anchor=west,ffnnode] (f1) at ([xshift=1em,yshift=2em]n4.east){FFN};
+\draw[->,thick] ([xshift=-1em,yshift=0em]n1.west)--([xshift=0em,yshift=0em]n1.west);
+\draw[->,thick] ([xshift=0em,yshift=0em]n1.east)--([xshift=0em,yshift=0em]n2.west);
+\draw[->,thick] ([xshift=0em,yshift=0em]n2.east)--([xshift=0em,yshift=0em]n3.west);
+\draw[->,thick] ([xshift=0em,yshift=0em]n3.east)--([xshift=0em,yshift=0em]n4.west);
+\draw[->,thick] ([xshift=0em,yshift=0em]n4.east)--([xshift=0em,yshift=0em]n5.west);
+\draw[->,thick] ([xshift=0em,yshift=0em]n5.east)--([xshift=0em,yshift=0em]n6.west);
+\draw[->,thick] ([xshift=0em,yshift=0em]n6.east)--([xshift=1em,yshift=0em]n6.east);
+\draw[->,thick] ([xshift=0em,yshift=0em]n1.east)--([xshift=0em,yshift=0em]a1.west);
+\draw[->,thick] ([xshift=0em,yshift=0em]n1.east)--([xshift=0em,yshift=0em]a3.west);
+\draw[->,thick] ([xshift=0em,yshift=0em]n4.east)--([xshift=0em,yshift=0em]f1.west);
+\draw[->,thick,ublue,dashed] ([xshift=0em,yshift=0em]a1.east)--([xshift=0em,yshift=0em]n2.west);
+\draw[->,thick,ublue,dashed] ([xshift=0em,yshift=0em]a3.east)--([xshift=0em,yshift=0em]n2.west);
+\draw[->,thick,ublue,dashed] ([xshift=0em,yshift=0em]f1.east)--([xshift=0em,yshift=0em]n5.west);
+\node [anchor=west,ublue,font=\footnotesize,align=left] (w1) at ([xshift=5em,yshift=-0.5em]a2.east){Dropped with\\probability $p$};
+\node [anchor=west,ublue,font=\footnotesize,align=left] (w2) at ([xshift=0.5em,yshift=0em]f1.east){Dropped with\\probability $p$};
+\draw[-] ([xshift=0em,yshift=0em]n2.west)--([xshift=-0em,yshift=0em]n2.east);
+\draw[-] ([xshift=0em,yshift=0em]n2.south)--([xshift=-0em,yshift=-0em]n2.north);
+\draw[-] ([xshift=0em,yshift=0em]n5.west)--([xshift=-0em,yshift=0em]n5.east);
+\draw[-] ([xshift=0em,yshift=0em]n5.south)--([xshift=-0em,yshift=-0em]n5.north);
+\end{scope}
+\end{tikzpicture}
+\end{center}
\ No newline at end of file
--- a/Chapter15/Figures/figure-multi-cell-transformer.png
+++ b/Chapter15/Figures/figure-multi-cell-transformer.png
--- a/Chapter15/Figures/figure-multi-task-structure.png
+++ b/Chapter15/Figures/figure-multi-task-structure.png
--- a/Chapter15/Figures/figure-parallel-RNN-structure.png
+++ b/Chapter15/Figures/figure-parallel-RNN-structure.png
--- a/Chapter15/Figures/figure-parallel-RNN-structure.tex
+++ b/Chapter15/Figures/figure-parallel-RNN-structure.tex
+%%%------------------------------------------------------------------------------------------------------------
+\begin{center}
+\begin{tikzpicture}
+\tikzstyle{wrnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=4em,rounded corners=5pt,fill=blue!30]
+\tikzstyle{arnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=4em,rounded corners=5pt,fill=red!30]
+\tikzstyle{dotnode}=[inner sep=0mm,minimum height=0.5em,minimum width=1.5em]
+\tikzstyle{wnode}=[inner sep=0mm,minimum height=1.8em]
+{\small
+\begin{scope}[]
+\node [anchor=north west,wnode] (w1) at (0,0) {Word prediction model};
+\node [anchor=west,wrnode] (w2) at ([xshift=1.5em,yshift=0em]w1.east) {$\mathbi{h}_{1}^{\textrm{word}}$};
+\node [anchor=west,wrnode] (w3) at ([xshift=1.5em,yshift=0em]w2.east) {$\mathbi{h}_{2}^{\textrm{word}}$};
+\node [anchor=west,wrnode] (w4) at ([xshift=7em,yshift=0em]w3.east) {$\mathbi{h}_{4}^{\textrm{word}}$};
+\node [anchor=west,dotnode] (dot1) at ([xshift=1.5em,yshift=0em]w4.east) {$\cdots$};
+\node [anchor=north east,wnode] (a1) at ([xshift=0em,yshift=-6.6em]w1.south east) {Action model};
+\node [anchor=west,arnode] (a2) at ([xshift=1.5em,yshift=0em]a1.east) {$\mathbi{h}_{1}^{\textrm{action}}$};
+\node [anchor=west,arnode] (a3) at ([xshift=1.5em,yshift=0em]a2.east) {$\mathbi{h}_{2}^{\textrm{action}}$};
+\node [anchor=west,arnode] (a4) at ([xshift=1.5em,yshift=0em]a3.east) {$\mathbi{h}_{3}^{\textrm{action}}$};
+\node [anchor=west,arnode] (a5) at ([xshift=1.5em,yshift=0em]a4.east) {$\mathbi{h}_{4}^{\textrm{action}}$};
+\node [anchor=west,arnode] (a6) at ([xshift=1.5em,yshift=0em]a5.east) {$\mathbi{h}_{5}^{\textrm{action}}$};
+\node [anchor=south,wnode] (word1) at ([xshift=0em,yshift=1em]w2.north) {你};
+\node [anchor=south,wnode] (word2) at ([xshift=0em,yshift=1em]w3.north) {是};
+\node [anchor=south,wnode] (word3) at ([xshift=0em,yshift=1em]w4.north) {谁};
+\node [anchor=north,wnode] (word4) at ([xshift=0em,yshift=-1em]w2.south) {$\langle$sos$\rangle$};
+\node [anchor=north,wnode] (word5) at ([xshift=0em,yshift=-1em]w3.south) {你};
+\node [anchor=north,wnode] (word6) at ([xshift=0em,yshift=-1em]w4.south) {是};
+\node [anchor=south,wnode] (word7) at ([xshift=0em,yshift=1em]a2.north) {shift};
+\node [anchor=south,wnode] (word8) at ([xshift=0em,yshift=1em]a3.north) {shift};
+\node [anchor=south,wnode] (word9) at ([xshift=0em,yshift=1em]a4.north) {left-reduce};
+\node [anchor=south,wnode] (word10) at ([xshift=0em,yshift=1em]a5.north) {shift};
+\node [anchor=south,wnode] (word11) at ([xshift=0em,yshift=1em]a6.north) {right-reduce};
+\node [anchor=north,wnode] (word12) at ([xshift=0em,yshift=-1em]a2.south) {$\langle$sos$\rangle$};
+\node [anchor=north,wnode] (word13) at ([xshift=0em,yshift=-1em]a3.south) {shift};
+\node [anchor=north,wnode] (word14) at ([xshift=0em,yshift=-1em]a4.south) {shift};
+\node [anchor=north,wnode] (word15) at ([xshift=0em,yshift=-1em]a5.south) {left-reduce};
+\node [anchor=north,wnode] (word16) at ([xshift=0em,yshift=-1em]a6.south) {shift};
+\node [anchor=south,wnode] (wl1) at ([xshift=6em,yshift=-1em]dot1.north) {是};
+\node [anchor=north,wnode] (wl2) at ([xshift=-2em,yshift=-2em]wl1.south) {你};
+\node [anchor=north,wnode] (wl3) at ([xshift=2em,yshift=-2em]wl1.south) {谁};
+\node [anchor=north,font=\tiny,rotate=45] (e1) at ([xshift=-2.2em,yshift=-0.4em]wl1.south) {by left-reduce};
+\node [anchor=north,font=\tiny,rotate=-45] (e2) at ([xshift=2.2em,yshift=-0.4em]wl1.south) {by right-reduce};
+\draw [->,thick] ([xshift=0em,yshift=0em]wl1.south) -- ([xshift=0em,yshift=0em]wl2.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]wl1.south) -- ([xshift=0em,yshift=0em]wl3.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]w1.east) -- ([xshift=0em,yshift=0em]w2.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]w2.east) -- ([xshift=0em,yshift=0em]w3.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]w3.east) -- ([xshift=0em,yshift=0em]w4.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]w4.east) -- ([xshift=0em,yshift=0em]dot1.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]a1.east) -- ([xshift=0em,yshift=0em]a2.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]a2.east) -- ([xshift=0em,yshift=0em]a3.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]a3.east) -- ([xshift=0em,yshift=0em]a4.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]a4.east) -- ([xshift=0em,yshift=0em]a5.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]a5.east) -- ([xshift=0em,yshift=0em]a6.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]w2.north) -- ([xshift=0em,yshift=0em]word1.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]w3.north) -- ([xshift=0em,yshift=0em]word2.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]w4.north) -- ([xshift=0em,yshift=0em]word3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]word4.north) -- ([xshift=0em,yshift=0em]w2.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]word5.north) -- ([xshift=0em,yshift=0em]w3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]word6.north) -- ([xshift=0em,yshift=0em]w4.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]word7.north) -- ([xshift=0em,yshift=0em]word4.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]word8.north) -- ([xshift=0em,yshift=0em]word5.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]word10.north) -- ([xshift=0em,yshift=0em]word6.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]a2.north) -- ([xshift=0em,yshift=0em]word7.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]a3.north) -- ([xshift=0em,yshift=0em]word8.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]a4.north) -- ([xshift=0em,yshift=0em]word9.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]a5.north) -- ([xshift=0em,yshift=0em]word10.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]a6.north) -- ([xshift=0em,yshift=0em]word11.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]word12.north) -- ([xshift=0em,yshift=0em]a2.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]word13.north) -- ([xshift=0em,yshift=0em]a3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]word14.north) -- ([xshift=0em,yshift=0em]a4.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]word15.north) -- ([xshift=0em,yshift=0em]a5.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]word16.north) -- ([xshift=0em,yshift=0em]a6.south);
+\draw[->,thick,dashed] ([xshift=0em,yshift=-0em]word1.east)..controls +(east:4em) and +(west:3em)..([xshift=-0em,yshift=-0em]a3.west) ;
+\draw[->,thick,dashed] ([xshift=0em,yshift=-0em]word2.east)..controls +(east:4em) and +(west:3em)..([xshift=-0em,yshift=-0em]a4.west) ;
+\draw[->,thick,dashed] ([xshift=0em,yshift=-0em]word2.east)..controls +(east:10em) and +(west:4em)..([xshift=-0em,yshift=-0em]a5.west) ;
+\draw[->,thick,dashed] ([xshift=0em,yshift=-0em]word3.east)..controls +(east:4em) and +(west:3em)..([xshift=-0em,yshift=-0em]a6.west) ;
+\end{scope}
+}
+\end{tikzpicture}
+\end{center}
\ No newline at end of file
--- a/Chapter15/Figures/figure-parsing-tree-of-a-sentence.png
+++ b/Chapter15/Figures/figure-parsing-tree-of-a-sentence.png
--- a/Chapter15/Figures/figure-parsing-tree-of-a-sentence.tex
+++ b/Chapter15/Figures/figure-parsing-tree-of-a-sentence.tex
+%%%------------------------------------------------------------------------------------------------------------
+\begin{center}
+\begin{tikzpicture}
+\tikzstyle{wnode}=[inner sep=0mm,minimum height=1.5em,minimum width=2em]
+\begin{scope}[sibling distance=15pt, level distance = 30pt]
+\Tree[.\node(n1){{S}};
+        [.\node(n2){{NP}};
+	        [.\node(n3){{PRN}}; \node(w1){{I}};]
+		    ]
+	        [.\node(n4){{VP}};
+	            [. \node(n5){VBP}; \node(w2){{love}};]
+	            [. \node(cw4){NP};
+                   [. \node(n6){NNS}; \node(w3){{dogs}};]
+                ]
+	        ]
+     ]
+\node [anchor=north] (label1) at ([xshift=0em,yshift=-4em]w2.south) {(a) Syntax tree};
+\end{scope}
+\begin{scope}[xshift=1.8in,yshift=0em]
+\node [anchor=west,wnode] (w1) at (0,0) {I};
+\node [anchor=west,wnode] (w2) at ([xshift=3em,yshift=0em]w1.east) {love};
+\node [anchor=west,wnode] (w3) at ([xshift=3em,yshift=0em]w2.east) {dogs};
+\node [anchor=north,wnode] (w4) at ([xshift=0em,yshift=-2em]w1.south) {$w_1$};
+\node [anchor=north,wnode] (w5) at ([xshift=0em,yshift=-2em]w2.south) {$w_2$};
+\node [anchor=north,wnode] (w6) at ([xshift=0em,yshift=-2em]w3.south) {$w_3$};
+\node [anchor=north] (label2) at ([xshift=0em,yshift=-1.5em]w5.south) {(b) Word sequence};
+\end{scope}
+\begin{scope}[xshift=1.2in,yshift=-1.5in]
+\node [anchor=west,wnode] (l1) at (0,0) {S};
+\node [anchor=west,wnode] (l2) at ([xshift=1em,yshift=0em]l1.east) {NP};
+\node [anchor=west,wnode] (l3) at ([xshift=1em,yshift=0em]l2.east) {PRN};
+\node [anchor=west,wnode] (l4) at ([xshift=1em,yshift=0em]l3.east) {VP};
+\node [anchor=west,wnode] (l5) at ([xshift=1em,yshift=0em]l4.east) {VBP};
+\node [anchor=west,wnode] (l6) at ([xshift=1em,yshift=0em]l5.east) {NP};
+\node [anchor=west,wnode] (l7) at ([xshift=1em,yshift=0em]l6.east) {NNS};
+\node [anchor=north,wnode] (l8) at ([xshift=0em,yshift=-1em]l1.south) {$l_1$};
+\node [anchor=north,wnode] (l9) at ([xshift=0em,yshift=-1em]l2.south) {$l_2$};
+\node [anchor=north,wnode] (l10) at ([xshift=0em,yshift=-1em]l3.south) {$l_3$};
+\node [anchor=north,wnode] (l11) at ([xshift=0em,yshift=-1em]l4.south) {$l_4$};
+\node [anchor=north,wnode] (l12) at ([xshift=0em,yshift=-1em]l5.south) {$l_5$};
+\node [anchor=north,wnode] (l13) at ([xshift=0em,yshift=-1em]l6.south) {$l_6$};
+\node [anchor=north,wnode] (l14) at ([xshift=0em,yshift=-1em]l7.south) {$l_7$};
+\node [anchor=north] (label3) at ([xshift=0em,yshift=-1.5em]l11.south) {(c) Syntactic sequence};
+\end{scope}
+\end{tikzpicture}
+\end{center}
\ No newline at end of file
--- a/Chapter15/Figures/figure-syntax-tree-linearization-example.png
+++ b/Chapter15/Figures/figure-syntax-tree-linearization-example.png
--- a/Chapter15/Figures/figure-syntax-tree-linearization-example.tex
+++ b/Chapter15/Figures/figure-syntax-tree-linearization-example.tex
+%%%------------------------------------------------------------------------------------------------------------
+\begin{center}
+\begin{tikzpicture}
+\tikzstyle{wnode}=[inner sep=0mm,minimum height=1.5em,minimum width=2em]
+{\small
+\begin{scope}[sibling distance=15pt, level distance = 30pt]
+\Tree[.\node(n1){{S}};
+        [.\node(n2){{NP}}; \node(w1){{Jane}};]
+        [.\node(n3){{VP}};
+            [. \node(w2){had};]
+            [. \node(n4){NP}; \node(w3){{a cat}};]
+            ]
+        [. \node(w4){.};]
+     ]
+\end{scope}
+}
+{\small
+\begin{scope}[xshift=1in,yshift=-0.7in]
+\node [anchor=west] (n1) at (0.5em,0em) {(Root(S(NP Jane)NP(VP had(NP a cat)NP)VP .)S)Root};
+\draw [->,very thick] ([xshift=-2.3em,yshift=0em]n1.west) -- ([xshift=-0.5em,yshift=0em]n1.west);
+\end{scope}
+}
+\end{tikzpicture}
+\end{center}
\ No newline at end of file
--- a/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-1.png
+++ b/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-1.png
--- a/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-1.tex
+++ b/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-1.tex
+%%%------------------------------------------------------------------------------------------------------------
+\begin{center}
+\begin{tikzpicture}
+\tikzstyle{wrnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=3em,rounded corners=5pt,fill=blue!30]
+\tikzstyle{srnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=3em,rounded corners=5pt,fill=yellow!30]
+\tikzstyle{dotnode}=[inner sep=0mm,minimum height=0.5em,minimum width=1.5em]
+\tikzstyle{wnode}=[inner sep=0mm,minimum height=1.8em]
+{\small
+\begin{scope}[]
+\node [anchor=west,wrnode] (wr1) at (0,0) {$\mathbi{h}_{w_1}$};
+\node [anchor=west,wrnode] (wr2) at ([xshift=1em,yshift=0em]wr1.east) {$\mathbi{h}_{w_2}$};
+\node [anchor=west,wrnode] (wr3) at ([xshift=1em,yshift=0em]wr2.east) {$\mathbi{h}_{w_3}$};
+\node [anchor=west,srnode] (sr1) at ([xshift=2em,yshift=0em]wr3.east) {$\mathbi{h}_{l_1}$};
+\node [anchor=west,dotnode] (dot1) at ([xshift=0.8em,yshift=0em]sr1.east) {$\cdots$};
+\node [anchor=west,srnode] (sr2) at ([xshift=0.8em,yshift=0em]dot1.east) {$\mathbi{h}_{l_3}$};
+\node [anchor=west,dotnode] (dot2) at ([xshift=0.8em,yshift=0em]sr2.east) {$\cdots$};
+\node [anchor=west,srnode] (sr3) at ([xshift=0.8em,yshift=0em]dot2.east) {$\mathbi{h}_{l_5}$};
+\node [anchor=west,dotnode] (dot3) at ([xshift=0.8em,yshift=0em]sr3.east) {$\cdots$};
+\node [anchor=west,srnode] (sr4) at ([xshift=0.8em,yshift=0em]dot3.east) {$\mathbi{h}_{l_7}$};
+\node [anchor=north,wnode,font=\footnotesize] (w1) at ([xshift=0em,yshift=-1em]wr1.south) {$w_1$\ :\ I};
+\node [anchor=north,wnode,font=\footnotesize] (w2) at ([xshift=0em,yshift=-1em]wr2.south) {$w_2$\ :\ love};
+\node [anchor=north,wnode,font=\footnotesize] (w3) at ([xshift=0em,yshift=-1em]wr3.south) {$w_3$\ :\ dogs};
+\node [anchor=north,wnode,font=\footnotesize] (w4) at ([xshift=0em,yshift=-1em]sr1.south) {$l_1$\ :\ S};
+\node [anchor=north,dotnode] (dot4) at ([xshift=0em,yshift=-2.4em]dot1.south) {$\cdots$};
+\node [anchor=north,wnode,font=\footnotesize] (w5) at ([xshift=0em,yshift=-1em]sr2.south) {$l_3$\ :\ PRN};
+\node [anchor=north,dotnode] (dot5) at ([xshift=0em,yshift=-2.2em]dot2.south) {$\cdots$};
+\node [anchor=north,wnode,font=\footnotesize] (w6) at ([xshift=0em,yshift=-1em]sr3.south) {$l_5$\ :\ VBP};
+\node [anchor=north,dotnode] (dot6) at ([xshift=0em,yshift=-2.3em]dot3.south) {$\cdots$};
+\node [anchor=north,wnode,font=\footnotesize] (w7) at ([xshift=0em,yshift=-1em]sr4.south) {$l_7$\ :\ NNS};
+\node [anchor=south,circle,draw,minimum size=1.2em] (c1) at ([xshift=2.5em,yshift=2em]wr2.north){};
+\node [anchor=west,circle,draw,minimum size=1.2em] (c2) at ([xshift=8em,yshift=0em]c1.east){};
+\node [anchor=west,circle,draw,minimum size=1.2em] (c3) at ([xshift=8em,yshift=0em]c2.east){};
+\node [anchor=south,srnode] (m1) at ([xshift=0em,yshift=2em]c1.north) {$\mathbi{h}_{l_3}$};
+\node [anchor=south,wrnode] (m2) at ([xshift=0em,yshift=0em]m1.north) {$\mathbi{h}_{w_1}$};
+\node [anchor=south,srnode] (m3) at ([xshift=0em,yshift=2em]c2.north) {$\mathbi{h}_{l_5}$};
+\node [anchor=south,wrnode] (m4) at ([xshift=0em,yshift=0em]m3.north) {$\mathbi{h}_{w_2}$};
+\node [anchor=south,srnode] (m5) at ([xshift=0em,yshift=2em]c3.north) {$\mathbi{h}_{l_7}$};
+\node [anchor=south,wrnode] (m6) at ([xshift=0em,yshift=0em]m5.north) {$\mathbi{h}_{w_3}$};
+\draw[-] (c1.west)--(c1.east);
+\draw[-] (c1.north)--(c1.south);
+\draw[-] (c2.west)--(c2.east);
+\draw[-] (c2.north)--(c2.south);
+\draw[-] (c3.west)--(c3.east);
+\draw[-] (c3.north)--(c3.south);
+\begin{pgfonlayer}{background}
+\node [rectangle,inner sep=0.5em,draw=blue!80,dashed,very thick,rounded corners=10pt] [fit = (wr1) (wr3) (w1) (w3)] (box1) {};
+\node [rectangle,inner sep=0.5em,draw=yellow!80,dashed,very thick,rounded corners=10pt] [fit = (sr1) (sr4) (w4) (w7)] (box2) {};
+\node [rectangle,minimum height=5em,inner sep=0.6em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (m1) (m2)] (box3) {};
+\node [rectangle,minimum height=5em,inner sep=0.6em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (m3) (m4)] (box4) {};
+\node [rectangle,minimum height=5em,inner sep=0.6em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (m5) (m6)] (box5) {};
+\end{pgfonlayer}
+\node [anchor=south,wnode] (h1) at ([xshift=0em,yshift=0.1em]box3.north) {${\mathbi{h}'}_1$\ :\ };
+\node [anchor=south,wnode] (h2) at ([xshift=0em,yshift=0.1em]box4.north) {${\mathbi{h}'}_2$\ :\ };
+\node [anchor=south,wnode] (h3) at ([xshift=0em,yshift=0.1em]box5.north) {${\mathbi{h}'}_3$\ :\ };
+\draw [->,thick] ([xshift=0em,yshift=0em]w1.north) -- ([xshift=0em,yshift=0em]wr1.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]w2.north) -- ([xshift=0em,yshift=0em]wr2.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]w3.north) -- ([xshift=0em,yshift=0em]wr3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]w4.north) -- ([xshift=0em,yshift=0em]sr1.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]w5.north) -- ([xshift=0em,yshift=0em]sr2.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]w6.north) -- ([xshift=0em,yshift=0em]sr3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]w7.north) -- ([xshift=0em,yshift=0em]sr4.south);
+\draw [->,thick] ([xshift=0em,yshift=0.7em]dot4.north) -- ([xshift=0em,yshift=-0.7em]dot1.south);
+\draw [->,thick] ([xshift=0em,yshift=0.7em]dot5.north) -- ([xshift=0em,yshift=-0.7em]dot2.south);
+\draw [->,thick] ([xshift=0em,yshift=0.7em]dot6.north) -- ([xshift=0em,yshift=-0.7em]dot3.south);
+\draw [<->,thick] ([xshift=0em,yshift=0em]wr1.east) -- ([xshift=0em,yshift=0em]wr2.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]wr2.east) -- ([xshift=0em,yshift=0em]wr3.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]sr1.east) -- ([xshift=0em,yshift=0em]dot1.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]dot1.east) -- ([xshift=0em,yshift=0em]sr2.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]sr2.east) -- ([xshift=0em,yshift=0em]dot2.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]dot2.east) -- ([xshift=0em,yshift=0em]sr3.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]sr3.east) -- ([xshift=0em,yshift=0em]dot3.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]dot3.east) -- ([xshift=0em,yshift=0em]sr4.west);
+\draw[->,thick] ([xshift=0em,yshift=-0em]wr1.north)..controls +(north:2em) and +(west:0em)..([xshift=-0em,yshift=-0em]c1.west) ;
+\draw[->,thick] ([xshift=0em,yshift=-0em]sr2.north)..controls +(north:2em) and +(south:1em)..([xshift=-0em,yshift=-0em]c1.south east) ;
+\draw[->,thick] ([xshift=0em,yshift=-0em]wr2.north)..controls +(north:2em) and +(west:0em)..([xshift=-0em,yshift=-0em]c2.west) ;
+\draw[->,thick] ([xshift=0em,yshift=-0em]sr3.north)..controls +(north:2em) and +(east:0em)..([xshift=-0em,yshift=-0em]c2.east) ;
+\draw[->,thick] ([xshift=0em,yshift=-0em]wr3.north)..controls +(north:2em) and +(south:1em)..([xshift=-0em,yshift=-0em]c3.south west) ;
+\draw[->,thick] ([xshift=0em,yshift=-0em]sr4.north)..controls +(north:2em) and +(east:0em)..([xshift=-0em,yshift=-0em]c3.east) ;
+\draw [->,thick] ([xshift=0em,yshift=0em]c1.north) -- ([xshift=0em,yshift=0em]box3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]c2.north) -- ([xshift=0em,yshift=0em]box4.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]c3.north) -- ([xshift=0em,yshift=0em]box5.south);
+\node [anchor=north] (r1) at ([xshift=0em,yshift=-1em]w2.south) {词语RNN};
+\node [anchor=north] (r2) at ([xshift=3em,yshift=-1em]w5.south) {句法RNN};
+\node [anchor=north] (label1) at ([xshift=0em,yshift=-4em]dot4.south) {(a)平行结构};
+\end{scope}
+}
+\end{tikzpicture}
+\end{center}
\ No newline at end of file
--- a/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-2.png
+++ b/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-2.png
--- a/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-2.tex
+++ b/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-2.tex
+%%%------------------------------------------------------------------------------------------------------------
+\begin{center}
+\begin{tikzpicture}
+\tikzstyle{wrnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=3em,rounded corners=5pt,fill=blue!30]
+\tikzstyle{srnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=3em,rounded corners=5pt,fill=yellow!30]
+\tikzstyle{dotnode}=[inner sep=0mm,minimum height=0.5em,minimum width=1.5em]
+\tikzstyle{wnode}=[inner sep=0mm,minimum height=1.8em]
+{\small
+\begin{scope}[]
+\node [anchor=west,srnode] (sr1) at (0,0) {$\mathbi{h}_{l_1}$};
+\node [anchor=west,dotnode] (dot1) at ([xshift=0.8em,yshift=0em]sr1.east) {$\cdots$};
+\node [anchor=west,srnode] (sr2) at ([xshift=0.8em,yshift=0em]dot1.east) {$\mathbi{h}_{l_3}$};
+\node [anchor=west,dotnode] (dot2) at ([xshift=0.8em,yshift=0em]sr2.east) {$\cdots$};
+\node [anchor=west,srnode] (sr3) at ([xshift=0.8em,yshift=0em]dot2.east) {$\mathbi{h}_{l_5}$};
+\node [anchor=west,dotnode] (dot3) at ([xshift=0.8em,yshift=0em]sr3.east) {$\cdots$};
+\node [anchor=west,srnode] (sr4) at ([xshift=0.8em,yshift=0em]dot3.east) {$\mathbi{h}_{l_7}$};
+\node [anchor=north,wnode,font=\footnotesize] (w4) at ([xshift=0em,yshift=-1em]sr1.south) {$l_1$\ :\ S};
+\node [anchor=north,dotnode] (dot4) at ([xshift=0em,yshift=-2.4em]dot1.south) {$\cdots$};
+\node [anchor=north,wnode,font=\footnotesize] (w5) at ([xshift=0em,yshift=-1em]sr2.south) {$l_3$\ :\ PRN};
+\node [anchor=north,dotnode] (dot5) at ([xshift=0em,yshift=-2.2em]dot2.south) {$\cdots$};
+\node [anchor=north,wnode,font=\footnotesize] (w6) at ([xshift=0em,yshift=-1em]sr3.south) {$l_5$\ :\ VBP};
+\node [anchor=north,dotnode] (dot6) at ([xshift=0em,yshift=-2.3em]dot3.south) {$\cdots$};
+\node [anchor=north,wnode,font=\footnotesize] (w7) at ([xshift=0em,yshift=-1em]sr4.south) {$l_7$\ :\ NNS};
+\node [anchor=south,circle,draw,minimum size=1.2em] (c1) at ([xshift=0em,yshift=4.5em]sr2.north){};
+\node [anchor=south,circle,draw,minimum size=1.2em] (c2) at ([xshift=0em,yshift=4.5em]sr3.north){};
+\node [anchor=south,circle,draw,minimum size=1.2em] (c3) at ([xshift=0em,yshift=4.5em]sr4.north){};
+\draw[-] (c1.west)--(c1.east);
+\draw[-] (c1.north)--(c1.south);
+\draw[-] (c2.west)--(c2.east);
+\draw[-] (c2.north)--(c2.south);
+\draw[-] (c3.west)--(c3.east);
+\draw[-] (c3.north)--(c3.south);
+\node [anchor=north east,wnode,font=\footnotesize] (w1) at ([xshift=-1em,yshift=-1em]c1.south west) {$w_1$\ :\ I};
+\node [anchor=north east,wnode,font=\footnotesize] (w2) at ([xshift=-1em,yshift=-1em]c2.south west) {$w_2$\ :\ love};
+\node [anchor=north east,wnode,font=\footnotesize] (w3) at ([xshift=-1em,yshift=-1em]c3.south west) {$w_3$\ :\ dogs};
+\node [anchor=south,wnode] (w8) at ([xshift=0em,yshift=0.5em]c1.north) {$\mathbi{e}_{w_1}$};
+\node [anchor=south,wnode] (w9) at ([xshift=0em,yshift=0.5em]c2.north) {$\mathbi{e}_{w_2}$};
+\node [anchor=south,wnode] (w10) at ([xshift=0em,yshift=0.5em]c3.north) {$\mathbi{e}_{w_3}$};
+\begin{pgfonlayer}{background}
+\node [rectangle,minimum height=5em,inner sep=0.6em,fill=ugreen!20,rounded corners=8pt] [fit = (c1) (w8)] (box6) {};
+\node [rectangle,minimum height=5em,inner sep=0.6em,fill=ugreen!20,rounded corners=8pt] [fit = (c2) (w9)] (box7) {};
+\node [rectangle,minimum height=5em,inner sep=0.6em,fill=ugreen!20,rounded corners=8pt] [fit = (c3) (w10)] (box8) {};
+\end{pgfonlayer}
+\node [anchor=south,wrnode] (wr1) at ([xshift=0em,yshift=1em]box6.north) {$\mathbi{h}_{w_1}$};
+\node [anchor=south,wrnode] (wr2) at ([xshift=0em,yshift=1em]box7.north) {$\mathbi{h}_{w_2}$};
+\node [anchor=south,wrnode] (wr3) at ([xshift=0em,yshift=1em]box8.north) {$\mathbi{h}_{w_3}$};
+\node [anchor=south,wnode] (h1) at ([xshift=0em,yshift=0.3em]wr1.north) {${\mathbi{h}'}_1$\ :\ };
+\node [anchor=south,wnode] (h2) at ([xshift=0em,yshift=0.3em]wr2.north) {${\mathbi{h}'}_2$\ :\ };
+\node [anchor=south,wnode] (h3) at ([xshift=0em,yshift=0.3em]wr3.north) {${\mathbi{h}'}_3$\ :\ };
+\begin{pgfonlayer}{background}
+\node [rectangle,minimum width=20em,minimum height=13em,inner sep=0.5em,draw=blue!80,dashed,very thick,rounded corners=10pt] [fit = (h1) (w1) (h3) (c3)] (box1) {};
+\node [rectangle,inner sep=0.5em,draw=yellow!80,dashed,very thick,rounded corners=10pt] [fit = (sr1) (sr4) (w4) (w7)] (box2) {};
+\node [rectangle,inner sep=0.4em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (wr1)] (box3) {};
+\node [rectangle,inner sep=0.4em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (wr2)] (box4) {};
+\node [rectangle,inner sep=0.4em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (wr3)] (box5) {};
+\end{pgfonlayer}
+\draw [->,thick] ([xshift=0em,yshift=0em]w4.north) -- ([xshift=0em,yshift=0em]sr1.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]w5.north) -- ([xshift=0em,yshift=0em]sr2.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]w6.north) -- ([xshift=0em,yshift=0em]sr3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]w7.north) -- ([xshift=0em,yshift=0em]sr4.south);
+\draw [->,thick] ([xshift=0em,yshift=0.7em]dot4.north) -- ([xshift=0em,yshift=-0.7em]dot1.south);
+\draw [->,thick] ([xshift=0em,yshift=0.7em]dot5.north) -- ([xshift=0em,yshift=-0.7em]dot2.south);
+\draw [->,thick] ([xshift=0em,yshift=0.7em]dot6.north) -- ([xshift=0em,yshift=-0.7em]dot3.south);
+\draw [<->,thick] ([xshift=0em,yshift=0em]wr1.east) -- ([xshift=0em,yshift=0em]wr2.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]wr2.east) -- ([xshift=0em,yshift=0em]wr3.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]sr1.east) -- ([xshift=0em,yshift=0em]dot1.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]dot1.east) -- ([xshift=0em,yshift=0em]sr2.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]sr2.east) -- ([xshift=0em,yshift=0em]dot2.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]dot2.east) -- ([xshift=0em,yshift=0em]sr3.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]sr3.east) -- ([xshift=0em,yshift=0em]dot3.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]dot3.east) -- ([xshift=0em,yshift=0em]sr4.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]sr2.north) -- ([xshift=0em,yshift=0em]c1.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]sr3.north) -- ([xshift=0em,yshift=0em]c2.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]sr4.north) -- ([xshift=0em,yshift=0em]c3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]box6.north) -- ([xshift=0em,yshift=0em]wr1.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]box7.north) -- ([xshift=0em,yshift=0em]wr2.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]box8.north) -- ([xshift=0em,yshift=0em]wr3.south);
+\node [anchor=east] (r2) at ([xshift=-2em,yshift=0em]box2.west) {句法RNN};
+\node [anchor=south] (r1) at ([xshift=0em,yshift=8em]r2.north) {词语RNN};
+\node [anchor=north] (label2) at ([xshift=0em,yshift=-2em]w5.south) {(b)分层结构};
+\end{scope}
+}
+\end{tikzpicture}
+\end{center}
\ No newline at end of file
--- a/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-3.png
+++ b/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-3.png
--- a/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-3.tex
+++ b/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-3.tex
+%%%------------------------------------------------------------------------------------------------------------
+\begin{center}
+\begin{tikzpicture}
+\tikzstyle{hnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=3em,rounded corners=5pt,fill=red!30]
+\tikzstyle{dotnode}=[inner sep=0mm,minimum height=0.5em,minimum width=1.5em]
+\tikzstyle{wnode}=[inner sep=0mm,minimum height=1.8em]
+{\small
+\begin{scope}[]
+\node [anchor=west,hnode] (n1) at (0,0) {$\mathbi{h}_{1}$};
+\node [anchor=west,hnode] (n2) at ([xshift=1em,yshift=0em]n1.east) {$\mathbi{h}_{2}$};
+\node [anchor=west,dotnode] (dot1) at ([xshift=1em,yshift=0em]n2.east) {$\cdots$};
+\node [anchor=west,hnode] (n3) at ([xshift=1em,yshift=0em]dot1.east) {$\mathbi{h}_{4}$};
+\node [anchor=west,dotnode] (dot2) at ([xshift=1em,yshift=0em]n3.east) {$\cdots$};
+\node [anchor=west,hnode] (n4) at ([xshift=1em,yshift=0em]dot2.east) {$\mathbi{h}_{7}$};
+\node [anchor=west,dotnode] (dot3) at ([xshift=1em,yshift=0em]n4.east) {$\cdots$};
+\node [anchor=west,hnode] (n5) at ([xshift=1em,yshift=0em]dot3.east) {$\mathbi{h}_{10}$};
+\node [anchor=north,wnode,font=\footnotesize] (w1) at ([xshift=0em,yshift=-1em]n1.south) {$l_1$\ :\ S};
+\node [anchor=north,wnode,font=\footnotesize] (w2) at ([xshift=0em,yshift=-1em]n2.south) {$l_2$\ :\ NP};
+\node [anchor=north,dotnode] (dot4) at ([xshift=0em,yshift=-2.4em]dot1.south) {$\cdots$};
+\node [anchor=north,wnode,font=\footnotesize] (w3) at ([xshift=0em,yshift=-1em]n3.south) {$w_1$\ :\ I};
+\node [anchor=north,dotnode] (dot5) at ([xshift=0em,yshift=-2.2em]dot2.south) {$\cdots$};
+\node [anchor=north,wnode,font=\footnotesize] (w4) at ([xshift=0em,yshift=-1em]n4.south) {$w_2$\ :\ love};
+\node [anchor=north,dotnode] (dot6) at ([xshift=0em,yshift=-2.3em]dot3.south) {$\cdots$};
+\node [anchor=north,wnode,font=\footnotesize] (w5) at ([xshift=0em,yshift=-1em]n5.south) {$w_3$\ :\ dogs};
+\node [anchor=south,wnode] (h1) at ([xshift=0em,yshift=0.3em]n3.north) {${\mathbi{h}'}_1$\ :\ };
+\node [anchor=south,wnode] (h2) at ([xshift=0em,yshift=0.3em]n4.north) {${\mathbi{h}'}_2$\ :\ };
+\node [anchor=south,wnode] (h3) at ([xshift=0em,yshift=0.3em]n5.north) {${\mathbi{h}'}_3$\ :\ };
+\begin{pgfonlayer}{background}
+\node [rectangle,inner sep=0.5em,draw=red!80,dashed,very thick,rounded corners=10pt] [fit = (w1) (w5) (n1) (h3)] (box1) {};
+\node [rectangle,inner sep=0.4em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (n3)] (box3) {};
+\node [rectangle,inner sep=0.4em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (n4)] (box4) {};
+\node [rectangle,inner sep=0.4em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (n5)] (box5) {};
+\end{pgfonlayer}
+\node [anchor=east] (r1) at ([xshift=-2em,yshift=0em]box1.west) {词语RNN};
+\node [anchor=south west,wnode] (l1) at ([xshift=1em,yshift=6em]r1.north west) {先序遍历句法树，得到序列：};
+\node [anchor=north west,wnode,align=center] (l2) at ([xshift=0.5em,yshift=-0.6em]l1.north east) {S\\[0.5em]$l_1$};
+\node [anchor=north west,wnode,align=center] (l3) at ([xshift=0.5em,yshift=0em]l2.north east) {NP\\[0.5em]$l_2$};
+\node [anchor=north west,wnode,align=center] (l4) at ([xshift=0.5em,yshift=0em]l3.north east) {PRN\\[0.5em]$l_3$};
+\node [anchor=north west,wnode,align=center] (l5) at ([xshift=0.5em,yshift=0em]l4.north east) {I\\[0.5em]$w_1$};
+\node [anchor=north west,wnode,align=center] (l6) at ([xshift=0.5em,yshift=0em]l5.north east) {VP\\[0.5em]$l_4$};
+\node [anchor=north west,wnode,align=center] (l7) at ([xshift=0.5em,yshift=0em]l6.north east) {VBP\\[0.5em]$l_5$};
+\node [anchor=north west,wnode,align=center] (l8) at ([xshift=0.5em,yshift=0em]l7.north east) {love\\[0.5em]$w_2$};
+\node [anchor=north west,wnode,align=center] (l9) at ([xshift=0.5em,yshift=0em]l8.north east) {NP\\[0.5em]$l_6$};
+\node [anchor=north west,wnode,align=center] (l10) at ([xshift=0.5em,yshift=0em]l9.north east) {NNS\\[0.5em]$l_7$};
+\node [anchor=north west,wnode,align=center] (l11) at ([xshift=0.5em,yshift=0em]l10.north east) {dogs\\[0.5em]$w_3$};
+\draw [->,thick] ([xshift=0em,yshift=0em]w1.north) -- ([xshift=0em,yshift=0em]n1.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]w2.north) -- ([xshift=0em,yshift=0em]n2.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]w3.north) -- ([xshift=0em,yshift=0em]n3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]w4.north) -- ([xshift=0em,yshift=0em]n4.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]w5.north) -- ([xshift=0em,yshift=0em]n5.south);
+\draw [->,thick] ([xshift=0em,yshift=0.7em]dot4.north) -- ([xshift=0em,yshift=-0.7em]dot1.south);
+\draw [->,thick] ([xshift=0em,yshift=0.7em]dot5.north) -- ([xshift=0em,yshift=-0.7em]dot2.south);
+\draw [->,thick] ([xshift=0em,yshift=0.7em]dot6.north) -- ([xshift=0em,yshift=-0.7em]dot3.south);
+\draw [<->,thick] ([xshift=0em,yshift=0em]n1.east) -- ([xshift=0em,yshift=0em]n2.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]n2.east) -- ([xshift=0em,yshift=0em]dot1.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]dot1.east) -- ([xshift=0em,yshift=0em]n3.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]n3.east) -- ([xshift=0em,yshift=0em]dot2.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]dot2.east) -- ([xshift=0em,yshift=0em]n4.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]n4.east) -- ([xshift=0em,yshift=0em]dot3.west);
+\draw [<->,thick] ([xshift=0em,yshift=0em]dot3.east) -- ([xshift=0em,yshift=0em]n5.west);
+\node [anchor=north] (label3) at ([xshift=-2em,yshift=-2em]w3.south) {(c)混合结构};
+\end{scope}
+}
+\end{tikzpicture}
+\end{center}
\ No newline at end of file
--- a/Chapter15/Figures/figure-weighted-transformer-network-structure.png
+++ b/Chapter15/Figures/figure-weighted-transformer-network-structure.png
--- a/Chapter15/chapter15.tex
+++ b/Chapter15/chapter15.tex
--- a/Chapter17/Figures/figure-picture-translation.tex
+++ b/Chapter17/Figures/figure-picture-translation.tex
 \begin{tikzpicture}[node distance = 0]
-\tikzstyle{every node}=[scale=0.9]
+\tikzstyle{every node}=[scale=0.85]
 \begin {scope}
 \node[draw=white,scale=0.6] (input) at (0,0){\includegraphics[width=0.62\textwidth]{./Chapter17/Figures/figure-bank-without-attention.png}};(1.9,-1.4);
-\node[anchor=south] (english1) at ([xshift=-0.4em,yshift=-3.5em]input.south) {\begin{tabular}{l}{\normalsize\bfnew{英语}}{\large{：A medium sized child}}\end{tabular}};
+\node[anchor=west] (label1) at ([xshift=-3.5em]input.west) {\begin{tabular}{l}{\normalsize{图片：}}\end{tabular}};
-\node[anchor=south] (english2) at ([xshift=1.8em,yshift=-1.2em]english1.south) {\begin{tabular}{l}{\large{jumps off a dusty {\red{\underline{bank}}}.}} \end{tabular}};
+\node[anchor=south] (label2) at ([yshift=-7.15em]label1.south) {\begin{tabular}{l}{\normalsize{源文：}}\end{tabular}};
-\draw[decorate,decoration={brace,amplitude=4mm},very thick] ([xshift=7em]input.90) -- ([xshift=5.7em,yshift=0.5em]english2.270);
+\node[anchor=south] (english1) at ([xshift=-0.1em,yshift=-3.5em]input.south) {\begin{tabular}{l}{\large{A\,medium\,sized\,child\,jumps\,off}}\end{tabular}};
+\node[anchor=south] (english2) at ([xshift=-3.3em,yshift=-1.2em]english1.south) {\begin{tabular}{l}{\large{a dusty {\red{\underline{bank}}}.}} \end{tabular}};
+\draw[decorate,decoration={brace,amplitude=4mm},very thick] ([xshift=7em]input.90) -- ([xshift=10.4em,yshift=0.5em]english2.270);
-\node[anchor=east,rectangle,thick,rounded corners,minimum width=3.5em,minimum height=2.5em,text centered,draw=black!70,fill=red!25](trans)at ([xshift=8em,yshift=5.3em]english1.east){\normalsize{翻译模型}};
+\node[anchor=east,rectangle,thick,rounded corners,minimum width=3.5em,minimum height=2.5em,text centered,draw=black!70,fill=red!25](trans)at ([xshift=7.5em,yshift=5.1em]english1.east){\normalsize{翻译模型}};
-\draw[->,very thick]([xshift=-1.65em]trans.west) to (trans.west);
+\draw[->,very thick]([xshift=-1.4em]trans.west) to (trans.west);
-\draw[->,very thick](trans.east) to ([xshift=1.65em]trans.east);
+\draw[->,very thick](trans.east) to ([xshift=1.4em]trans.east);
-\node[anchor=east] (de1) at ([xshift=5.85cm,yshift=-0.1em]trans.east) {\begin{tabular}{l}{\normalsize\bfnew{汉语}}{\normalsize{：一个半大孩子从尘土飞扬}}\end{tabular}};
+\node[anchor=east] (de1) at ([xshift=4.9cm,yshift=-0.1em]trans.east) {\begin{tabular}{l}{\normalsize{译文：}}{\normalsize{一个半大孩子从尘土}}\end{tabular}};
-\node[anchor=south] (de2) at ([xshift=0em,yshift=-1.5em]de1.south) {\begin{tabular}{l}{\normalsize{的{\red{\underline{河床}}}上跳下来。}} \end{tabular}};
+\node[anchor=south] (de2) at ([xshift=1.65em,yshift=-1.5em]de1.south) {\begin{tabular}{l}{\normalsize{飞扬的{\red{\underline{河床}}}上跳下来。}} \end{tabular}};
 \end {scope}
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/chapter17.tex
+++ b/Chapter17/chapter17.tex
@@ -495,7 +495,7 @@
 \parinterval
 区别于篇章级统计机器翻译，篇章级神经机器翻译不需要针对某一具体的上下文现象构造相应的特征，而是通过翻译模型本身从上下文句子中抽取和融合上下文信息。通常情况下，篇章级机器翻译可以采用局部建模的手段将前一句或者周围几句作为上下文送入模型。针对需要长距离上下文的情况，也可以使用全局建模的手段直接从篇章中所有句子中提取上下文信息。近几年多数研究工作都在探索更有效的局部建模或全局建模方法，主要包括改进输入\upcite{DBLP:conf/discomt/TiedemannS17,DBLP:conf/naacl/BawdenSBH18,DBLP:conf/wmt/GonzalesMS17,DBLP:journals/corr/abs-1910-07481}、多编码器结构\upcite{DBLP:journals/corr/JeanLFC17,DBLP:journals/corr/abs-1805-10163,DBLP:conf/emnlp/ZhangLSZXZL18}、层次结构\upcite{DBLP:conf/naacl/MarufMH19,DBLP:conf/acl/HaffariM18,DBLP:conf/emnlp/YangZMGFZ19,DBLP:conf/ijcai/ZhengYHCB20}以及基于缓存的方法\upcite{DBLP:conf/coling/KuangXLZ18,DBLP:journals/tacl/TuLSZ18}四类。
-\parinterval 此外，篇章级机器翻译面临的另外一个挑战是数据稀缺。篇章级机器翻译所需要的双语数据需要保留篇章边界，数量相比于句子级双语数据要少很多。除了在之前提到的端到端方法中采用预训练或者参数共享的手段（见{\chaptersixteen}），也可以采用新的建模手段来缓解数据稀缺问题。比如，在句子级翻译模型的推断过程中，通过篇章级语言模型在目标端引入上下文信息\upcite{DBLP:conf/discomt/GarciaCE19,DBLP:journals/tacl/YuSSLKBD20,DBLP:journals/corr/abs-2010-12827}，或者对句子级的翻译结果进行修正\upcite{DBLP:conf/aaai/XiongH0W19,DBLP:conf/acl/VoitaST19,DBLP:conf/emnlp/VoitaST19}（{\color{red} 如何修正？用什么修正？修正什么？感觉这句话没有信息量}）。
+\parinterval 此外，篇章级机器翻译面临的另外一个挑战是数据稀缺。篇章级机器翻译所需要的双语数据需要保留篇章边界，数量相比于句子级双语数据要少很多。除了在之前提到的端到端方法中采用预训练或者参数共享的手段（见{\chaptersixteen}），也可以采用新的建模手段来缓解数据稀缺问题。这类方法通常将篇章级翻译流程进行分离：先训练一个句子级的翻译模型，再通过一些额外的模块来引入上下文信息，从而达到“充分利用句子级的双语数据，提升模型性能”的目的。比如，在句子级翻译模型的推断过程中，通过在目标端结合篇章级语言模型引入上下文信息\upcite{DBLP:conf/discomt/GarciaCE19,DBLP:journals/tacl/YuSSLKBD20,DBLP:journals/corr/abs-2010-12827}，或者基于句子级的翻译结果，使用两阶段解码等手段引入上下文信息，进而对句子级翻译结果进行修正\upcite{DBLP:conf/aaai/XiongH0W19,DBLP:conf/acl/VoitaST19,DBLP:conf/emnlp/VoitaST19}。
 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -505,7 +505,7 @@
 \parinterval BLEU等自动评价指标能够在一定程度上反映译文的整体质量，但是并不能有效地评估篇章级翻译模型的性能。这是由于传统测试数据中出现篇章上下文现象的比例相对较少，并且$n$-gram的匹配很难检测到一些具体的语言现象，这使得研究人员很难通过BLEU得分来判断篇章级翻译模型的效果。
-\parinterval 为此，研究人员总结了机器翻译任务中存在的上下文现象，并基于此设计了相应的自动评价指标。比如针对篇章中代词的翻译问题，首先借助词对齐工具确定源语言中的代词在译文和参考答案中的对应位置，然后通过计算{\color{red} 谁的？}准确率和召回率等指标对代词翻译质量进行评价\upcite{DBLP:conf/iwslt/HardmeierF10,DBLP:conf/discomt/WerlenP17}。针对篇章中的词汇衔接，使用{\small\sffamily\bfseries{词汇链}}\index{词汇链}（Lexical Chain\index{Lexical Chain}）\footnote{词汇链指篇章中语义相关的词所构成的序列。}等来获取能够反映词汇衔接质量的分数，然后通过加权的方式与常规的BLEU或METEOR等指标结合在一起\upcite{DBLP:conf/emnlp/WongK12,DBLP:conf/discomt/GongZZ15}。针对篇章中的连接词，使用候选词典和词对齐工具对源语中连接词的正确翻译结果进行计数，计算其准确率\upcite{DBLP:conf/cicling/HajlaouiP13}。
+\parinterval 为此，研究人员总结了机器翻译任务中存在的上下文现象，并基于此设计了相应的自动评价指标。比如针对篇章中代词的翻译问题，首先借助词对齐工具确定源语言中的代词在译文和参考答案中的对应位置，然后通过计算译文中代词的准确率和召回率等指标对代词翻译质量进行评价\upcite{DBLP:conf/iwslt/HardmeierF10,DBLP:conf/discomt/WerlenP17}。针对篇章中的词汇衔接，使用{\small\sffamily\bfseries{词汇链}}\index{词汇链}（Lexical Chain\index{Lexical Chain}）\footnote{词汇链指篇章中语义相关的词所构成的序列。}等来获取能够反映词汇衔接质量的分数，然后通过加权的方式与常规的BLEU或METEOR等指标结合在一起\upcite{DBLP:conf/emnlp/WongK12,DBLP:conf/discomt/GongZZ15}。针对篇章中的连接词，使用候选词典和词对齐工具对源语中连接词的正确翻译结果进行计数，计算其准确率\upcite{DBLP:conf/cicling/HajlaouiP13}。
 \parinterval 除了直接对译文打分，也有一些工作针对特有的上下文现象手工构造了相应的测试套件用于评价翻译质量。测试套件中每一个测试样例都包含一个正确翻译的结果，以及多个错误结果，一个理想的翻译模型应该对正确的翻译结果评价最高，排名在所有错误结果之上，此时就可以根据模型是否能挑选出正确翻译结果来评估其性能。这种方法可以很好地衡量翻译模型在某一特定上下文现象上的处理能力，比如词义消歧\upcite{DBLP:conf/wmt/RiosMS18}、代词翻译\upcite{DBLP:conf/naacl/BawdenSBH18,DBLP:conf/wmt/MullerRVS18}和一些衔接问题\upcite{DBLP:conf/acl/VoitaST19}等。但是该方法也存在使用范围受限于测试集的语种和规模的缺点，因此扩展性较差。
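
Note on the test-suite evaluation described in the paragraph above: each test case holds one correct translation and several incorrect ones, and the model passes a case when it ranks the correct translation above every incorrect one. A minimal sketch of that accuracy computation follows. The function `contrastive_accuracy`, the `score` callback (for example a model log-probability) and the toy data are all hypothetical and are not taken from any of the cited test suites.

```python
from typing import Callable, List, Tuple

# (source sentence, correct translation, list of incorrect translations)
Example = Tuple[str, str, List[str]]


def contrastive_accuracy(examples: List[Example],
                         score: Callable[[str, str], float]) -> float:
    """Fraction of examples whose correct translation outscores all wrong ones.

    `score(source, translation)` is an assumed interface returning a number
    that is higher for translations the model prefers.
    """
    if not examples:
        raise ValueError("the test suite is empty")
    hits = 0
    for source, correct, wrong in examples:
        # Each example is assumed to contain at least one wrong candidate.
        best_wrong = max(score(source, candidate) for candidate in wrong)
        if score(source, correct) > best_wrong:
            hits += 1
    return hits / len(examples)


# Toy scores standing in for a real model.
toy_scores = {"good1": -1.0, "bad1": -2.0, "bad2": -3.0,
              "good2": -5.0, "bad3": -4.0}
suite = [("src1", "good1", ["bad1", "bad2"]),
         ("src2", "good2", ["bad3"])]

print(contrastive_accuracy(suite, lambda src, tgt: toy_scores[tgt]))
```

Because only the ranking matters, the scores do not need to be calibrated probabilities; any monotone model score works.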
@@ -572,7 +572,7 @@
 \mathbi{d}&=&\textrm{Attention}(\mathbi{h},\mathbi{h}^{\textrm{pre}},\mathbi{h}^{\textrm{pre}})
 \label{eq:17-3-3}
 \end{eqnarray}
-其中，$\mathbi{h}$作为Query（查询），$\mathbi{h}^{\textrm pre}$作为Key（键）和Value（值）。然后通过门控机制将待翻译句子中每个位置的编码表示和上下文中对应位置（{\color{red} 什么叫上下文中对应位置？}）的信息进行融合，具体方式如下：
+其中，$\mathbi{h}$作为Query（查询），$\mathbi{h}^{\textrm{pre}}$作为Key（键）和Value（值）。然后通过门控机制将待翻译句子中每个位置的编码表示和该位置对应的上下文信息进行融合，具体方式如下：
 \begin{eqnarray}
 \widetilde{\mathbi{h}_{t}}&=&\lambda_{t}\mathbi{h}_{t}+(1-\lambda_{t})\mathbi{d}_{t}
 \label{eq:17-3-4}\\
@@ -631,9 +631,9 @@
 \subsubsection{4. 基于缓存的方法}
-\parinterval 除了以上提到的建模方法，还有一类基于缓存的方法\upcite{DBLP:journals/tacl/TuLSZ18,DBLP:conf/coling/KuangXLZ18}。这类方法最大的特点在于将篇章翻译看作一个连续的过程（{\color{red} 如何理解连续的过程？}），然后在这个过程中通过一个额外的缓存来记录一些相关信息，最后在每个句子解码的过程中使用这个缓存来提供上下文信息。图\ref{fig:17-20}描述了一种基于缓存的篇章级翻译模型结构\upcite{DBLP:journals/tacl/TuLSZ18}。 在这里，翻译模型基于循环神经网络（见{\chapterten}），但是这种方法同样适用于包括Transformer在内的其他神经机器翻译模型。
+\parinterval 除了以上提到的建模方法，还有一类基于缓存的方法\upcite{DBLP:journals/tacl/TuLSZ18,DBLP:conf/coling/KuangXLZ18}。这类方法最大的特点在于将篇章翻译看作一个连续的过程，即依次翻译篇章中的每一个句子，该过程中通过一个额外的缓存来记录一些相关信息，且在每个句子的推断过程中都使用这个缓存来提供上下文信息。图\ref{fig:17-20}描述了一种基于缓存的篇章级翻译模型结构\upcite{DBLP:journals/tacl/TuLSZ18}。在这里，翻译模型基于循环神经网络（见{\chapterten}），但是这种方法同样适用于包括Transformer在内的其他神经机器翻译模型。
-\parinterval 模型中篇章上下文的建模依赖于缓存的读和写操作。缓存的写操作指的是：按照一定规则将翻译历史中一些译文单词对应的上下文向量$\mathbi{C}_r$作为键，而将其解码器端的隐藏状态$\mathbi{s}_r$ 对应的值写入到缓存中（{\color{red} 为啥变量下标用$r$}）。而缓存的读操作是指将待翻译句子中第$t$个单词的上下文向量$\mathbi{C}_t$作为查询，与缓存中的所有键分别进行匹配，并根据其匹配程度进行带权相加，最后得到当前待翻译句子的篇章上下文信息 $\mathbi{d}$。该方法中单词的解码器端隐藏状态$\mathbi{s}_t$与对应位置的上下文信息$\mathbi{d}_t$的融合也是基于门控机制。事实上，由于该方法中缓存空间是有限的，其内容的更新也存在一定的规则：在当前句子的翻译结束后，如果单词$y_t$的对应信息未曾写入缓存，则写入其中的空槽或者替换最久未使用的键值对；如果$y_t$ 已作为翻译历史存在于缓存中，则将对应的键值对按照以下规则进行更新:
+\parinterval 模型中篇章上下文的建模依赖于缓存的读和写操作。缓存的写操作指的是：按照一定规则将翻译历史中一些译文单词对应的上下文向量和其解码器端的隐藏状态分别作为键和值写入到缓存中。而缓存的读操作是指将待翻译句子中第$t$个单词的上下文向量$\mathbi{C}_t$作为查询，与缓存中的所有键分别进行匹配，并根据其匹配程度对缓存中的值进行带权相加，最后得到当前待翻译句子的篇章上下文信息$\mathbi{d}$。该方法中单词的解码器端隐藏状态$\mathbi{s}_t$与对应位置的上下文信息$\mathbi{d}_t$的融合也是基于门控机制。事实上，由于该方法中缓存空间是有限的，其内容的更新也存在一定的规则：在当前句子的翻译结束后，如果单词$y_t$的对应信息未曾写入缓存，则写入其中的空槽或者替换最久未使用的键值对；如果$y_t$已作为翻译历史存在于缓存中，则将对应的键值对按照以下规则进行更新：
 \begin{eqnarray}
 \mathbi{k}_{i}&=&\frac{\mathbi{k}_{i}+\mathbi{C}_{t}}{2}
 \label{eq:17-3-10}\\

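Note on the cache-based method in the hunk above: the read and write rules described in the paragraph can be sketched as a small key-value store. This is an illustration under stated assumptions, not the cited model. The class name `TranslationCache`, the use of a softmax over dot products as the matching score, the averaging of the value (mirroring the key update shown in the hunk), and the choice to count only writes as "use" for replacement are all assumptions.

```python
import math
from collections import OrderedDict


class TranslationCache:
    """Fixed-size cache: target word -> (key, value), kept in recency order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()  # first item = least recently used

    def read(self, query):
        """Match the query against every key and return the weighted sum of values."""
        if not self.slots:
            return None
        entries = list(self.slots.values())
        # Assumed matching function: dot product followed by a softmax.
        scores = [sum(q * k for q, k in zip(query, key)) for key, _ in entries]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(entries[0][1])
        return [sum(w / total * value[j] for w, (_, value) in zip(weights, entries))
                for j in range(dim)]

    def write(self, word, context, state):
        """Store (context vector, decoder state) for a translated word."""
        if word in self.slots:
            key, value = self.slots[word]
            # Key update: average of the old key and the new context vector.
            new_key = [(a + b) / 2 for a, b in zip(key, context)]
            # Assumption: the value is averaged with the new state in the same way.
            new_value = [(a + b) / 2 for a, b in zip(value, state)]
            self.slots[word] = (new_key, new_value)
            self.slots.move_to_end(word)
        else:
            if len(self.slots) >= self.capacity:
                self.slots.popitem(last=False)  # replace the least recently used pair
            self.slots[word] = (list(context), list(state))


cache = TranslationCache(capacity=2)
cache.write("cat", [1.0, 0.0], [2.0, 2.0])
cache.write("dog", [0.0, 1.0], [4.0, 0.0])
cache.write("cat", [3.0, 0.0], [0.0, 2.0])   # existing word: averaged update
cache.write("bird", [1.0, 1.0], [0.0, 0.0])  # cache full: "dog" is replaced
print(list(cache.slots))
print(cache.read([0.0, 0.0]))
```

A real system would operate on batched tensors and learn the matching function; the ordered dictionary is used here only to make the replacement rule explicit.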
--- a/bibliography.bib
+++ b/bibliography.bib
@@ -6618,6 +6618,108 @@ author    = {Yoshua Bengio and
  publisher = {{IEEE} International Conference on Computer Vision},
  year      = {2017}
 }
+@inproceedings{DBLP:journals/corr/SuGMRUVWY16a,
+  author    = {Pei{-}Hao Su and
+               Milica Gasic and
+               Nikola Mrksic and
+               Lina Maria Rojas{-}Barahona and
+               Stefan Ultes and
+               David Vandyke and
+               Tsung{-}Hsien Wen and
+               Steve J. Young},
+  title     = {Continuously Learning Neural Dialogue Management},
+  publisher   = {CoRR},
+  volume    = {abs/1606.02689},
+  year      = {2016}
+}
+@inproceedings{DBLP:journals/corr/abs-1709-02349,
+  author    = {Iulian Vlad Serban and
+               Chinnadhurai Sankar and
+               Mathieu Germain and
+               Saizheng Zhang and
+               Zhouhan Lin and
+               Sandeep Subramanian and
+               Taesup Kim and
+               Michael Pieper and
+               Sarath Chandar and
+               Nan Rosemary Ke and
+               Sai Mudumba and
+               Alexandre de Br{\'{e}}bisson and
+               Jose Sotelo and
+               Dendi Suhubdy and
+               Vincent Michalski and
+               Alexandre Nguyen and
+               Joelle Pineau and
+               Yoshua Bengio},
+  title     = {A Deep Reinforcement Learning Chatbot},
+  publisher   = {CoRR},
+  volume    = {abs/1709.02349},
+  year      = {2017}
+}
+@inproceedings{DBLP:conf/emnlp/WuTQLL18,
+  author    = {Lijun Wu and
+               Fei Tian and
+               Tao Qin and
+               Jianhuang Lai and
+               Tie{-}Yan Liu},
+  title     = {A Study of Reinforcement Learning for Neural Machine Translation},
+  pages     = {3612--3621},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2018}
+}
+@inproceedings{DBLP:journals/jmlr/RossGB11,
+  author    = {St{\'{e}}phane Ross and
+               Geoffrey J. Gordon and
+               Drew Bagnell},
+  title     = {A Reduction of Imitation Learning and Structured Prediction to No-Regret
+               Online Learning},
+  publisher = {International Conference on Artificial Intelligence and Statistics},
+  series    = {{JMLR} Proceedings},
+  volume    = {15},
+  pages     = {627--635},
+  year      = {2011}
+}
+@inproceedings{DBLP:conf/aaai/VenkatramanHB15,
+  author    = {Arun Venkatraman and
+               Martial Hebert and
+               J. Andrew Bagnell},
+  title     = {Improving Multi-Step Prediction of Learned Time Series Models},
+  publisher = {AAAI Conference on Artificial Intelligence},
+  pages     = {3024--3030},
+  year      = {2015}
+}
+@inproceedings{DBLP:conf/iclr/LiuCLS17,
+  author    = {Yanpei Liu and
+               Xinyun Chen and
+               Chang Liu and
+               Dawn Song},
+  title     = {Delving into Transferable Adversarial Examples and Black-box Attacks},
+  publisher = {International Conference on Learning Representations},
+  year      = {2017}
+}
+@inproceedings{DBLP:journals/tnn/YuanHZL19,
+  author    = {Xiaoyong Yuan and
+               Pan He and
+               Qile Zhu and
+               Xiaolin Li},
+  title     = {Adversarial Examples: Attacks and Defenses for Deep Learning},
+  publisher   = {IEEE Transactions on Neural Networks and Learning Systems},
+  volume    = {30},
+  number    = {9},
+  pages     = {2805--2824},
+  year      = {2019}
+}
+@inproceedings{DBLP:conf/infocom/YuanHL020,
+  author    = {Xiaoyong Yuan and
+               Pan He and
+               Xiaolin Li and
+               Dapeng Wu},
+  title     = {Adaptive Adversarial Attack on Scene Text Recognition},
+  pages     = {358--363},
+  publisher = {IEEE Conference on Computer Communications},
+  year      = {2020}
+}
 %%%%% chapter 13------------------------------------------------------
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%