Merge branch 'shanweiqiao' of http://47.105.50.196/NiuTrans/Toy-MT-Introduction into shanweiqiao

# Conflicts: # Book/Chapter4/chapter4.tex # Book/structure.tex

Commit 56b0d6a4 authored Apr 17, 2020 by 单韦乔
--- a/Book/Chapter1/chapter1.tex
+++ b/Book/Chapter1/chapter1.tex
@@ -25,11 +25,11 @@
 \end{figure}
 %-------------------------------------------

-\parinterval 这里我们更加关注人类语言之间的翻译问题，即自然语言的翻译。如图\ref{fig:zh_en-example}所示，通过计算机可以将一段中文文字自动转化为英文文字，其中中文被称为{\small\bfnew{源语言}}（Source Language），英文被称为{\small\bfnew{目标语言}}（Target Language）。
+\parinterval 这里更加关注人类语言之间的翻译问题，即自然语言的翻译。如图\ref{fig:zh_en-example}所示，通过计算机可以将一段中文文字自动转化为英文文字，其中中文被称为{\small\bfnew{源语言}}（Source Language），英文被称为{\small\bfnew{目标语言}}（Target Language）。

 \parinterval 一直以来，自然语言文字的翻译往往是由人工完成。让计算机像人一样进行翻译似乎还是电影中的桥段，因为很难想象人类语言的多样性和复杂性可以用计算机语言进行描述。但是时至今日，人工智能技术的发展已经大大超越了人类传统的认知，用计算机进行自动翻译也不再是一种想象，它已经深入到人们生活的很多方面，并发挥着重要作用。而这种由计算机进行自动翻译的过程也被称作{\small\bfnew{机器翻译}}（Machine Translation）。类似的，自动翻译、智能翻译、多语言自动转换等概念也是指同样的事情。如果将今天的机器翻译和人工翻译进行对比，可以发现机器翻译系统所生成的译文还并不完美，甚至有时翻译质量非常差。但是其优点在于速度快并且成本低，更为重要的是机器翻译系统可以从大量数据中不断学习和进化。人工翻译尽管精度很高，但是费时费力。当需要翻译大量的文本且精度要求不那么高时，比如海量数据的浏览型任务，机器翻译的优势就体现了出来。对于人工作业无法完成的事情，使用机器翻译可能只需花几个小时甚至几分钟就能完成。这就类似于拿着锄头耕地种庄稼和使用现代化机器作业之间的区别。

-\parinterval 实现机器翻译往往需要多个学科知识的融合，如数学、语言学、计算机科学、心理学等等。而最终呈现给我们的是一套软件系统\ ——\ 即机器翻译系统。通俗来讲，机器翻译系统就是一个可以在计算机上运行的软件工具，与我们使用的其它软件一样。只不过机器翻译系统是由``不可见的程序''组成，虽然这个系统非常复杂，但是呈现出来的展示形式却很简单，比如输入是待翻译的句子或文本，输出是译文句子或文本。
+\parinterval 实现机器翻译往往需要多个学科知识的融合，如数学、语言学、计算机科学、心理学等等。而最终呈现给使用者的是一套软件系统\ ——\ 即机器翻译系统。通俗来讲，机器翻译系统就是一个可以在计算机上运行的软件工具，与人们使用的其他软件一样。只不过机器翻译系统是由``不可见的程序''组成，虽然这个系统非常复杂，但是呈现出来的展示形式却很简单，比如输入是待翻译的句子或文本，输出是译文句子或文本。

 %----------------------------------------------
 % 图1.2
@@ -59,7 +59,7 @@
 \vspace{0.5em}
 \item {\small\bfnew{计算机的``理解''与人类的``理解''存在鸿沟}}。人类一直希望把自己进行翻译所使用的知识描述出来，并用计算机程序进行实现，包括早期基于规则的机器翻译方法都源自这个思想。但是经过实践发现，人和计算机在``理解''自然语言上存在着明显差异。首先，人类的语言能力是经过长时间多种外部环境因素共同刺激形成的，这种能力很难直接准确表达。也就是说人类的语言知识本身就很难描述，更不用说让计算机来理解；其次，人和机器翻译系统理解语言的目标不一样。人理解和使用语言是为了进行生活和工作，目标非常复杂，而机器翻译系统更多的是为了对某些数学上定义的目标函数进行优化。也就是说，机器翻译系统关注的是翻译这个单一目标，而并不是像人一样进行复杂的活动；此外，人和计算机的运行方式有着本质区别。人类语言能力的生物学机理与机器翻译系统所使用的计算模型本质上是不同的，机器翻译系统使用的是其自身能够理解的``知识''，比如，统计学上的词语表示。这种知识并不需要人来理解，当然从系统开发的角度，计算机也并不需要理解人是如何思考的。
 \vspace{0.5em}
-\item {\small\bfnew{单一的方法无法解决多样的翻译问题}}。首先，语种的多样性会导致任意两种语言之间的翻译实际上都是不同的翻译任务。比如，世界上存在的语言不下几千种，如果任意两种语言进行互译就有上百万种翻译需求。虽然已经有研究者尝试用同一个框架甚至同一个翻译系统进行全语种的翻译，但是离真正可用还有相当的距离；此外，不同的领域，不同的应用场景对翻译也有不同的需求。比如，文学作品的翻译和新闻的翻译就有不同、口译和笔译也有不同，类似的情况不胜枚举。机器翻译需要适用多样的需求，这些又进一步增加了计算机建模的难度；还有，对于机器翻译来说，充足的高质量数据是必要的，但是不同语种、不同领域、不同应用场景所拥有的数据量有明显差异，甚至很多语种几乎没有可用的数据，这时开发机器翻译系统的难度可想而知。注意，现在的机器翻译还无法像人类一样在学习少量样例的情况下进行举一反三，因此数据稀缺情况下的机器翻译也给我们提出了很大挑战。
+\item {\small\bfnew{单一的方法无法解决多样的翻译问题}}。首先，语种的多样性会导致任意两种语言之间的翻译实际上都是不同的翻译任务。比如，世界上存在的语言不下几千种，如果任意两种语言进行互译就有上百万种翻译需求。虽然已经有研究者尝试用同一个框架甚至同一个翻译系统进行全语种的翻译，但是离真正可用还有相当的距离；此外，不同的领域，不同的应用场景对翻译也有不同的需求。比如，文学作品的翻译和新闻的翻译就有不同、口译和笔译也有不同，类似的情况不胜枚举。机器翻译需要适用多样的需求，这些又进一步增加了计算机建模的难度；还有，对于机器翻译来说，充足的高质量数据是必要的，但是不同语种、不同领域、不同应用场景所拥有的数据量有明显差异，甚至很多语种几乎没有可用的数据，这时开发机器翻译系统的难度可想而知。注意，现在的机器翻译还无法像人类一样在学习少量样例的情况下进行举一反三，因此数据稀缺情况下的机器翻译也给研究者提出了很大挑战。
 \end{itemize}
 \vspace{0.5em}

@@ -206,7 +206,7 @@

 \parinterval 早期的机器翻译研究都是以基于规则的方法为主，特别是在上世纪70年代，以基于规则方法为代表的专家系统是人工智能中最具代表性的研究领域。它的主要思想是以词典和人工书写的规则库作为翻译知识，用一系列规则的组合完成翻译。

-\parinterval 图\ref{fig:Example-RBMT}展示了一个使用规则进行翻译的实例。这里，利用一个简单的汉译英规则库完成对句子``我对你感到满意''的翻译。当翻译``我''时，从规则库中找到规则1，该规则表示遇到单词``我''就翻译为``I''；类似的，也可以从规则库中找到规则4，该规则表示翻译调序，即将单词``you''放到``be satisfied with''后面。可以看到，这些规则的使用和我们进行翻译时所使用的思想非常类似，可以说基于规则方法实际上在试图描述人类进行翻译的思维过程。
+\parinterval 图\ref{fig:Example-RBMT}展示了一个使用规则进行翻译的实例。这里，利用一个简单的汉译英规则库完成对句子``我对你感到满意''的翻译。当翻译``我''时，从规则库中找到规则1，该规则表示遇到单词``我''就翻译为``I''；类似的，也可以从规则库中找到规则4，该规则表示翻译调序，即将单词``you''放到``be satisfied with''后面。可以看到，这些规则的使用和进行翻译时所使用的思想非常类似，可以说基于规则方法实际上在试图描述人类进行翻译的思维过程。

 \parinterval 但是，基于规则的机器翻译也存在问题。首先，书写规则需要消耗大量人力，规则库的维护代价极高；其次，规则很难涵盖所有的语言现象；再有，自然语言存在大量的歧义现象，规则之间也会存在冲突，这也导致规则数量不可能无限制增长。
 \subsection{基于实例的机器翻译}\index{Chapter1.4.2}
@@ -228,7 +228,7 @@

 \subsection{统计机器翻译}\index{Chapter1.4.3}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\parinterval 统计机器翻译兴起于上世纪90年代\cite{brown1990statistical}\cite{koehn2003statistical}它利用统计模型从单/双语语料中自动学习翻译知识。具体来说，可以使用单语语料学习语言模型，使用双语平行语料学习翻译模型，并使用这些统计模型完成对翻译过程的建模。整个过程不需要人工编写规则，也不需要从实例中构建翻译模板。无论是词、短语，甚至句法结构，统计机器翻译系统都可以自动学习，人更多的是参与定义翻译所需的特征和基本翻译单元的形式。而翻译知识都保存在模型的参数中。
+\parinterval 统计机器翻译兴起于上世纪90年代\cite{brown1990statistical,koehn2003statistical}，它利用统计模型从单/双语语料中自动学习翻译知识。具体来说，可以使用单语语料学习语言模型，使用双语平行语料学习翻译模型，并使用这些统计模型完成对翻译过程的建模。整个过程不需要人工编写规则，也不需要从实例中构建翻译模板。无论是词、短语，甚至句法结构，统计机器翻译系统都可以自动学习，人更多的是参与定义翻译所需的特征和基本翻译单元的形式。而翻译知识都保存在模型的参数中。
 %----------------------------------------------
 % 图1.11
 \begin{figure}[htp]
@@ -245,7 +245,7 @@

 \subsection{神经机器翻译}\index{Chapter1.4.4}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\parinterval 随着机器学习技术的发展，基于深度学习的神经机器翻译逐渐开始兴起。自2014年开始，它在短短几年内已经在大部分任务上取得了明显的优势\cite{sutskever2014sequence}\cite{bahdanau2014neural}神经机器翻译中，词串被表示成实数向量，即分布式向量表示。这样，翻译过程并不是在离散化的单词和短语上进行，而是在实数向量空间上计算，因此它对词序列表示的方式产生了本质的改变。通常，机器翻译可以被看作一个序列到另一个序列的转化。在神经机器翻译中，序列到序列的转化过程可以由{\small\bfnew{编码器-解码器}}（encoder-decoder）框架实现。其中，编码器把源语言序列进行编码，并提取源语言中信息进行分布式表示，之后解码器再把这种信息转换为另一种语言的表达。
+\parinterval 随着机器学习技术的发展，基于深度学习的神经机器翻译逐渐开始兴起。自2014年开始，它在短短几年内已经在大部分任务上取得了明显的优势\cite{sutskever2014sequence,bahdanau2014neural}。在神经机器翻译中，词串被表示成实数向量，即分布式向量表示。这样，翻译过程并不是在离散化的单词和短语上进行，而是在实数向量空间上计算，因此它对词序列表示的方式产生了本质的改变。通常，机器翻译可以被看作一个序列到另一个序列的转化。在神经机器翻译中，序列到序列的转化过程可以由{\small\bfnew{编码器-解码器}}（encoder-decoder）框架实现。其中，编码器把源语言序列进行编码，并提取源语言中信息进行分布式表示，之后解码器再把这种信息转换为另一种语言的表达。

 %----------------------------------------------
 % 图1.12
@@ -307,7 +307,7 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \parinterval 机器翻译质量的评价对于机器翻译的发展具有至关重要的意义。首先，评价的结果可以用于指导研究人员不断改进机器翻译结果，并找到最具潜力的技术发展方向。同时，一个权威的翻译质量评价指标可以帮助用户更有效地使用机器翻译的结果。

-\parinterval 一般来说，机器翻译的翻译{\small\bfnew{质量评价}}（Quality Evaluation）是指在参考答案或者评价标准已知的情况下对译文进行打分。这类方法可以被称作有参考答案的评价，包括人工打分、BLEU 等自动评价方法都是典型的有参考答案评价。相对的，{\small\bfnew{无参考答案的评价}}（Quality Estimation）是指在没有人工评价和参考答案的情况下，对译文质量进行评估。这类方法可以被看作是对机器翻译译文进行质量`` 预测''，这样用户可以选择性的使用机器翻译结果。这里我们主要讨论有参考答案的评价，因为这类方法是机器翻译系统研发所使用的主要评价方法。
+\parinterval 一般来说，机器翻译的翻译{\small\bfnew{质量评价}}（Quality Evaluation）是指在参考答案或者评价标准已知的情况下对译文进行打分。这类方法可以被称作有参考答案的评价，包括人工打分、BLEU 等自动评价方法都是典型的有参考答案评价。相对的，{\small\bfnew{无参考答案的评价}}（Quality Estimation）是指在没有人工评价和参考答案的情况下，对译文质量进行评估。这类方法可以被看作是对机器翻译译文进行质量`` 预测''，这样用户可以选择性的使用机器翻译结果。这里主要讨论有参考答案的评价，因为这类方法是机器翻译系统研发所使用的主要评价方法。

 \subsection{人工评价}\index{Chapter1.5.1}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -335,7 +335,7 @@
 \parinterval 简而言之，研究者可以根据实际情况选择不同的人工评价方案，人工评价也没有统一的标准。WMT和CCMT机器翻译评测都有配套的人工评价方案，可以作为业界的参考标准。
 \subsection{自动评价}\index{Chapter1.5.2}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\parinterval 由于人工评价费事费力，同时具有一定的主观性，甚至同一篇文章不同人在不同时刻的理解都会不同，因此自动评价是也是机器翻译系统研发人员所青睐的方法。自动评价的方式虽然不如人工评价准确，但是具有速度快，成本低、一致性高的优点。而且随着评价技术的不断发展，自动评价方式已经具有了比较好的指导性，可以帮助我们快速了解当前机器翻译译文的质量。在机器翻译领域，自动评价已经成为了一个重要的分支，提出的自动评价方法不下几十种。在这里我们无法对这些方法一一列举，为了便于后续章节的描述，这里仅对具有代表性的一些方法进行简要介绍。
+\parinterval 由于人工评价费时费力，同时具有一定的主观性，甚至同一篇文章不同人在不同时刻的理解都会不同，因此自动评价也是机器翻译系统研发人员所青睐的方法。自动评价的方式虽然不如人工评价准确，但是具有速度快、成本低、一致性高的优点。而且随着评价技术的不断发展，自动评价方式已经具有了比较好的指导性，可以帮助使用者快速了解当前机器翻译译文的质量。在机器翻译领域，自动评价已经成为了一个重要的分支，提出的自动评价方法不下几十种。这里无法对这些方法一一列举，为了便于后续章节的描述，仅对具有代表性的一些方法进行简要介绍。

 \subsubsection{BLEU}\index{Chapter1.5.2.1}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -441,7 +441,7 @@ His house is on the south bank of the river.

 \section{机器翻译应用}\index{Chapter1.6}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\parinterval 机器翻译有着十分广泛的应用，下面看一下机器翻译在我们生活中的具体应用形式：
+\parinterval 机器翻译有着十分广泛的应用，下面看一下机器翻译在生活中的具体应用形式：

 \parinterval （一）网页翻译

@@ -453,7 +453,7 @@ His house is on the south bank of the river.

 \parinterval （三）科技文献翻译

-\parinterval 在专利等科技文献翻译中，往往需要将文献翻译为英语或者其它语言，比如摘要翻译。以往这种翻译工作通常由人工来完成。由于翻译质量要求较高，因此要求翻译人员具有相关背景知识，这导致译员资源稀缺。特别是，近几年国内专利申请数不断增加，这给人工翻译带来了很大的负担。相比于人工翻译，机器翻译可以在短时间内完成大量的专利翻译，同时结合术语词典和人工校对等方式，可以保证专利的翻译质量。同时，以专利为代表的科技文献往往具有很强的领域性，针对各类领域文本进行单独优化，机器翻译的品质可以大大提高。因此，机器翻译在专利翻译等行业有十分广泛的应用前景。
+\parinterval 在专利等科技文献翻译中，往往需要将文献翻译为英语或者其他语言，比如摘要翻译。以往这种翻译工作通常由人工来完成。由于翻译质量要求较高，因此要求翻译人员具有相关背景知识，这导致译员资源稀缺。特别是，近几年国内专利申请数不断增加，这给人工翻译带来了很大的负担。相比于人工翻译，机器翻译可以在短时间内完成大量的专利翻译，同时结合术语词典和人工校对等方式，可以保证专利的翻译质量。同时，以专利为代表的科技文献往往具有很强的领域性，针对各类领域文本进行单独优化，机器翻译的品质可以大大提高。因此，机器翻译在专利翻译等行业有十分广泛的应用前景。

 \parinterval （四）全球化

@@ -506,7 +506,7 @@ His house is on the south bank of the river.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \vspace{0.5em}
 \begin{itemize}
-\item NiuTrans.SMT：NiuTrans\cite{Tong2012NiuTrans}是由东北大学自然语言处理实验室自主研发的统计机器翻译系统，该系统可支持基于短语的模型、基于层次短语的模型以及基于句法的模型。由于使用C++ 语言开发，所以该系统运行时间快，所占存储空间少。系统中内嵌有$n$-gram语言模型，故无需使用其它的系统即可对完成语言建模。网址：\url{http://opensource.niutrans.com/smt/index.html}
+\item NiuTrans.SMT：NiuTrans\cite{Tong2012NiuTrans}是由东北大学自然语言处理实验室自主研发的统计机器翻译系统，该系统可支持基于短语的模型、基于层次短语的模型以及基于句法的模型。由于使用C++ 语言开发，所以该系统运行速度快，所占存储空间少。系统中内嵌有$n$-gram语言模型，故无需使用其他系统即可完成语言建模。网址：\url{http://opensource.niutrans.com/smt/index.html}
 \vspace{0.5em}
 \item Moses：Moses\cite{Koehn2007Moses}统计机器翻译时代最著名的系统之一，（主要）由爱丁堡大学的机器翻译团队开发。最新的Moses系统支持很多的功能，例如，它既支持基于短语的模型，也支持基于句法的模型。Moses 提供因子化翻译模型（Factored Translation Model），因此该模型可以很容易的对不同层次的信息进行建模。此外，它允许将混淆网络和字格作为输入，可缓解系统的1-best输出中的错误。Moses还提供了很多有用的脚本和工具，被机器翻译研究者广泛使用。网址：\url{http://www.statmt.org/moses/}
 \vspace{0.5em}
@@ -572,11 +572,11 @@ His house is on the south bank of the river.
 \begin{itemize}
 \item CCMT（全国机器翻译大会），前身为CWMT（全国机器翻译研讨会）是国内机器翻译领域的旗舰会议，自2005年起已经组织多次机器翻译评测，对国内机器翻译相关技术的发展产生了深远影响。该评测主要针对汉语、英语以及国内的少数民族语言（蒙古语、藏语、维吾尔语等）进行评测，领域包括新闻、口语、政府文件等，不同语言方向对应的领域也有所不同。评价方式不同届略有不同，主要采用自动评价的方式，自CWMT 2013起则针对某些领域增设人工评价。自动评价的指标一般包括BLEU-SBP、BLEU-NIST、TER、METEOR、NIST、GTM、mWER、mPER 以及ICT 等，其中以BLEU-SBP 为主，汉语为目标语的翻译采用基于字符的评价方式，面向英语的翻译基于词进行评价。每年该评测吸引国内外近数十家企业及科研机构参赛，业内认可度极高。关于CCMT的更多信息可参考官网：\url{http://www.ai-ia.ac.cn/cwmt2015/evaluation.html} （链接为CWMT 2015）。
 \vspace{0.5em}
-\item WMT由Special Interest Group for Machine Translation（SIGMT）主办，会议自2006年起每年召开一次，是一个涉及机器翻译多种任务的综合性会议，包括多领域翻译评测任务、质量评价任务以及其它与机器翻译的相关任务（如文档对齐评测等）。现在WMT已经成为机器翻译领域的旗舰评测任务，很多研究工作都以WMT任务作为基准。WMT评测涉及的语言范围较广，包括英语、德语、芬兰语、捷克语、罗马尼亚语等十多种语言，翻译方向一般以英语为核心，探索英语与其它语言之间的翻译性能，领域包括新闻、信息技术、生物医学。最近，也增加了无指导机器翻译等热门问题。WMT在评价方面类似于CCMT，也采用人工评价与自动评价相结合的方式，自动评价的指标一般为BLEU、TER 等。此外，WMT公开了所有评测数据，因此也经常被机器翻译相关人员所使用。更多WMT的机器翻译评测相关信息可参考官网：\url{http://www.sigmt.org/}。
+\item WMT由Special Interest Group for Machine Translation（SIGMT）主办，会议自2006年起每年召开一次，是一个涉及机器翻译多种任务的综合性会议，包括多领域翻译评测任务、质量评价任务以及其他与机器翻译的相关任务（如文档对齐评测等）。现在WMT已经成为机器翻译领域的旗舰评测任务，很多研究工作都以WMT任务作为基准。WMT评测涉及的语言范围较广，包括英语、德语、芬兰语、捷克语、罗马尼亚语等十多种语言，翻译方向一般以英语为核心，探索英语与其他语言之间的翻译性能，领域包括新闻、信息技术、生物医学。最近，也增加了无指导机器翻译等热门问题。WMT在评价方面类似于CCMT，也采用人工评价与自动评价相结合的方式，自动评价的指标一般为BLEU、TER 等。此外，WMT公开了所有评测数据，因此也经常被机器翻译相关人员所使用。更多WMT的机器翻译评测相关信息可参考官网：\url{http://www.sigmt.org/}。
 \vspace{0.5em}
 \item NIST机器翻译评测开始于2001年，是早期机器翻译公开评测中颇具代表性的任务，现在WMT和CCMT很多任务的设置也大量参考了当年NIST评测的内容。NIST评测由美国国家标准技术研究所主办，作为美国国防高级计划署（DARPA）中TIDES计划的重要组成部分。早期，NIST评测主要评价阿拉伯语和汉语等语言到英语的翻译效果，评价方法一般采用人工评价与自动评价相结合的方式。人工评价采用5分制评价。自动评价使用多种方式，包括BLEU，METEOR，TER以及HyTER。此外NIST从2016 年起开始对稀缺语言资源技术进行评估，其中机器翻译作为其重要组成部分共同参与评测，评测指标主要为BLEU。除对机器翻译系统进行评测之外，NIST在2008 和2010年对于机器翻译的自动评价方法（MetricsMaTr）也进行了评估，以鼓励更多研究人员对现有评价方法进行改进或提出更加贴合人工评价的方法。同时NIST评测所提供的数据集由于数据质量较高受到众多科研人员喜爱，如MT04，MT06等（汉英）平行语料经常被科研人员在实验中使用。不过，近几年NIST评测已经停止。更多NIST的机器翻译评测相关信息可参考官网：\url{https://www.nist.gov/programs-projects/machine-translation}。
 \vspace{0.5em}
-\item 从2004年开始举办的IWSLT也是颇具特色的机器翻译评测，它主要关注口语相关的机器翻译任务，测试数据包括TED talks的多语言字幕以及QED 教育讲座影片字幕等，语言涉及英语、法语、德语、捷克语、汉语、阿拉伯语等众多语言。此外在IWSLT 2016 中还加入了对于日常对话的翻译评测，尝试将微软Skype中一种语言的对话翻译成其它语言。评价方式采用自动评价的模式，评价标准和WMT类似，一般为BLEU 等指标。另外，IWSLT除了对文本到文本的翻译评测外，还有自动语音识别以及语音转另一种语言的文本的评测。更多IWSLT的机器翻译评测相关信息可参考官网：\url{https://workshop2016.iwslt.org/} （链接为IWSLT2016）
+\item 从2004年开始举办的IWSLT也是颇具特色的机器翻译评测，它主要关注口语相关的机器翻译任务，测试数据包括TED talks的多语言字幕以及QED 教育讲座影片字幕等，语言涉及英语、法语、德语、捷克语、汉语、阿拉伯语等众多语言。此外在IWSLT 2016 中还加入了对于日常对话的翻译评测，尝试将微软Skype中一种语言的对话翻译成其他语言。评价方式采用自动评价的模式，评价标准和WMT类似，一般为BLEU 等指标。另外，IWSLT除了对文本到文本的翻译评测外，还有自动语音识别以及语音转另一种语言的文本的评测。更多IWSLT的机器翻译评测相关信息可参考官网：\url{https://workshop2016.iwslt.org/} （链接为IWSLT2016）
 \vspace{0.5em}
 \item 日本举办的机器翻译评测WAT是亚洲范围内的重要评测之一，由日本科学振兴机构（JST）、情报通信研究机构（NICT）等多家机构共同组织，旨在为亚洲各国之间交流融合提供便宜之处。语言方向主要包括亚洲主流语言（汉语、韩语、印地语等）以及英语对日语的翻译，领域丰富多样，包括学术论文、专利、新闻、食谱等。评价方式包括自动评价（BLEU、RIBES以及AMFM 等）以及人工评价，其特点在于对于测试语料以段落为单位进行评价，考察其上下文关联的翻译效果。更多WAT的机器翻译评测相关信息可参考官网：\url{http://lotus.kuee.kyoto-u.ac.jp/WAT/}。
 \vspace{0.5em}

--- a/Book/Chapter2/Figures/figure-probability-values-corresponding-to-different-derivations.tex
+++ b/Book/Chapter2/Figures/figure-probability-values-corresponding-to-different-derivations.tex
-
 \definecolor{ublue}{rgb}{0.152,0.250,0.545}
 \definecolor{ugreen}{rgb}{0,0.5,0}


--- a/Book/Chapter2/chapter2.tex
+++ b/Book/Chapter2/chapter2.tex
@@ -14,16 +14,16 @@

 \parinterval 机器翻译并非是一个孤立的系统，它依赖于很多模块，并且需要很多学科知识的融合。现在的机器翻译系统大多使用统计模型对翻译问题进行建模，同时也会用到一些的自然语言处理工具对不同语言的文字进行分析。因此，在正式开始机器翻译内容的介绍之前，本章将会对相关的基础知识进行概述，包括：概率论与统计建模基础、语言分析、语言建模等。

-\parinterval 概率论与统计建模是机器翻译方法的基础。这里会对机器翻译所涉及的基本数学概念进行简要描述，确保后续使用到的数学工具是完备的。我们会重点关注如何利用统计建模的方式对自然语言处理问题进行描述，这种手段在统计机器翻译和神经机器翻译中会被使用。
+\parinterval 概率论与统计建模是机器翻译方法的基础。这里会对机器翻译所涉及的基本数学概念进行简要描述，确保后续使用到的数学工具是完备的。本章会重点关注如何利用统计建模的方式对自然语言处理问题进行描述，这种手段在统计机器翻译和神经机器翻译中会被使用。

-\parinterval 语言分析部分将以汉语为例介绍词法和句法分析的基本概念。它们都是自然语言处理中的经典问题，而且在机器翻译中也会经常被使用。同样，我们会介绍这两个任务的定义和求解问题的思路。
+\parinterval 语言分析部分将以汉语为例介绍词法和句法分析的基本概念。它们都是自然语言处理中的经典问题，而且在机器翻译中也会经常被使用。同样，本章会介绍这两个任务的定义和求解问题的思路。

-\parinterval 语言建模是机器翻译中最常用的一种技术，它主要用于句子的生成和流畅度评价。我们会以传统统计语言模型为例，对语言建模的相关概念进行介绍。但是，这里并不深入探讨语言模型技术，在后面的章节中还有会单独的内容对神经网络语言模型等前沿技术进行讨论。
+\parinterval 语言建模是机器翻译中最常用的一种技术，它主要用于句子的生成和流畅度评价。本章会以传统统计语言模型为例，对语言建模的相关概念进行介绍。但是，这里并不深入探讨语言模型技术，在后面的章节中还会有单独的内容对神经网络语言模型等前沿技术进行讨论。

 %--问题概述-----------------------------------------
 \section{问题概述 }\index{Chapter2.1}

-\parinterval 很多时候机器翻译系统被看作是孤立的``黑盒''系统（图 \ref {fig:2.1-1} (a)）。我们将一段文本作为输入送入机器翻译系统，之后得到翻译好的译文输出。但是真实的机器翻译系统要复杂的多。因为系统看到的输入和输出的实际上只是一些符号串，这些符号并没有任何其它意义，因此需要进一步对这些符号串进行处理才能更好的使用它们，比如，需要定义翻译中最基本的单元是什么？符号串是否还有结构信息？如何用数学工具刻画这些基本单元和结构？
+\parinterval 很多时候机器翻译系统被看作是孤立的``黑盒''系统（图 \ref {fig:2.1-1} (a)）。可以将一段文本作为输入送入机器翻译系统，之后得到翻译好的译文输出。但是真实的机器翻译系统要复杂的多。因为系统看到的输入和输出的实际上只是一些符号串，这些符号并没有任何其他意义，因此需要进一步对这些符号串进行处理才能更好的使用它们，比如，需要定义翻译中最基本的单元是什么？符号串是否还有结构信息？如何用数学工具刻画这些基本单元和结构？

 %----------------------------------------------
 % 图2.1
@@ -60,7 +60,7 @@

 \parinterval 一般来说，在送入机器翻译系统前需要对文字序列进行处理和加工，这个过程被称为{\small\sffamily\bfseries{预处理}}（Pre-processing）。同理，在机器翻译模型输出译文后的处理作被称作{\small\sffamily\bfseries{后处理}}（Post-processing）。这两个过程对机器翻译性能影响很大，比如，在神经机器翻译里，不同的分词策略可能会造成翻译性能的天差地别。

-\parinterval 值得注意的是，有些观点认为，不论是分词还是句法分析，对于机器翻译来说并不要求符合人的认知和语言学约束。换句话说，机器翻译所使用的``单词''和``结构''本身并不是为了符合人类的解释，它们更直接目的是为了进行翻译。从系统开发的角度，有时候即使进行一些与我们的语言习惯有差别的处理，仍然会带来性能的提升，比如在神经机器翻译中，在传统分词的基础上进一步使用双字节编码（Byte Pair Encoding，BPE）子词切分会使得机器翻译性能大幅提高。当然，自然语言处理中语言学信息的使用一直是学界关注的焦点。甚至关于语言学结构对机器翻译是否有作用这个问题也有争论。但是不能否认的是，无论是语言学的知识，还是计算机自己学习到的知识，对机器翻译都是有价值的。在后续章节会看到，这两种类型的知识对机器翻译帮助很大 \footnote[1]{笔者并不认同语言学结构对机器翻译的帮助有限，相反机器翻译需要更多的人类先验知识的指导。当然，这个问题不是这里讨论的重点。} 。
+\parinterval 值得注意的是，有些观点认为，不论是分词还是句法分析，对于机器翻译来说并不要求符合人的认知和语言学约束。换句话说，机器翻译所使用的``单词''和``结构''本身并不是为了符合人类的解释，它们更直接的目的是进行翻译。从系统开发的角度，有时候即使进行一些与人类的语言习惯有差别的处理，仍然会带来性能的提升，比如在神经机器翻译中，在传统分词的基础上进一步使用字节对编码（Byte Pair Encoding，BPE）子词切分会使得机器翻译性能大幅提高。当然，自然语言处理中语言学信息的使用一直是学界关注的焦点。甚至关于语言学结构对机器翻译是否有作用这个问题也有争论。但是不能否认的是，无论是语言学的知识，还是计算机自己学习到的知识，对机器翻译都是有价值的。在后续章节会看到，这两种类型的知识对机器翻译帮助很大 \footnote[1]{笔者并不认同语言学结构对机器翻译的帮助有限，相反机器翻译需要更多的人类先验知识的指导。当然，这个问题不是这里讨论的重点。} 。

 \parinterval 剩下的问题是如何进行句子的切分和结构的分析。思路有很多，一种常用的方法是对问题进行概率化，用统计模型来描述问题并求解之。比如，一个句子切分的好坏，并不是非零即一的判断，而是要估计出这种切分的可能性大小，最终选择可能性最大的结果进行输出。这也是一种典型的用统计建模的方式来描述自然语言处理问题。

@@ -81,9 +81,9 @@

 \parinterval {\small\bfnew{概率}}（Probability）是度量随机事件呈现其每个可能状态的可能性的数值，本质上它是一个测度函数\cite{mao-prob-book-2011}\cite{kolmogorov2018foundations}。概率的大小表征了随机事件在一次试验中发生的可能性大小。用$\textrm{P}(\cdot )$表示一个随机事件的可能性，即事件发生的概率。比如$\textrm{P}(\textrm{太阳从东方升起})$表示``太阳从东方升起的可能性''，同理，$\textrm{P}(A=B)$ 表示的就是``$A=B$'' 这件事的可能性。

-\parinterval 在实际问题中，我们往往需要得到随机变量的概率值。但是，真实的概率值可能是无法准确知道的，这时就需要对概率进行{\small\sffamily\bfseries{估计}}，得到的结果是概率的{\small\sffamily\bfseries{估计值}}（Estimate）。在概率论中，一个很简单的方法是利用相对频度作为概率的估计值。如果$\{x_1,x_2,\dots,x_n \}$是一个试验的样本空间，在相同情况下重复试验$N$次，观察到样本$x_i (1\leq{i}\leq{n})$的次数为$n (x_i )$，那么$x_i$在这$N$次试验中的相对频率是$\frac{n(x_i )}{N}$。当$N$越来越大时，相对概率也就越来越接近真实概率$\textrm{P}(x_i)$，即$\lim_{N \to \infty}\frac{n(x_i )}{N}=\textrm{P}(x_i)$。 实际上，很多概率模型都等同于相对频度估计，比如，对于一个服从多项式分布的变量的极大似然估计就可以用相对频度估计实现。
+\parinterval 在实际问题中，往往需要得到随机变量的概率值。但是，真实的概率值可能是无法准确知道的，这时就需要对概率进行{\small\sffamily\bfseries{估计}}，得到的结果是概率的{\small\sffamily\bfseries{估计值}}（Estimate）。在概率论中，一个很简单的方法是利用相对频率作为概率的估计值。如果$\{x_1,x_2,\dots,x_n \}$是一个试验的样本空间，在相同情况下重复试验$N$次，观察到样本$x_i (1\leq{i}\leq{n})$的次数为$n (x_i )$，那么$x_i$在这$N$次试验中的相对频率是$\frac{n(x_i )}{N}$。当$N$越来越大时，相对频率也就越来越接近真实概率$\textrm{P}(x_i)$，即$\lim_{N \to \infty}\frac{n(x_i )}{N}=\textrm{P}(x_i)$。 实际上，很多概率模型都等同于相对频率估计，比如，对于一个服从多项式分布的变量的极大似然估计就可以用相对频率估计实现。
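
The relative-frequency estimate $n(x_i)/N$ described above can be illustrated with a small simulation. The following is only a sketch: it assumes a fair six-sided die, and the function name `relative_frequencies` and the fixed random seed are choices made for this illustration, not part of the original text.

```python
import random
from collections import Counter


def relative_frequencies(num_trials, num_faces=6, seed=0):
    """Roll a fair die num_trials times and return n(x_i) / N for every face."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    counts = Counter(rng.randint(1, num_faces) for _ in range(num_trials))
    return {face: counts[face] / num_trials for face in range(1, num_faces + 1)}


if __name__ == "__main__":
    # The largest gap between an estimate and the true value 1/6 tends to
    # shrink as the number of trials N grows.
    for n in (60, 6000, 60000):
        freqs = relative_frequencies(n)
        worst = max(abs(f - 1 / 6) for f in freqs.values())
        print(f"N={n:>5}  largest deviation from 1/6: {worst:.4f}")
```

The estimates always sum to one because every roll is counted exactly once, which matches the normalization property of a probability function.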

-\parinterval 概率函数是用函数形式给出离散变量每个取值发生的概率，其实就是将变量的概率分布转化为数学表达形式。如果我们把$A$看做一个离散变量，$a$看做变量$A$的一个取值，那么$\textrm{P}(A)$被称作变量$A$的概率函数，$\textrm{P}(A=a)$被称作$A = a$的概率值，简记为$\textrm{P}(a)$。例如，在相同条件下掷一个骰子50次，用$A$表示投骰子出现的点数这个离散变量，$a_i$表示点数的取值，$\textrm{P}_i$表示$A=a_i$的概率值。下表为$A$的概率分布，给出了$A$的所有取值及其概率。
+\parinterval 概率函数是用函数形式给出离散变量每个取值发生的概率，其实就是将变量的概率分布转化为数学表达形式。如果把$A$看做一个离散变量，$a$看做变量$A$的一个取值，那么$\textrm{P}(A)$被称作变量$A$的概率函数，$\textrm{P}(A=a)$被称作$A = a$的概率值，简记为$\textrm{P}(a)$。例如，在相同条件下掷一个骰子50次，用$A$表示投骰子出现的点数这个离散变量，$a_i$表示点数的取值，$\textrm{P}_i$表示$A=a_i$的概率值。下表为$A$的概率分布，给出了$A$的所有取值及其概率。
 %表1--------------------------------------------------------------------
 \begin{table}[htp]
 \centering
@@ -99,7 +99,7 @@

 \parinterval 除此之外，概率函数$\textrm{P}(\cdot)$还具有非负性、归一性等特点，非负性是指，所有的概率函数$\textrm{P}(\cdot)$都必须是大于等于0的数值，概率函数中不可能出现负数：$\forall{x},\textrm{P}{(x)}\geq{0}$。归一性，又称规范性，简单的说就是所有可能发生的事件的概率总和为1，即$\sum_{x}\textrm{P}{(x)}={1}$。

-\parinterval 对于离散变量$A$，$\textrm{P}(A=a)$是个确定的值，可以表示事件$A=a$的可能性大小；而对于连续变量，求在某个定点处的概率是无意义的，只能求其落在某个取值区间内的概率。因此，用{\small\sffamily\bfseries{概率分布函数$F(x)$}}和{\small\sffamily\bfseries{概率密度函数}}$f(x)$来统一描述随机变量的取值分布情况。概率分布函数$F(x)$表示取值小于某个值的概率，是概率的累加（或积分）形式。假设$A$是一个随机变量，$a$是任意实数，将函数$F(a)=\textrm{P}\{A\leq a\}$，$-\infty<a<\infty $定义为$A$的分布函数。通过分布函数，我们可以清晰地表示任何随机变量的概率。
+\parinterval 对于离散变量$A$，$\textrm{P}(A=a)$是个确定的值，可以表示事件$A=a$的可能性大小；而对于连续变量，求在某个定点处的概率是无意义的，只能求其落在某个取值区间内的概率。因此，用{\small\sffamily\bfseries{概率分布函数$F(x)$}}和{\small\sffamily\bfseries{概率密度函数}}$f(x)$来统一描述随机变量的取值分布情况。概率分布函数$F(x)$表示取值小于某个值的概率，是概率的累加（或积分）形式。假设$A$是一个随机变量，$a$是任意实数，将函数$F(a)=\textrm{P}\{A\leq a\}$，$-\infty<a<\infty $定义为$A$的分布函数。通过分布函数，可以清晰地表示任何随机变量的概率。

 \parinterval 概率密度函数反映了变量在某个区间内的概率变化快慢，概率密度函数的值是概率的变化率，该连续变量的概率也就是对概率密度函数求积分得到的结果。设$f(x) \geq 0$是连续变量$X$的概率密度函数，$X$的分布函数就可以用如下公式定义：

@@ -133,7 +133,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \end{eqnarray}
 %----------------------------------------------

-\parinterval {\small\sffamily\bfseries{边缘概率}}（marginal probability）是和联合概率对应的，它指的是$\textrm{P}(X=a)$或$\textrm{P}(Y=b)$，即仅与单个随机变量有关的概率称为边缘概率。对于离散随机变量$X$和$Y$，我们知道$\textrm{P}(X,Y)$，则边缘概率$\textrm{P}(X)$可以通过求和的方式得到。对于$\forall x \in X $，有
+\parinterval {\small\sffamily\bfseries{边缘概率}}（marginal probability）是和联合概率对应的，它指的是$\textrm{P}(X=a)$或$\textrm{P}(Y=b)$，即仅与单个随机变量有关的概率称为边缘概率。对于离散随机变量$X$和$Y$，如果知道$\textrm{P}(X,Y)$，则边缘概率$\textrm{P}(X)$可以通过求和的方式得到。对于$\forall x \in X $，有
 \begin{eqnarray}
 \textrm{P}(X=x)=\sum_{y}  \textrm{P}(X=x,Y=y)
 \label{eq:2.2-2}
@@ -147,7 +147,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \end{eqnarray}
 %----------------------------------------------

-\parinterval 为了更好的区分条件概率、边缘概率和联合概率，这里我们用一个图形面积的计算来举例说明。如图\ref{fig:2.2-2}所示，矩形$A$代表事件$X$发生所对应的所有可能状态，矩形$B$代表事件$Y$发生所对应的所有可能状态，矩形$C$代表$A$和$B$的交集，则
+\parinterval 为了更好的区分条件概率、边缘概率和联合概率，这里用一个图形面积的计算来举例说明。如图\ref{fig:2.2-2}所示，矩形$A$代表事件$X$发生所对应的所有可能状态，矩形$B$代表事件$Y$发生所对应的所有可能状态，矩形$C$代表$A$和$B$的交集，则

 \begin{itemize}
 \item 边缘概率：矩形$A$或者矩形$B$的面积；
@@ -177,14 +177,14 @@ F(X)=\int_{-\infty}^x f(x)dx
 \end{eqnarray}
 %----------------------------------------------

-\parinterval 推广到$n$个事件，我们得到了链式法则的公式
+\parinterval 推广到$n$个事件，可以得到链式法则的公式
 \begin{eqnarray}
 \textrm{P}(x_1,x_2,...,x_n)=\textrm{P}(x_1) \prod_{i=2}^n \textrm{P}(x_i \mid x_1,x_2,...,x_{i-1})
 \label{eq:2.2-5}
 \end{eqnarray}
 %----------------------------------------------
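
As a sanity check on the chain rule, the sketch below builds a small joint distribution and confirms numerically that the product of conditionals reproduces each joint probability. The distribution (three binary variables with made-up weights) and the helper names `marginal` and `chain_rule` are hypothetical and serve only as an illustration.

```python
from itertools import product

# Hypothetical joint distribution over three binary variables (x1, x2, x3).
weights = [3, 1, 2, 2, 1, 4, 2, 1]
total = sum(weights)
joint = {xs: w / total for xs, w in zip(product((0, 1), repeat=3), weights)}


def marginal(prefix):
    """P(x_1..x_k = prefix), obtained by summing out the remaining variables."""
    k = len(prefix)
    return sum(p for xs, p in joint.items() if xs[:k] == prefix)


def chain_rule(xs):
    """P(x_1) * prod_{i=2}^{n} P(x_i | x_1, ..., x_{i-1})."""
    prob = marginal(xs[:1])
    for i in range(2, len(xs) + 1):
        # Conditional probability written as a ratio of two marginals.
        prob *= marginal(xs[:i]) / marginal(xs[:i - 1])
    return prob


for xs, p in joint.items():
    print(xs, round(p, 4), round(chain_rule(xs), 4))
```

Each conditional is computed as a ratio of marginals, so the intermediate terms cancel and only the full joint probability remains, which is exactly why the factorization holds.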

-\parinterval 我们可以通过下面这个例子更好的理解链式法则，如图所示，$A$、$B$、$C$、$D$、\\ $E$分别代表五个事件，其中，$A$只和$B$有关，$C$只和$B$、$D$有关，$E$只和$C$有关，$B$和$D$不依赖其他任何事件。则$\textrm{P}(A,B,C,D,E)$的表达式如下式：
+\parinterval 下面的例子有助于更好的理解链式法则，如图\ref{fig:2.2-3}所示，$A$、$B$、$C$、$D$、\\ $E$分别代表五个事件，其中，$A$只和$B$有关，$C$只和$B$、$D$有关，$E$只和$C$有关，$B$和$D$不依赖其他任何事件。则$\textrm{P}(A,B,C,D,E)$的表达式如下式：

 %----------------------------------------------
 % 图2.5
@@ -205,7 +205,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \label{eq:2.2-6}
 \end{eqnarray}

-\parinterval 根据图\ref {fig:2.2-3} 易知$E$只和$C$有关，所以$\textrm{P}(E \mid A,B,C,D)=\textrm{P}(E \mid C)$；$D$不依赖于其它事件，所以$\textrm{P}(D \mid A,B,C)=\textrm{P}(D)$；$C$只和$B$、$D$有关，所以$\textrm{P}(C \mid A,B)=\textrm{P}(C \mid B)$；$B$不依赖于其他事件，所以$\textrm{P}(B \mid  A)=\textrm{P}(B)$。最终化简可得：
+\parinterval 根据图\ref {fig:2.2-3} 易知$E$只和$C$有关，所以$\textrm{P}(E \mid A,B,C,D)=\textrm{P}(E \mid C)$；$D$不依赖于其他事件，所以$\textrm{P}(D \mid A,B,C)=\textrm{P}(D)$；$C$只和$B$、$D$有关，所以$\textrm{P}(C \mid A,B)=\textrm{P}(C \mid B)$；$B$不依赖于其他事件，所以$\textrm{P}(B \mid  A)=\textrm{P}(B)$。最终化简可得：
 %---------------------------------------------
 \begin{eqnarray}
 \textrm{P}(A,B,C,D,E)=\textrm{P}(E \mid C) \cdot \textrm{P}(D) \cdot \textrm{P}(C \mid B) \cdot \textrm{P}(B)
@@ -215,7 +215,7 @@ F(X)=\int_{-\infty}^x f(x)dx

 \subsection{贝叶斯法则}\index{Chapter2.2.4}

-\parinterval 首先介绍一下全概率公式：{\small\bfnew{全概率公式}}（Law of Total Probability）是概率论中重要的公式，它可以将一个复杂事件发生的概率分解成不同情况的小事件发生概率的和。这里我们先介绍一个概念——划分。若集合$S$的一个划分事件为$\{B_1,...,B_n\}$是指它们满足$\bigcup_{i=1}^n B_i=S \textrm{且}B_iB_j=\varnothing , i,j=1,...,n,i\neq j$。设$\{B_1,...,B_n\}$是$S$的一个划分，则事件$A$的全概率公式可以被描述为：
+\parinterval 首先介绍一下全概率公式：{\small\bfnew{全概率公式}}（Law of Total Probability）是概率论中重要的公式，它可以将一个复杂事件发生的概率分解成不同情况的小事件发生概率的和。这里先介绍一个概念——划分。若集合$S$的一个划分事件为$\{B_1,...,B_n\}$是指它们满足$\bigcup_{i=1}^n B_i=S \textrm{且}B_iB_j=\varnothing , i,j=1,...,n,i\neq j$。设$\{B_1,...,B_n\}$是$S$的一个划分，则事件$A$的全概率公式可以被描述为：

 %---------------------------------------------
 \begin{eqnarray}
@@ -251,7 +251,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \end{eqnarray}
 %--------------------------------------------

-\noindent 其中，等式右端的分母部分使用了全概率公式。由上式，我们也可以得到贝叶斯公式的另外两种写法:
+\noindent 其中，等式右端的分母部分使用了全概率公式。由上式，也可以得到贝叶斯公式的另外两种写法:
 \begin{eqnarray}
 \textrm{P}(B \mid A) & = & \frac { \textrm{P}(A \mid B)\textrm{P}(B) }  {\textrm{P}(A)} \nonumber \\
                     & = & \frac { \textrm{P}(A \mid B)\textrm{P}(B) }  {\textrm{P}(A \mid B)\textrm{P}(B)+\textrm{P}(A \mid \bar{B}) \textrm{P}(\bar{B})}
@@ -266,7 +266,7 @@ F(X)=\int_{-\infty}^x f(x)dx

 \subsubsection{信息熵}\index{Chapter2.2.5.1}

-\parinterval {\small\sffamily\bfseries{熵}}（Entropy）是热力学中的一个概念，同时也是对系统无序性的一种度量标准。在自然语言处理领域也会使用到信息熵这一概念，比如描述文字的信息量大小。一条信息的信息量可以被看作是这条信息的不确定性。如果我们需要确认一件非常不确定甚至于一无所知的事情，那么需要理解大量的相关信息才能进行确认；同样的，如果我们对某件事已经非常确定，那么就不需要太多的信息就可以把它搞清楚。如下就是两个例子，
+\parinterval {\small\sffamily\bfseries{熵}}（Entropy）是热力学中的一个概念，同时也是对系统无序性的一种度量标准。在自然语言处理领域也会使用到信息熵这一概念，比如描述文字的信息量大小。一条信息的信息量可以被看作是这条信息的不确定性。如果需要确认一件非常不确定甚至于一无所知的事情，那么需要理解大量的相关信息才能进行确认；同样的，如果对某件事已经非常确定，那么就不需要太多的信息就可以把它搞清楚。如下就是两个例子，

 \begin{example}
 确定性和不确定性的事件
@@ -277,7 +277,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \label{e.g:2.2-1}
 \end{example}

-\parinterval 在这两句话中，``太阳从东方升起''是一件确定性事件（在地球上），几乎不需要查阅更多信息就可以确认，因此这件事的信息熵相对较低；而``明天天气多云''这件事，我们需要关注天气预报，才能大概率确定这件事，它的不确定性很高，因而它的信息熵也就相对较高。因此，信息熵也是对事件不确定性的度量。进一步，我们定义{\small\bfnew{自信息}}（Self-information）：一个事件$X$的自信息的表达式为：
+\parinterval 在这两句话中，``太阳从东方升起''是一件确定性事件（在地球上），几乎不需要查阅更多信息就可以确认，因此这件事的信息熵相对较低；而``明天天气多云''这件事，需要关注天气预报，才能大概率确定这件事，它的不确定性很高，因而它的信息熵也就相对较高。因此，信息熵也是对事件不确定性的度量。进一步，可以定义{\small\bfnew{自信息}}（Self-information），一个事件$x$的自信息的表达式为：
 \begin{eqnarray}
 \textrm{I}(x)=-\log\textrm{P}(x)
 \label{eq:2.2-17}
@@ -302,7 +302,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \label{eq:2.2-18}
 \end{eqnarray}

-\parinterval 一个分布的信息熵也就是从该分布中得到的一个事件的期望信息量。比如，$a$、$b$、$c$、$d$四支球队，四支队伍夺冠的概率分别是$P_1$、$P_2$、$P_3$、$P_4$，某个人对比赛不感兴趣但是又想知道哪只球队夺冠，通过使用二分法2次就确定哪支球队夺冠了。但其实，我们知道这四只球队中$c$的实力可以碾压其它球队，那么猜1次就可以确定。所以对于前面这种情况，哪只球队夺冠的信息量较高，信息熵也相对较高；对于后面这种情况，因为结果是容易猜到的，信息量和信息熵也就相对较低。因此可以得知：分布越尖锐熵越低；分布越均匀熵越高。
+\parinterval 一个分布的信息熵也就是从该分布中得到的一个事件的期望信息量。比如，$a$、$b$、$c$、$d$四支球队，四支队伍夺冠的概率分别是$P_1$、$P_2$、$P_3$、$P_4$，某个人对比赛不感兴趣但是又想知道哪支球队夺冠，通过使用二分法2次就能确定哪支球队夺冠了。但假设这四支球队中$c$的实力可以碾压其他球队，那么猜1次就可以确定。所以对于前面这种情况，哪支球队夺冠的信息量较高，信息熵也相对较高；对于后面这种情况，因为结果是容易猜到的，信息量和信息熵也就相对较低。因此可以得知：分布越尖锐熵越低；分布越均匀熵越高。
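
The four-team example can be checked numerically. The sketch below assumes the standard Shannon entropy $-\sum_x \textrm{P}(x)\log_2 \textrm{P}(x)$ measured in bits; the two probability vectors are hypothetical stand-ins for $P_1$ to $P_4$, chosen only to contrast a uniform distribution with a sharply peaked one.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical championship probabilities for teams a, b, c, d.
uniform = [0.25, 0.25, 0.25, 0.25]  # no team is favoured
peaked = [0.01, 0.01, 0.97, 0.01]   # team c dominates

print("uniform:", entropy(uniform))
print("peaked: ", entropy(peaked))
```

The uniform case gives exactly 2 bits, which corresponds to the two binary questions mentioned in the text, while the peaked case falls well below 1 bit.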

 \subsubsection{KL距离}\index{Chapter2.2.5.2}

@@ -320,7 +320,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \begin{itemize}
 \item 非负性，即$\textrm{D}_{\textrm{KL}} (\textrm{P} \parallel \textrm{Q}) \ge 0$，等号成立条件是$\textrm{P}$和$\textrm{Q}$相等。
 \vspace{0.5em}
-\item 不对称性，即$\textrm{D}_{\textrm{KL}} (\textrm{P} \parallel \textrm{Q}) \neq \textrm{D}_{\textrm{KL}} (\textrm{Q}  \parallel \textrm{P})$，所以$\textrm{KL}$距离并不是我们常用的欧式空间中的距离。为了消除这种不确定性，有时也会使用$\textrm{D}_{\textrm{KL}} (\textrm{P}  \parallel \textrm{Q})+\textrm{D}_{\textrm{KL}} (\textrm{Q}  \parallel \textrm{P})$作为度量两个分布差异性的函数。
+\item 不对称性，即$\textrm{D}_{\textrm{KL}} (\textrm{P} \parallel \textrm{Q}) \neq \textrm{D}_{\textrm{KL}} (\textrm{Q}  \parallel \textrm{P})$，所以$\textrm{KL}$距离并不是常用的欧氏空间中的距离。为了消除这种不对称性，有时也会使用$\textrm{D}_{\textrm{KL}} (\textrm{P}  \parallel \textrm{Q})+\textrm{D}_{\textrm{KL}} (\textrm{Q}  \parallel \textrm{P})$作为度量两个分布差异性的函数。
 \end{itemize}
 \vspace{0.5em}
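
The two properties listed above (non-negativity and asymmetry) are easy to observe on a small example. The sketch below assumes the usual discrete definition $\textrm{D}_{\textrm{KL}}(\textrm{P} \parallel \textrm{Q})=\sum_x \textrm{P}(x)\log\frac{\textrm{P}(x)}{\textrm{Q}(x)}$ with the natural logarithm; the distributions `p` and `q` are invented for illustration.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as aligned lists.

    Assumes Q(x) > 0 wherever P(x) > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """D_KL(P || Q) + D_KL(Q || P), the symmetrised variant mentioned above."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]

print("D(P||Q) =", kl_divergence(p, q))
print("D(Q||P) =", kl_divergence(q, p))
print("symmetric =", symmetric_kl(p, q))
```

The two directions give different values, while the sum of the two is the same regardless of argument order.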

@@ -337,7 +337,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{中文分词}\index{Chapter2.3}

-\parinterval 对于机器翻译系统而言，输入的是已经切分好的单词序列，而不是原始的字符串（图\ref{fig:2.3-1}）。比如，对于一个中文句子，单词之间是没有间隔的，因此我们需要把一个个的单词切分出来，这样机器翻译系统可以区分不同的翻译单元。甚至，我们可以对语言学上的单词进行进一步切分，得到词片段序列（比如：中国人$\to$中国/人）。我们可以把上述过程看作是一种{\small\sffamily\bfseries{分词}}（Segmentation）过程，即：将一个输入的自然语言字符串切割成单元序列（token序列），每个单元都对应可以处理的最小单位。
+\parinterval 对于机器翻译系统而言，输入的是已经切分好的单词序列，而不是原始的字符串（图\ref{fig:2.3-1}）。比如，对于一个中文句子，单词之间是没有间隔的，因此需要把一个个的单词切分出来，这样机器翻译系统可以区分不同的翻译单元。甚至，可以对语言学上的单词进行进一步切分，得到词片段序列（比如：中国人$\to$中国/人）。可以把上述过程看作是一种{\small\sffamily\bfseries{分词}}（Segmentation）过程，即：将一个输入的自然语言字符串切割成单元序列（token序列），每个单元都对应可以处理的最小单位。

 %----------------------------------------------
 % 图2.7
@@ -349,7 +349,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \end{figure}
 %-------------------------------------------
 %\vspace{-0.5em}
-\parinterval 分词得到的单元序列可以是语言学上的词序列，也可以是根据其它方式定义的基本处理单元。在本章中，我们把分词得到的一个个单元称为{\small\bfnew{单词}}（Word），或{\small\bfnew{词}}，尽管这些单元可以不是语言学上的完整单词。而这个过程也被称作{\small\bfnew{词法分析}}（Lexical Analysis）。除了汉语，词法分析在日语、泰语等单词之间无明确分割符的语言中有着广泛的应用，芬兰语、维吾尔语等一些形态学十分丰富的语言，也需要使用词法分析来解决复杂的词尾、词缀变化等形态学变化。
+\parinterval 分词得到的单元序列可以是语言学上的词序列，也可以是根据其他方式定义的基本处理单元。在本章中，可以把分词得到的一个个单元称为{\small\bfnew{单词}}（Word），或{\small\bfnew{词}}，尽管这些单元可以不是语言学上的完整单词。而这个过程也被称作{\small\bfnew{词法分析}}（Lexical Analysis）。除了汉语，词法分析在日语、泰语等单词之间无明确分割符的语言中有着广泛的应用，芬兰语、维吾尔语等一些形态学十分丰富的语言，也需要使用词法分析来解决复杂的词尾、词缀变化等形态学变化。

 \parinterval 在机器翻译中，分词系统的好坏往往会决定译文的质量。分词的目的是定义系统处理的基本单元，那么什么叫做``词''呢？关于词的定义有很多，比如：\\

@@ -371,7 +371,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \end{definition}
 %-------------------------------------------

-\parinterval 从语言学的角度，普遍认为词是可以单独运用的、包含意义的基本单位。我们使用有限的词可以组合出无限的句子，这也正体现出自然语言的奇妙之处。
+\parinterval 从语言学的角度，普遍认为词是可以单独运用的、包含意义的基本单位。使用有限的词可以组合出无限的句子，这也正体现出自然语言的奇妙之处。

 \parinterval 不过，机器翻译并不仅仅局限在语言学定义的单词。比如，神经机器翻译中广泛使用的BPE子词切分方法（第七章），可以被理解为将词的一部分也进行切开，也就是得到词片段送给机器翻译系统使用。比如，对如下英文字符串，可以得到如下切分结果
 \vspace{0.5em}
@@ -388,9 +388,9 @@ F(X)=\int_{-\infty}^x f(x)dx

 \subsection{基于词典的分词方法}\index{Chapter2.3.1}

-\parinterval 然而，计算机并不能像人类一样在概念上理解``词''，因此需要使用其它方式让计算机可以进行分词。一个最简单的方法就是给定一个词典，在这个词典中出现的汉字组合就是我们定义的``词''。也就是，我们通过一个词典定义一个标准，符合这个标准定义的字符串都是合法的``词''。
+\parinterval 然而，计算机并不能像人类一样在概念上理解``词''，因此需要使用其他方式让计算机可以进行分词。一个最简单的方法就是给定一个词典，在这个词典中出现的汉字组合就是所定义的``词''。也就是，通过一个词典定义一个标准，符合这个标准定义的字符串都是合法的``词''。

-\parinterval 在使用基于词典的分词方法时，只需预先加载词典到计算机中，扫描输入句子，查询每个词串是否出现在词典中。如图\ref{fig:2.3-2} 所示，比如，我们有一个包含六个词的词典，给定输入句子``确实现在物价很高''后，我们自左至右遍历输入句子的每个字，发现词串``确实''在词典中出现，说明``确实''是一个``词''，进行分词操作并在切分该``词''之后重复这个过程。
+\parinterval 在使用基于词典的分词方法时，只需预先加载词典到计算机中，扫描输入句子，查询每个词串是否出现在词典中。如图\ref{fig:2.3-2} 所示，有一个包含六个词的词典，给定输入句子`` 确实现在物价很高''后，分词系统自左至右遍历输入句子的每个字，发现词串``确实''在词典中出现，说明``确实''是一个``词''，进行分词操作并在切分该``词''之后重复这个过程。
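
The left-to-right scan described above might be sketched as follows. This is one possible reading of the procedure (cut as soon as the current span is a dictionary entry), not necessarily the exact algorithm in the figure; the six dictionary entries are hypothetical, since the figure's dictionary is not reproduced here, and forward maximum matching would be a common alternative.

```python
def dictionary_segment(sentence, dictionary):
    """Scan left to right and cut as soon as the current span is a dictionary entry."""
    words = []
    start = 0
    while start < len(sentence):
        end = start + 1
        # Grow the span until it matches a dictionary entry or the sentence ends.
        while end <= len(sentence) and sentence[start:end] not in dictionary:
            end += 1
        if end > len(sentence):
            # No entry starts at this position: fall back to a single character.
            end = start + 1
        words.append(sentence[start:end])
        start = end
    return words


# Hypothetical six-entry dictionary for the example sentence.
dictionary = {"确实", "实现", "现在", "物价", "很", "高"}
print("/".join(dictionary_segment("确实现在物价很高", dictionary)))
```

Because the scan commits to the first match it finds, an entry such as "实现" is never considered once "确实" has been cut, which hints at the ambiguity problems that dictionary-based methods face.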
 %----------------------------------------------
 % 图2.8
 \begin{figure}[htp]
@@ -419,7 +419,7 @@ F(X)=\int_{-\infty}^x f(x)dx

 \subsection{基于统计的分词方法}\label{sec2:statistical-seg}\index{Chapter2.3.2}

-\parinterval 既然基于词典的方法有很多问题，我们就需要一种更为有效的方法。在上文中提到，想要搭建一个分词系统，需要让计算机知道什么是``词''，那么我们可不可以给出已经切分好的分词数据，让计算机在这些数据中学习到规律呢？答案是肯定的 - 利用``数据''来让计算机明白``词''的定义，让计算机直接在数据中学到知识，这就是我们常说的数据驱动的方法。这个过程也是一个典型的基于统计建模的学习过程。
+\parinterval 既然基于词典的方法有很多问题，那么就需要一种更为有效的方法。在上文中提到，想要搭建一个分词系统，需要让计算机知道什么是``词''，那么可不可以给出已经切分好的分词数据，让计算机在这些数据中学习到规律呢？答案是肯定的：利用``数据''来让计算机明白``词''的定义，让计算机直接在数据中学到知识，这就是常说的数据驱动的方法。这个过程也是一个典型的基于统计建模的学习过程。

 \subsubsection{统计模型的学习与推断}\index{Chapter2.3.2.1}

@@ -449,7 +449,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \vspace{-0.5em}
 \subsubsection{掷骰子游戏}\index{Chapter2.3.2.2}

-\parinterval 上述过程的核心在于从数据中学习一种对分词现象的统计描述，即学习函数$\textrm{P}(\cdot)$。如何让计算机利用分词好的数据学习到分词的知识呢？可以先看一个有趣的实例，用我们生活中比较常见的掷骰子来说，掷一个骰子，玩家猜一个数字，猜中就算赢，按照一般的常识，随便选一个数字，获胜的概率是一样的，即我们所有选择的获胜概率仅是$1/6$。因此这个游戏玩家很难获胜，除非运气很好。假如，我们进行一次游戏，玩家随便选了一个数字，比如是1，投掷30骰子，得到命中$7/30 > 1/6$，还不错。
+\parinterval 上述过程的核心在于从数据中学习一种对分词现象的统计描述，即学习函数$\textrm{P}(\cdot)$。如何让计算机利用分词好的数据学习到分词的知识呢？可以先看一个有趣的实例，用生活中比较常见的掷骰子来说，掷一个骰子，玩家猜一个数字，猜中就算赢，按照一般的常识，随便选一个数字，获胜的概率是一样的，即所有选择的获胜概率都是$1/6$。因此这个游戏玩家很难获胜，除非运气很好。假设进行一次游戏，玩家随便选了一个数字，比如是1，投掷30次骰子，命中率为$7/30 > 1/6$，还不错。
 \vspace{-0.5em}
 %----------------------------------------------
 % 图2.11
@@ -483,13 +483,13 @@ F(X)=\int_{-\infty}^x f(x)dx
 \label{eq:2.3-2}
 \end{eqnarray}

-\noindent 这里，$\theta_1 \sim \theta_5$可以被看作是模型的参数，因此这个模型的自由度是5。对于这样的模型，参数确定了，模型也就确定了。但是，新的问题来了，在定义骰子每个面的概率后，如何求出具体的值呢？一种常用的方法是，从大量实例中学习模型参数，这个方法也是常说的{\small\bfnew{参数估计}}（Parameter Estimation）。我们可以将这个不均匀的骰子先实验性的掷很多次，这可以被看作是独立同分布的若干次采样，比如$X$ 次，发现``1'' 出现$X_1$ 次，``2'' 出现$X_2$ 次，以此类推，得到了各个面出现的次数。假设掷骰子中每个面出现的概率符合多项式分布，通过简单的概率论知识可以知道每个面出现概率的极大似然估计为：
+\noindent 这里，$\theta_1 \sim \theta_5$可以被看作是模型的参数，因此这个模型的自由度是5。对于这样的模型，参数确定了，模型也就确定了。但是，新的问题来了，在定义骰子每个面的概率后，如何求出具体的值呢？一种常用的方法是，从大量实例中学习模型参数，这个方法也是常说的{\small\bfnew{参数估计}}（Parameter Estimation）。可以将这个不均匀的骰子先实验性的掷很多次，这可以被看作是独立同分布的若干次采样，比如$X$ 次，发现``1'' 出现$X_1$ 次，``2'' 出现$X_2$ 次，以此类推，得到了各个面出现的次数。假设掷骰子中每个面出现的概率符合多项式分布，通过简单的概率论知识可以知道每个面出现概率的极大似然估计为：
 \begin{eqnarray}
 \textrm{P(``i'')}=\frac {X_i}{X}
 \label{eq:2.3-3}
 \end{eqnarray}

-\parinterval 当$X$足够大的时，$\frac{X_i}{X}$可以无限逼近P(``$i$'')的真实值，因此可以通过大量的实验推算出掷骰子各个面的概率的准确估计值。回归到我们的问题中，如果我们在正式开始游戏前，预先掷骰子30次，得到如图\ref{fig:2.3-6}的结果。
+\parinterval 当$X$足够大时，$\frac{X_i}{X}$可以无限逼近P(``$i$'')的真实值，因此可以通过大量的实验推算出掷骰子各个面的概率的准确估计值。回归到原始的问题，如果在正式开始游戏前，预先掷骰子30次，得到如图\ref{fig:2.3-6}的结果。

 %----------------------------------------------
 % 图2.12
@@ -501,7 +501,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \end{figure}
 %-------------------------------------------

-\parinterval 于是，我们看到了一个有倾向性的模型（图 \ref{fig:2.3-7}）：在这样的预先实验基础上，我们知道如果再次玩掷骰子游戏的话，选则数字``4''获胜的可能性是最大的。
+\parinterval 于是，可以看到一个有倾向性的模型（图 \ref{fig:2.3-7}）：在这样的预先实验基础上，可以知道如果再次玩掷骰子游戏的话，选择数字``4''获胜的可能性是最大的。
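
The estimate-then-choose reasoning above can be written down in a few lines. In this sketch the 30 rolls are hypothetical (the actual outcomes appear only in the figure); they are arranged so that face 4 is the most frequent, in line with the conclusion in the text, and the function name `mle_face_probabilities` is an illustrative choice.

```python
from collections import Counter


def mle_face_probabilities(rolls, num_faces=6):
    """Maximum likelihood estimate P("i") = X_i / X for each face of the die."""
    counts = Counter(rolls)
    total = len(rolls)
    return {face: counts[face] / total for face in range(1, num_faces + 1)}


# Hypothetical record of 30 preliminary rolls of a biased die.
rolls = [4] * 10 + [1] * 5 + [2] * 4 + [3] * 4 + [5] * 4 + [6] * 3

probs = mle_face_probabilities(rolls)
best_guess = max(probs, key=probs.get)  # the face with the highest estimated probability
print(probs)
print("best guess:", best_guess)
```

With only 30 samples the estimates are rough, so a real experiment would normally use many more rolls before trusting the chosen face.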

 %----------------------------------------------
 % 图2.13
@@ -513,7 +513,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \end{figure}
 %-------------------------------------------

-\parinterval 通过上面这个掷骰子的游戏，可以得到一个道理：{\small\sffamily\bfseries{上帝是不公平的}}。因为在``公平''的世界中，没有任何一个模型可以学到有价值的事情。从机器学习的角度来看，所谓的``不公平''实际上这是客观事物中蕴含的一种{\small\sffamily\bfseries{偏置}}（Bias），也就是很多事情天然就有对某些情况有倾向。而图像处理、自然语言处理等问题中绝大多数都存在着偏置。比如，我们翻译一个英文单词的时候，它最可能的翻译结果往往就是那几个词。我们设计统计模型的目的正是要学习这种偏置，之后利用这种偏置对新的问题做出足够好的决策。
+\parinterval 通过上面这个掷骰子的游戏，可以得到一个道理：{\small\sffamily\bfseries{上帝是不公平的}}。因为在``公平''的世界中，没有任何一个模型可以学到有价值的事情。从机器学习的角度来看，所谓的``不公平''实际上是客观事物中蕴含的一种{\small\sffamily\bfseries{偏置}}（Bias），也就是很多事情天然就对某些情况有倾向。而图像处理、自然语言处理等问题中绝大多数都存在着偏置。比如，翻译一个英文单词的时候，它最可能的翻译结果往往就是那几个词。设计统计模型的目的正是要学习这种偏置，之后利用这种偏置对新的问题做出足够好的决策。

 \subsubsection{全概率分词方法}\index{Chapter2.3.2.3}

@@ -539,7 +539,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \end{figure}
 %-------------------------------------------

-\parinterval 如果，我们把这些数字换成汉语中的词，比如
+\parinterval 如果，把这些数字换成汉语中的词，比如

 \parinterval 88\; = \; 这

@@ -549,7 +549,7 @@ F(X)=\int_{-\infty}^x f(x)dx

 \parinterval ...

-\parinterval 之后可以得到图\ref{fig:2.3-9}所示的结果。
+\parinterval 就可以得到图\ref{fig:2.3-9}所示的结果。

 %----------------------------------------------
 % 图2.15
@@ -562,7 +562,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \end{figure}
 %-------------------------------------------

-\parinterval 于是，在中文分词问题中，可以假设我们拥有一个不均匀的多面骰子，每个面都对应一个单词。我们获取人工分词标注数据后，可以统计每个单词出现的次数，进而利用极大似然估计推算出每个单词出现的概率的估计值。图\ref{fig:2.3-10}给出了一个实例。
+\parinterval 于是，在中文分词问题中，可以假设有一个不均匀的多面骰子，每个面都对应一个单词。在获取人工分词标注数据后，可以统计每个单词出现的次数，进而利用极大似然估计推算出每个单词出现的概率的估计值。图\ref{fig:2.3-10}给出了一个实例。

 %----------------------------------------------
 % 图2.16
@@ -606,16 +606,16 @@ F(X)=\int_{-\infty}^x f(x)dx

 \parinterval 最后再整体看一下分词系统的学习和使用过程。如图\ref {fig:2.3-8}所示，我们利用大量人工标注好的分词数据，通过统计学习方法获得一个统计模型$\textrm{P}(\cdot)$，给定任意分词结果$W=w_1 w_2...w_m$，都能通过$\textrm{P}(W)=\textrm{P}(w_1) \cdot \textrm{P}(w_2 ) \cdot ... \cdot \textrm{P}(w_m)$计算这种切分的概率值。

-\parinterval 经过充分训练的统计模型$\textrm{P}(\cdot)$就是我们得到分词模型。对于输入的新句子$S$，通过这个模型找到最佳的分词结果$W^*$输出。假设输入句子$S$是``确实现在数据很多''，可以通过列举获得不同切分方式的概率，其中概率最高的切分方式，就是我们的目标输出。
+\parinterval 经过充分训练的统计模型$\textrm{P}(\cdot)$就是得到的分词模型。对于输入的新句子$S$，通过这个模型找到最佳的分词结果$W^*$输出。假设输入句子$S$是``确实现在数据很多''，可以通过列举获得不同切分方式的概率，其中概率最高的切分方式，就是系统的目标输出。
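 
 \parinterval 上述搜索过程可以形式化地写成如下形式。这只是一个示意，其中候选切分的集合$\Omega(S)$是此处为说明而引入的记号：
 \begin{eqnarray}
 W^* & = & \mathop{\arg\max}_{W \in \Omega(S)} \textrm{P}(W) \nonumber
 \end{eqnarray}
 
 \noindent 例如，对于候选切分``确实/现在/数据/很/多''，其概率按照1-gram模型计算为：
 \begin{eqnarray}
 \textrm{P}(\textrm{``确实/现在/数据/很/多''}) & = & \textrm{P}(\textrm{``确实''}) \cdot \textrm{P}(\textrm{``现在''}) \cdot \textrm{P}(\textrm{``数据''}) \cdot \textrm{P}(\textrm{``很''}) \cdot \textrm{P}(\textrm{``多''}) \nonumber
 \end{eqnarray}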

 \parinterval 这种分词方法也被称作基于1-gram语言模型的分词，或全概率分词，使用标注好的分词数据进行学习，获得分词模型。这种方法最大的优点是整个学习过程（模型训练过程）和推断过程（处理新句子进行切分的过程）都是全自动进行的。虽然这种方法十分简单，但是其效率很高，因此被广泛使用在工业界系统里。

-\parinterval 当然，真正的分词系统还需要解决很多其它问题，比如使用动态规划等方法高效搜索最优解以及如何处理未见过的词等等，由于本节的重点是介绍中文分词的基础方法和统计建模思想，因此无法覆盖所有中文分词的技术内容，有兴趣的读者可以参考\ref{sec2:summary}节的相关文献做进一步深入研究。
+\parinterval 当然，真正的分词系统还需要解决很多其他问题，比如使用动态规划等方法高效搜索最优解以及如何处理未见过的词等等，由于本节的重点是介绍中文分词的基础方法和统计建模思想，因此无法覆盖所有中文分词的技术内容，有兴趣的读者可以参考\ref{sec2:summary}节的相关文献做进一步深入研究。

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{$n$-gram语言模型 }\index{Chapter2.4}

-\parinterval 在基于统计的汉语分词模型中，我们通过``大题小做''的技巧，利用独立性假设把整个句子的单词切分概率转化为每个单个词出现概率的乘积。这里，每个单词也被称作1-gram（或uni-gram），而1-gram概率的乘积实际上也是在度量词序列出现的可能性（记为$\textrm{P}(w_1 w_2...w_m)$）。这种计算整个单词序列概率$\textrm{P}(w_1 w_2...w_m)$的方法被称为统计语言模型。1-gram语言模型是最简单的一种语言模型，它没有考虑任何的上下文。很自然的一个问题是：能否考虑上下文信息构建更强大的语言模型，进而得到更准确的分词结果。下面我们将进一步介绍更加通用的$n$-gram语言模型，它在机器翻译及其它自然语言处理任务中有更加广泛的应用。
+\parinterval 在基于统计的汉语分词模型中，我们通过``大题小做''的技巧，利用独立性假设把整个句子的单词切分概率转化为每个单个词出现概率的乘积。这里，每个单词也被称作1-gram（或uni-gram），而1-gram概率的乘积实际上也是在度量词序列出现的可能性（记为$\textrm{P}(w_1 w_2...w_m)$）。这种计算整个单词序列概率$\textrm{P}(w_1 w_2...w_m)$的方法被称为统计语言模型。1-gram语言模型是最简单的一种语言模型，它没有考虑任何的上下文。很自然的一个问题是：能否考虑上下文信息构建更强大的语言模型，进而得到更准确的分词结果。下面将进一步介绍更加通用的$n$-gram语言模型，它在机器翻译及其他自然语言处理任务中有更加广泛的应用。

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{建模}\index{Chapter2.4.1}
@@ -662,7 +662,6 @@ F(X)=\int_{-\infty}^x f(x)dx
 }
 \end{center}

-\vspace{-0.3em}
 \parinterval 可以看到，1-gram语言模型只是$n$-gram语言模型的一种特殊形式。$n$-gram的优点在于，它所使用的历史信息是有限的，即$n-1$个单词。这种性质也反映了经典的马尔可夫链的思想\cite{liuke-markov-2004}\cite{resnick1992adventures}，有时也被称作马尔可夫假设或者马尔可夫属性。因此$n$-gram也可以被看作是变长序列上的一种马尔可夫模型，比如，2-gram语言模型对应着1阶马尔可夫模型，3-gram语言模型对应着2阶马尔可夫模型，以此类推。

 \parinterval 那么，如何计算$\textrm{P}(w_m|w_{m-n+1} ... w_{m-1})$呢？有很多种选择，比如：
@@ -684,7 +683,7 @@ F(X)=\int_{-\infty}^x f(x)dx

 \parinterval 极大似然估计方法和前面介绍的统计分词中的方法是一致的，它的核心是使用$n$-gram出现的频度进行参数估计，因此也是自然语言处理中一类经典的$n$-gram方法。基于人工神经网络的方法在近些年也非常受关注，它直接利用多层神经网络对问题的输入$(w_{m-n+1}...w_{m-1})$和输出$\textrm{P}(w_m|w_{m-n+1} ... w_{m-1})$进行建模，而模型的参数通过网络中神经元之间连接的权重进行体现。严格意义上来说，基于人工神经网络的方法并不算基于$n$-gram的方法，或者说它并没有显性记录$n$-gram的生成概率，也不依赖$n$-gram的频度进行参数估计。为了保证内容的连贯性，本章将仍以传统$n$-gram语言模型为基础进行讨论，基于人工神经网络的方法将会在第五章和第六章进行详细介绍。

-\parinterval $n$-gram语言模型的使用非常简单。我们可以像\ref{sec2:statistical-seg}节中一样，直接用它来对词序列出现的概率进行计算。比如，可以使用一个2-gram语言模型计算一个分词序列的概率：
+\parinterval $n$-gram语言模型的使用非常简单。可以像\ref{sec2:statistical-seg}节中一样，直接用它来对词序列出现的概率进行计算。比如，可以使用一个2-gram语言模型计算一个分词序列的概率：

 \begin{eqnarray}
 & &\textrm{P}_{2-gram}{(\textrm{``确实}/\textrm{现在}/\textrm{数据}/\textrm{很}/\textrm{多''})} \nonumber \\
@@ -693,7 +692,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \label{eq:2.4-4}
 \end{eqnarray}

-\parinterval 以$n$-gram语言模型为代表的统计语言模型的应用非常广泛。除了分词，在文本生成、信息检索、摘要等自然语言处理任务中，语言模型都有举足轻重的地位。包括近些年非常受关注的预训练模型，本质上也是统计语言模型。这些技术都会在后续章节进行介绍。值得注意的是，统计语言模型给我们解决自然语言处理问题提供了一个非常好的建模思路，即：把整个序列生成的问题转化为逐个生成单词的问题。很快我们就会看到，这种建模方式会被广泛的用于机器翻译建模，在统计机器翻译和神经机器翻译中都会有明显的体现。
+\parinterval 以$n$-gram语言模型为代表的统计语言模型的应用非常广泛。除了分词，在文本生成、信息检索、摘要等自然语言处理任务中，语言模型都有举足轻重的地位。包括近些年非常受关注的预训练模型，本质上也是统计语言模型。这些技术都会在后续章节进行介绍。值得注意的是，统计语言模型为解决自然语言处理问题提供了一个非常好的建模思路，即：把整个序列生成的问题转化为逐个生成单词的问题。很快我们就会看到，这种建模方式会被广泛的用于机器翻译建模，在统计机器翻译和神经机器翻译中都会有明显的体现。

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{未登录词和平滑算法}\label{sec2:smoothing}\index{Chapter2.4.2}
@@ -720,12 +719,12 @@ F(X)=\int_{-\infty}^x f(x)dx

 \parinterval 为了解决未登录词引起的零概率问题，常用的做法是对模型进行平滑处理，也就是给可能出现的情况一个非零的概率，使得模型不会对整个序列给出零概率。平滑可以用``劫富济贫''这一思想理解，在保证所有情况的概率和为1的前提下，使极低概率的部分可以从高概率的部分分配到一部分概率，从而达到平滑的目的。

-\parinterval 语言模型使用的平滑算法有很多。在本节中，主要介绍三种平滑方法：加法平滑法、古德-图灵估计法和Kneser-Ney平滑。这些方法也可以被应用到其它任务的概率平滑操作中。
+\parinterval 语言模型使用的平滑算法有很多。在本节中，主要介绍三种平滑方法：加法平滑法、古德-图灵估计法和Kneser-Ney平滑。这些方法也可以被应用到其他任务的概率平滑操作中。

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsection{加法平滑方法}\index{Chapter2.4.2.1}

-\parinterval {\small\bfnew{加法平滑}}（Additive Smoothing）是一种简单的平滑技术。我们首先介绍这一方法，希望通过它了解平滑算法的思想。通常情况下，我们会利用采集到的语料库来模拟真实的全部语料库。当然，没有一个语料库能覆盖所有的语言现象。常见的一个问题是，使用的语料无法涵盖所有的词汇。因此，直接依据这样语料所获得的统计信息来获取语言模型就会产生偏差。假设依据某语料$C$ （从未出现`` 确实 现在''二元语法），评估一个已经分好词的句子$S$ =``确实/现在/物价/很/高''的概率。当计算``确实/现在''的概率时，$\textrm{P}(S) = 0$。显然这个结果是不合理的。
+\parinterval {\small\bfnew{加法平滑}}（Additive Smoothing）是一种简单的平滑技术。本小节首先介绍这一方法，希望通过它了解平滑算法的思想。通常情况下，系统研发者会利用采集到的语料库来模拟真实的全部语料库。当然，没有一个语料库能覆盖所有的语言现象。常见的一个问题是，使用的语料无法涵盖所有的词汇。因此，直接依据这样语料所获得的统计信息来获取语言模型就会产生偏差。假设依据某语料$C$ （从未出现`` 确实 现在''二元语法），评估一个已经分好词的句子$S$ =``确实/现在/物价/很/高''的概率。当计算``确实/现在''的概率时，$\textrm{P}(S) = 0$。显然这个结果是不合理的。

 \parinterval 加法平滑方法假设每个$n$-gram出现的次数比实际统计次数多$\theta$次，$0 \le \theta\le 1$。这样，计算概率的时候分子部分不会为0。重新计算$\textrm{P}(\textrm{现在}|\textrm{确实})$，可以得到：

@@ -735,7 +734,7 @@ F(X)=\int_{-\infty}^x f(x)dx
 \label{eq:2.4-7}
 \end{eqnarray}

-\noindent 其中，$V$表示所有词汇的词表，$|V|$为词表中单词的个数，$w$为词典中的一个词。有时候，加法平滑方法会将$\theta$取1，这时我们称之为加一平滑或是拉普拉斯平滑。这种方法比较容易理解，也比较简单，因此也往往被用于系统的快速原型中。
+\noindent 其中，$V$表示所有词汇的词表，$|V|$为词表中单词的个数，$w$为词典中的一个词。有时候，加法平滑方法会将$\theta$取1，这时称之为加一平滑或是拉普拉斯平滑。这种方法比较容易理解，也比较简单，因此也往往被用于系统的快速原型中。

 \parinterval 举一个例子。假设在一个英文文档中随机采样一些单词（词表大小$|V|=20$），各个单词出现的次数为：``look'': 4，``people'': 3，``am'': 2，``what'': 1，``want'': 1，``do'': 1。图\ref{fig:2.4-2} 给出了在平滑之前和平滑之后的概率分布。
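 
 \parinterval 以``look''为例可以做一个简单的演算。这里假设取$\theta=1$，并把每个单词看作一个独立的事件，图中实际使用的$\theta$可能不同。语料中单词出现的总次数为$4+3+2+1+1+1=12$，因此平滑之前``look''的概率为$\frac{4}{12} \approx 0.333$，平滑之后为：
 \begin{eqnarray}
 \textrm{P}(\textrm{``look''}) & = & \frac{4+1}{12+20 \times 1} = \frac{5}{32} \approx 0.156 \nonumber
 \end{eqnarray}
 
 \noindent 而一个在语料中从未出现的单词，其平滑后的概率为$\frac{0+1}{12+20 \times 1}=\frac{1}{32} \approx 0.031$，不再是0。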

@@ -790,7 +789,7 @@ N & = & \sum_{r=0}^{\infty}{r^{*}n_r} \nonumber \\

 \noindent 其中$n_1/N$就是分配给所有出现为0次事件的概率。古德-图灵方法最终通过出现1次的$n$-gram估计了出现为0次的事件概率，达到了平滑的效果。

-\parinterval 我们使用一个例子来说明这个方法是如何对事件出现的可能性进行平滑的。仍然考虑在加法平滑法中统计单词的例子，根据古德-图灵方法进行修正如表\ref{tab::2.4-2}所示。
+\parinterval 这里使用一个例子来说明这个方法是如何对事件出现的可能性进行平滑的。仍然考虑在加法平滑法中统计单词的例子，根据古德-图灵方法进行修正如表\ref{tab::2.4-2}所示。

 %------------------------------------------------------
 % 表1.3
@@ -813,7 +812,7 @@ N & = & \sum_{r=0}^{\infty}{r^{*}n_r} \nonumber \\
 %------------------------------------------------------

 \vspace{-1.5em}
-\parinterval 当$r$很大的时候经常会出现$n_{r+1}=0$的情况，而且这时$n_r$也会有噪音存在。通常，简单的古德-图灵方法可能无法很好的处理这种复杂的情况，不过古德-图灵方法仍然是其它一些平滑方法的基础。
+\parinterval 当$r$很大的时候经常会出现$n_{r+1}=0$的情况，而且这时$n_r$也会有噪音存在。通常，简单的古德-图灵方法可能无法很好的处理这种复杂的情况，不过古德-图灵方法仍然是其他一些平滑方法的基础。
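 
 \parinterval 这一点在前面的小例子中就可以观察到。以下演算仅为示意，假设修正次数按古德-图灵估计的通常形式$r^{*}=(r+1)\frac{n_{r+1}}{n_r}$计算。该例中$n_1=3$，$n_2=n_3=n_4=1$，$n_5=0$，于是：
 \begin{eqnarray}
 1^{*} & = & 2 \times \frac{n_2}{n_1} = \frac{2}{3} \approx 0.667 \nonumber \\
 4^{*} & = & 5 \times \frac{n_5}{n_4} = 0 \nonumber
 \end{eqnarray}
 
 \noindent 也就是说，出现次数最多的``look''在修正后的次数反而变成了0，这显然是不合理的。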

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsection{Kneser-Ney平滑方法}\index{Chapter2.4.2.3}
@@ -836,9 +835,9 @@ I cannot see without my reading \underline{\ \ \ \ \ \ \ \ }
 \end{center}
 \vspace{0.0em}

-\noindent 直觉上我们会猜测这个地方的词应该是glasses，但是在训练语料库中Francisco 出现的频率非常高。如果在预测时仍然使用的是标准的1-gram模型，那么系统会高概率选择Francisco填入下划线处，这个结果明显是不合理的。当使用的是混合的插值模型时，如果reading Francisco这种二元语法并没有出现在语料中，就会导致1-gram对结果的影响变大，使得仍然会做出与标准1-gram模型相同的结果，犯下相同的错误。
+\noindent 直觉上会猜测这个地方的词应该是glasses，但是在训练语料库中Francisco 出现的频率非常高。如果在预测时仍然使用的是标准的1-gram模型，那么系统会高概率选择Francisco填入下划线处，这个结果明显是不合理的。当使用的是混合的插值模型时，如果reading Francisco这种二元语法并没有出现在语料中，就会导致1-gram对结果的影响变大，使得仍然会做出与标准1-gram模型相同的结果，犯下相同的错误。

-\parinterval 观察语料中的2-gram发现，Francisco的前一个词仅可能是San，不会出现reading。这个分析提醒了我们，考虑前一个词的影响是有帮助的，比如仅在前一个词是San时，我们才给Francisco赋予一个较高的概率值。基于这种想法，改进原有的1-gram模型，创造一个新的1-gram模型$\textrm{P}_{\textrm{continuation}}$，简写为$\textrm{P}_{\textrm{cont}}$。这个模型可以通过考虑前一个词的影响评估当前词作为第二个词出现的可能性。
+\parinterval 观察语料中的2-gram发现，Francisco的前一个词仅可能是San，不会出现reading。这个分析提醒了我们，考虑前一个词的影响是有帮助的，比如仅在前一个词是San时，才给Francisco赋予一个较高的概率值。基于这种想法，改进原有的1-gram模型，创造一个新的1-gram模型$\textrm{P}_{\textrm{continuation}}$，简写为$\textrm{P}_{\textrm{cont}}$。这个模型可以通过考虑前一个词的影响评估当前词作为第二个词出现的可能性。

 \parinterval 为了评估$\textrm{P}_{\textrm{cont}}$，统计使用当前词作为第二个词所出现二元语法的种类，二元语法种类越多，这个词作为第二个词出现的可能性越高，呈正比：
 \begin{eqnarray}
@@ -888,7 +887,7 @@ c_{\textrm{KN}}(\cdot) & = & \begin{cases} \textrm{count}(\cdot)\quad\quad \text
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{句法分析（短语结构分析）}\index{Chapter2.5}

-\parinterval 通过前面两节的内容，我们已经了解什么叫做``词''、如何对分词问题进行统计建模。同时也了解了如何对词序列的生成进行概率描述。无论是分词还是语言模型都是句子浅层词串信息的一种表示。对于一个自然语言句子来说，它更深层次的结构信息可以通过句法信息来描述，而句法信息也是机器翻译和自然语言处理其它任务中常用的知识之一。
+\parinterval 通过前面两节的内容，已经了解什么叫做``词''、如何对分词问题进行统计建模。同时也了解了如何对词序列的生成进行概率描述。无论是分词还是语言模型都是句子浅层词串信息的一种表示。对于一个自然语言句子来说，它更深层次的结构信息可以通过句法信息来描述，而句法信息也是机器翻译和自然语言处理其他任务中常用的知识之一。

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{句子的句法树表示}\index{Chapter2.5.1}
@@ -907,7 +906,7 @@ c_{\textrm{KN}}(\cdot) & = & \begin{cases} \textrm{count}(\cdot)\quad\quad \text

 \parinterval 图\ref{fig:2.5-1}右侧展示的是另一种句法结构，被称作依存句法树。依存句法树表示了句子中单词和单词之间的依存关系。比如，从这个例子可以了解，``猫''依赖``喜欢''，``吃''依赖``喜欢''，``鱼''依赖``吃''。

-\parinterval 短语结构树和依存句法树的结构和功能有很大不同。短语结构树的叶子节点是单词，中间节点是词性或者短语句法标记。在短语结构分析中，通常把单词称作{\small\bfnew{终结符}}（Terminal），把词性称为{\small\bfnew{预终结符}}（Pre-terminal），而把其它句法标记称为{\small\bfnew{非终结符}}（Non-terminal）。依存句法树没有预终结符和非终结符，所有的节点都是句子里的单词，通过不同节点间的连线表示句子中各个单词之间的依存关系。每个依存关系实际上都是有方向的，头和尾分别指向``接受''和``发出''依存关系的词。依存关系也可以进行分类，图\ref{fig:2.5-1}中我们对每个依存关系的类型都进行了标记，这也被称作是有标记的依存分析。如果不生成这些标记，这样的句法分析被称作无标记的依存分析。
+\parinterval 短语结构树和依存句法树的结构和功能有很大不同。短语结构树的叶子节点是单词，中间节点是词性或者短语句法标记。在短语结构分析中，通常把单词称作{\small\bfnew{终结符}}（Terminal），把词性称为{\small\bfnew{预终结符}}（Pre-terminal），而把其他句法标记称为{\small\bfnew{非终结符}}（Non-terminal）。依存句法树没有预终结符和非终结符，所有的节点都是句子里的单词，通过不同节点间的连线表示句子中各个单词之间的依存关系。每个依存关系实际上都是有方向的，头和尾分别指向``接受''和``发出''依存关系的词。依存关系也可以进行分类，图\ref{fig:2.5-1}中我们对每个依存关系的类型都进行了标记，这也被称作是有标记的依存分析。如果不生成这些标记，这样的句法分析被称作无标记的依存分析。

 \parinterval 虽然短语结构树和依存树的句法表现形式有很大不同，但是它们在某些条件下能相互转化。比如，可以使用启发性规则将短语结构树自动转化为依存树。从应用的角度，依存分析由于形式更加简单，而且直接建模词语之间的依赖，因此在自然语言处理领域中受到很多关注。不过在机器翻译中，无论是哪种句法树结构，都已经被证明会对机器翻译系统产生帮助。特别是短语结构树，在机器翻译中的应用历史更长，研究更为深入，因此本节将会以短语结构分析为例介绍句法分析的相关概念。

@@ -957,7 +956,7 @@ c_{\textrm{KN}}(\cdot) & = & \begin{cases} \textrm{count}(\cdot)\quad\quad \text
 \end{definition}
 %-------------------------------------------

-\parinterval 举例说明，假设有上下文无关文法$G=<N,\Sigma,R,S>$，我们用它描述一个简单中文句法结构。其中非终结符集合为不同的中文句法标记
+\parinterval 举例说明，假设有上下文无关文法$G=<N,\Sigma,R,S>$，可以用它描述一个简单中文句法结构。其中非终结符集合为不同的中文句法标记
 \begin{eqnarray}
 N=\{\textrm{NN},\textrm{VV},\textrm{NP},\textrm{VP},\textrm{IP}\} \nonumber
 \label{eq:2.5-1}
@@ -994,7 +993,7 @@ S=\{\textrm{IP}\} \nonumber
 %-------------------------------------------
 \begin{definition} 上下文无关文法规则的使用

-一个符号序列$u$可以通过使用规则$r$替换其中的某个非终结符，并得到符号序列$v$,我们说$v$是在$u$上使用$r$的结果，记为$u \overset{r}{\Rightarrow} v$：
+一个符号序列$u$可以通过使用规则$r$替换其中的某个非终结符，并得到符号序列$v$，于是$v$是在$u$上使用$r$的结果，记为$u \overset{r}{\Rightarrow} v$：
 \begin{center}
 \input{./Chapter2/Figures/figure-usage-of-regulation}
 \end{center}
@@ -1027,7 +1026,7 @@ s_0 \overset{r_1}{\Rightarrow} s_1 \overset{r_2}{\Rightarrow} s_2 \overset{r_3}{
 \end{definition}
 %-------------------------------------------

-\parinterval 比如，使用前面的示例文法，可以对``猫 喜欢 吃 鱼''进行分析，并形成句法分析树（图\ref{fig:2.5-4}）。我们从起始非终结符IP开始，使用唯一拥有IP作为左部的规则$r_8$推导出NP和VP，之后依次使用规则$r_5$、$r_1$、$r_7$、$r_2$、$r_6$、$r_3$、$r_4$，得到了完整的句法树。
+\parinterval 比如，使用前面的示例文法，可以对``猫 喜欢 吃 鱼''进行分析，并形成句法分析树（图\ref{fig:2.5-4}）。从起始非终结符IP开始，使用唯一拥有IP作为左部的规则$r_8$推导出NP和VP，之后依次使用规则$r_5$、$r_1$、$r_7$、$r_2$、$r_6$、$r_3$、$r_4$，得到了完整的句法树。
 %-------------------------------------------
 % 图2.5.2.3
 \begin{figure}[htp]
@@ -1041,7 +1040,7 @@ s_0 \overset{r_1}{\Rightarrow} s_1 \overset{r_2}{\Rightarrow} s_2 \overset{r_3}{

 \parinterval 通常，可以把推导简记为$d=r_1 \circ r_2 \circ ... \circ r_n$，其中$ \circ $表示规则的组合。显然，$d$也对应了树形结构，也就是句法分析结果。从这个角度看，推导就是描述句法分析树的一种方式。此外，规则的推导也把规则的使用过程与生成的字符串对应起来。一个推导所生成的字符串，也被称作文法所产生的一个{\small\bfnew{句子}}（Sentence）。而一个文法所能生成的所有句子是这个文法所对应的{\small\bfnew{语言}}（Language）。

-\parinterval 但是，句子和规则的推导并不是一一对应的。同一个句子，往往有很多推导的方式，我们称为{\small\bfnew{歧义}}（Ambiguity）。甚至同一棵句法树，也可以对应不同的推导。图\ref{fig:2.5-5}给出同一棵句法树所对应的两种不同的规则推导。
+\parinterval 但是，句子和规则的推导并不是一一对应的。同一个句子，往往有很多推导的方式，这种现象被称为{\small\bfnew{歧义}}（Ambiguity）。甚至同一棵句法树，也可以对应不同的推导。图\ref{fig:2.5-5} 给出同一棵句法树所对应的两种不同的规则推导。

 %-------------------------------------------
 %图2.5.2.4
@@ -1070,7 +1069,7 @@ s_0 \overset{r_1}{\Rightarrow} s_1 \overset{r_2}{\Rightarrow} s_2 \overset{r_3}{
 \end{figure}
 %-------------------------------------------

-\parinterval 在统计句法分析中，我们需要对每个推导进行统计建模，于是我们得到一个模型$\textrm{P}( \cdot )$，对于任意的推导$d$，都可以用$\textrm{P}(d)$计算出推导$d$的概率。这样，给定一个输入句子，我们可以对所有可能的推导用$\textrm{P}(d)$计算其概率值，并选择概率最大的结果作为句法分析的结果输出（图\ref{fig:2.5-7}）。
+\parinterval 在统计句法分析中，需要对每个推导进行统计建模，于是定义一个模型$\textrm{P}( \cdot )$，对于任意的推导$d$，都可以用$\textrm{P}(d)$计算出推导$d$的概率。这样，给定一个输入句子，我们可以对所有可能的推导用$\textrm{P}(d)$计算其概率值，并选择概率最大的结果作为句法分析的结果输出（图\ref{fig:2.5-7}）。
 %-------------------------------------------
 %图2.5.2.6
 \begin{figure}[htp]
@@ -1106,7 +1105,6 @@ s_0 \overset{r_1}{\Rightarrow} s_1 \overset{r_2}{\Rightarrow} s_2 \overset{r_3}{
 \end{eqnarray}

 \noindent 即，在给定规则左部的情况下生成规则右部的可能性。进一步，在上下文无关文法中，每条规则之间的使用都是相互独立的 \footnote[3]{如果是上下文有关文法，规则会形如 $a\alpha b\to a\beta b$，这时$\alpha \to \beta $的过程会依赖前后上下文$a$和$b$}。因此可以把$\textrm{P}(d)$分解为规则概率的乘积：
-
 \begin{eqnarray}
 \textrm{P}(d) & = & \textrm{P}(r_1 \cdot r_2 \cdot ... \cdot r_n) \nonumber \\
 & = & \textrm{P}(r_1) \cdot \textrm{P}(r_2) \cdots \textrm{P}(r_n)
@@ -1136,7 +1134,7 @@ r_6: & & \textrm{VP} \to \textrm{VV}\ \textrm{NN} \nonumber
 \label{eq:2.5-7}
 \end{eqnarray}

-\parinterval 图\ref{fig:2.5-8}展示了通过这种方法计算规则概率的过程。与词法分析类似，我们统计树库中规则左部和右部同时出现的次数，除以规则左部出现的全部次数，所得的结果就是所求规则的概率。这种方法也是典型的相对频度估计。但是如果规则左部和右部同时出现的次数为0时是否代表这个规则概率是0呢？遇到这种情况，可以使用平滑方法对概率进行平滑处理，具体思路可参考\ref{sec2:smoothing}节内容。
+\parinterval 图\ref{fig:2.5-8}展示了通过这种方法计算规则概率的过程。与词法分析类似，可以统计树库中规则左部和右部同时出现的次数，除以规则左部出现的全部次数，所得的结果就是所求规则的概率。这种方法也是典型的相对频度估计。但是如果规则左部和右部同时出现的次数为0时是否代表这个规则概率是0呢？遇到这种情况，可以使用平滑方法对概率进行平滑处理，具体思路可参考\ref{sec2:smoothing}节内容。
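 
 \parinterval 把规则概率和推导概率联系起来，可以做如下示意。这里沿用前文对``猫 喜欢 吃 鱼''的推导$d=r_8 \circ r_5 \circ r_1 \circ r_7 \circ r_2 \circ r_6 \circ r_3 \circ r_4$，而$\textrm{count}(\cdot)$是此处假设的计数记号，表示在树库中出现的次数：
 \begin{eqnarray}
 \textrm{P}(d) & = & \textrm{P}(r_8) \cdot \textrm{P}(r_5) \cdot \textrm{P}(r_1) \cdot \textrm{P}(r_7) \cdot \textrm{P}(r_2) \cdot \textrm{P}(r_6) \cdot \textrm{P}(r_3) \cdot \textrm{P}(r_4) \nonumber \\
 \textrm{P}(r_6) & = & \frac{\textrm{count}(\textrm{VP} \to \textrm{VV}\ \textrm{NN})}{\textrm{count}(\textrm{VP})} \nonumber
 \end{eqnarray}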

 %-------------------------------------------
 % 图2.5.3.1
@@ -1170,7 +1168,7 @@ r_6: & & \textrm{VP} \to \textrm{VV}\ \textrm{NN} \nonumber
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{小结及深入阅读} \label{sec2:summary}\index{Chapter2.6}

-\parinterval 在本章中，我们重点介绍了如何对自然语言处理问题进行统计建模，并从数据中自动学习统计模型的参数，最终使用学习到的模型对新的问题进行处理。之后，我们将这种思想应用到三个自然语言处理任务中，包括：中文分词、语言建模、句法分析，它们也和机器翻译有着紧密的联系。通过系统化的建模，可以发现：经过适当的假设和化简，统计模型可以很好的描述复杂的自然语言处理问题。相关概念和方法也会在后续章节的内容中被广泛使用。
+\parinterval 本章重点介绍了如何对自然语言处理问题进行统计建模，并从数据中自动学习统计模型的参数，最终使用学习到的模型对新的问题进行处理。之后，本章将这种思想应用到三个自然语言处理任务中，包括：中文分词、语言建模、句法分析，它们也和机器翻译有着紧密的联系。通过系统化的建模，可以发现：经过适当的假设和化简，统计模型可以很好的描述复杂的自然语言处理问题。相关概念和方法也会在后续章节的内容中被广泛使用。

 \parinterval 由于本章重点介绍如何用统计的思想对自然语言处理任务进行建模，因此并没有对具体的问题展开深入讨论。有几方面内容，读者可以继续关注：

@@ -1178,7 +1176,7 @@ r_6: & & \textrm{VP} \to \textrm{VV}\ \textrm{NN} \nonumber
 \begin{itemize}
 \item 在建模方面，本章介绍的三个任务均采用的是基于人工先验知识进行模型设计的思路。也就是，问题所表达的现象被``一步一步''生成出来。这是一种典型的生成式建模思想，它把要解决的问题看作一些观测结果的隐含变量（比如，句子是观测结果，分词结果是隐含在背后的变量），之后通过对隐含变量生成观测结果的过程进行建模，以达到对问题进行数学描述的目的。这类模型一般需要依赖一些独立性假设，假设的合理性对最终的性能有较大影响。相对{\small\sffamily\bfseries{生成模型}}（Generative Model），另一类方法是{\small\sffamily\bfseries{判别模型}}（Discriminative Model），它不去描述观测结果的生成过程，而是直接对给定观测结果时隐含变量的条件概率进行建模，这样对问题的建模更加直接，同时这类模型可以更加灵活的引入不同的特征。判别式模型在自然语言处理中也有广泛应用\cite{shannon1948mathematical}\cite{ng2002discriminative}。 在本书的第四章也会使用到判别式模型。

-\item 从现在自然语言处理的前沿看，基于端到端学习的深度学习方法在很多任务中都取得了领先的性能。但是，本章并没有涉及深度学习及相关方法，这是由于笔者认为：{\color{red} 对问题的建模是自然语言处理的基础，对问题的本质刻画并不会因为方法的改变而改变}。因此，本章的内容没有太多的陷入到更加复杂的模型和算法设计中，相反，我们希望关注对基本问题的理解和描述。不过，一些前沿方法仍可以作为参考，包括：基于条件随机场和双向长短时记忆模型的序列标注模型\cite{lafferty2001conditional}\cite{huang2015bidirectional}\cite{ma2016end}、神经语言模型\cite{bengio2003neural}\cite{mikolov2010recurrent}、神经句法分析模型\cite{chen2014fast}\cite{zhu2015long}。
+\item 从现在自然语言处理的前沿看，基于端到端学习的深度学习方法在很多任务中都取得了领先的性能。但是，本章并没有涉及深度学习及相关方法，这是由于笔者认为：对问题的建模是自然语言处理的基础，对问题的本质刻画并不会因为方法的改变而改变。因此，本章的内容没有太多的陷入到更加复杂的模型和算法设计中，相反，我们希望关注对基本问题的理解和描述。不过，一些前沿方法仍可以作为参考，包括：基于条件随机场和双向长短时记忆模型的序列标注模型\cite{lafferty2001conditional}\cite{huang2015bidirectional}\cite{ma2016end}、神经语言模型\cite{bengio2003neural}\cite{mikolov2010recurrent}、神经句法分析模型\cite{chen2014fast}\cite{zhu2015long}。

 \item 此外，本章并没有对模型的推断方法进行深入介绍。比如，对于一个句子如何有效的找到概率最大的分词结果？显然，简单枚举是不可行的。对于这类问题比较简单的解决方法是使用动态规划\cite{huang2008advanced}。如果使用动态规划的条件不满足，可以考虑使用更加复杂的搜索策略，并配合一定剪枝方法。实际上，无论是$n$-gram语言模型还是简单的上下文无关文法都有高效的推断方法。比如，$n$-gram语言模型可以被视为概率有限状态自动机，因此可以直接使用成熟的自动机工具。对于更复杂的句法分析问题，可以考虑使用移进-规约方法来解决推断问题\cite{aho1972theory}。
 \end{itemize}

--- a/Book/Chapter3/Chapter3.tex
+++ b/Book/Chapter3/Chapter3.tex
--- a/Book/Chapter3/Figures/figure-example-of-t-s-generate.tex
+++ b/Book/Chapter3/Figures/figure-example-of-t-s-generate.tex
 %%% outline
 %-------------------------------------------------------------------------
 \begin{tikzpicture}
-
+\begin{scope}
 {
+\node [anchor=north west] (st) at (0,0) {\color{white}$\mathbf{s}$};
+\node [anchor=north] (taut) at ([yshift=-3em]st.south) {\color{white}\sffamily\bfseries{$\tau$}};
+\node [anchor=north] (phit) at ([yshift=-3em]taut.south) {\color{white}\sffamily\bfseries{$\phi$}};
+\node [anchor=north] (tt) at ([yshift=-3em]phit.south) {\color{white}$\mathbf{t}$};
+}
 {\scriptsize
-\node [anchor=west,minimum height=2.5em,minimum width=5.5em] (sf1) at (2.3em,0) {};
-\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.5em] (s1) at ([xshift=2.3em]sf1.east) {科学家};
-\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.5em] (s2) at ([xshift=2.32em]s1.east) {们};
-\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.5em] (s3) at ([xshift=2.33em]s2.east) {并不};
-\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.5em] (s4) at ([xshift=2.30em]s3.east) {知道};
-
-\node [anchor=north] (tau11) at ([xshift=-1.5em,yshift=-3.5em]sf1.south) {$\tau_0$};
-\node [anchor=west] (tau12) at ([xshift=-0.5em]tau11.east) {\tiny{1.NULL}};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (tau1) [fit = (tau11) (tau12)] {};
-
-\node [anchor=west] (tau21) at ([xshift=1.9em]tau1.east) {$\tau_1$};
-\node [anchor=west] (tau22) at ([xshift=-0.5em]tau21.north east) {\tiny{1.科学家}};
-\node [anchor=west] (tau23) at ([xshift=-0.5em]tau21.south east) {\tiny{2.们}};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (tau2)[fit = (tau21) (tau22) (tau23)] {};
-
-\node [anchor=west] (tau31) at ([xshift=2.1em]tau2.east) {$\tau_2$};
-\node [anchor=west] (tau32) at ([xshift=-0.5em]tau31.east) {\tiny{1.NULL}};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (tau3) [fit = (tau31) (tau32)] {};
-
-\node [anchor=west] (tau41) at ([xshift=2.3em]tau3.east) {$\tau_3$};
-\node [anchor=west] (tau42) at ([xshift=-0.5em]tau41.east) {\tiny{1.并不}};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (tau4) [fit = (tau41) (tau42)] {};
-
-\node [anchor=west] (tau51) at ([xshift=2.3em]tau4.east) {$\tau_4$};
-\node [anchor=west] (tau52) at ([xshift=-0.5em]tau51.east) {\tiny{1.知道}};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (tau5) [fit = (tau51) (tau52)] {};
+\node [anchor=west,minimum height=2.5em,minimum width=5.0em] (sf1) at ([xshift=1em]st.east) {};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=green!30,drop shadow] (s1) at ([xshift=2.48em]sf1.east) {科学家};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=green!30,drop shadow] (s2) at ([xshift=2.19em]s1.east) {们};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=green!30,drop shadow] (s3) at ([xshift=2.185em]s2.east) {并不};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=green!30,drop shadow] (s4) at ([xshift=2.183em]s3.east) {知道};
 }


+{\scriptsize
+\node [anchor=west] (tau11) at ([xshift=1.5em]taut.east) {$\tau_0$\tiny{1.NULL}};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=red!30,drop shadow] (tau1) [fit = (tau11)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (tau21) at ([xshift=1.80em]tau1.east) {$\tau_1$};
+\node [anchor=west] (tau22) at ([yshift=-0.2em,xshift=-0.5em]tau21.north east) {\tiny{1.科学家}};
+\node [anchor=west] (tau23) at ([yshift=0.2em,xshift=-0.5em]tau21.south east) {\tiny{2.们}};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=blue!30,drop shadow] (tau2)[fit = (tau21) (tau22) (tau23)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (tau31) at ([xshift=2.05em]tau2.east) {$\tau_2$\tiny{1.NULL}};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=red!30,drop shadow] (tau3) [fit = (tau31)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (tau41) at ([xshift=2.2em]tau3.east) {$\tau_3$\tiny{1.并不}};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=red!30,drop shadow] (tau4) [fit = (tau41)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (tau51) at ([xshift=2.2em]tau4.east) {$\tau_4$\tiny{1.知道}};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=red!30,drop shadow] (tau5) [fit = (tau51)] {};
+\end{pgfonlayer}
+}

 {
-\node [anchor=north] (d1) at ([yshift=-6em]sf1.south) {$...$};
+\node [anchor=north] (d1) at ([yshift=-6.02em]sf1.south) {$...$};
 \node [anchor=north] (d2) at ([yshift=-6em]s1.south) {$...$};
 \node [anchor=north] (d31) at ([yshift=-6em]s2.south) {$...$};
 \node [anchor=north] (d32) at ([xshift=0.2em]d31.south) {\footnotesize{${<{\tau,\pi}>}_1$}};
@@ -43,81 +55,83 @@
 \node [anchor=north] (d5) at ([yshift=-6em]s4.south) {$...$};
 }

-
-
 \draw [->,thick,dashed] ([yshift=1em]tau1.north west) -- ([yshift=1em]tau5.north east);
 \draw [->,thick,dashed] ([yshift=-1em]tau1.south west) -- ([yshift=-1em]tau5.south east);
-
+%第一层连线
+\draw [->,thick] (tau2.north) -- (s1.south);
+\draw [->,thick] (tau4.north) -- (s3.south);
+\draw [->,thick] (tau5.north) -- (s4.south);
 \draw [->,thick] (tau23.east) -- (s2.south);
+%第二层连线
+\draw [->,thick] (d1.north) -- ([yshift=-4.48em]sf1.south);
+\draw [->,thick] (d2.north) -- ([yshift=-4.45em]s1.south);
+\draw [->,thick] (d31.north) -- ([yshift=-4.45em]s2.south);
+\draw [->,thick] (d4.north) -- ([yshift=-4.45em]s3.south);
+\draw [->,thick] (d5.north) -- ([yshift=-4.45em]s4.south);

-\draw [->,thick] ([yshift=4.2em]d2.north) -- (s1.south);
-\draw [->,thick] ([yshift=4.2em]d4.north) -- (s3.south);
-\draw [->,thick] ([yshift=4.2em]d5.north) -- (s4.south);
-
-\draw [->,thick] (d1.north) -- ([yshift=-4.25em]sf1.south);
-\draw [->,thick] (d2.north) -- ([yshift=-4.25em]s1.south);
-\draw [->,thick] (d31.north) -- ([yshift=-4.25em]s2.south);
-\draw [->,thick] (d4.north) -- ([yshift=-4.25em]s3.south);
-\draw [->,thick] (d5.north) -- ([yshift=-4.25em]s4.south);
-
-
+\end{scope}

 {\scriptsize
-\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2.5em,minimum width=5.5em] (ns1) at ([yshift=-13em]s1.south) {科学家};
-\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2.5em,minimum width=5.5em] (ns2) at ([yshift=-13em]s2.south) {们};
-\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2.5em,minimum width=5.5em] (ns3) at ([yshift=-13em]s3.south) {并不};
-\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2.5em,minimum width=5.5em] (ns4) at ([yshift=-13em]s4.south) {知道};
-
-\node [anchor=north] (ntau11) at ([yshift=-15em]tau11.south) {$\tau_0$};
-\node [anchor=west] (ntau12) at ([xshift=-0.5em]ntau11.east) {\tiny{1.NULL}};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (ntau1) [fit = (ntau11) (ntau12)] {};
-
-\node [anchor=west] (ntau21) at ([xshift=1.9em]ntau1.east) {$\tau_1$};
-\node [anchor=west] (ntau22) at ([xshift=-0.5em]ntau21.north east) {\tiny{1.们}};
-\node [anchor=west] (ntau23) at ([xshift=-0.5em]ntau21.south east) {\tiny{2.科学家}};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (ntau2)[fit = (ntau21) (ntau22) (ntau23)] {};
-
-\node [anchor=west] (ntau31) at ([xshift=2.1em]ntau2.east) {$\tau_2$};
-\node [anchor=west] (ntau32) at ([xshift=-0.5em]ntau31.east) {\tiny{1.NULL}};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (ntau3) [fit = (ntau31) (ntau32)] {};
-
-\node [anchor=west] (ntau41) at ([xshift=2.3em]ntau3.east) {$\tau_3$};
-\node [anchor=west] (ntau42) at ([xshift=-0.5em]ntau41.east) {\tiny{1.并不}};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (ntau4) [fit = (ntau41) (ntau42)] {};
-
-\node [anchor=west] (ntau51) at ([xshift=2.3em]ntau4.east) {$\tau_4$};
-\node [anchor=west] (ntau52) at ([xshift=-0.5em]ntau51.east) {\tiny{1.知道}};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (ntau5) [fit = (ntau51) (ntau52)] {};
+\node [anchor=west,minimum height=2.5em,minimum width=5.0em] (sf12) at ([yshift=-15.0em,xshift=1em]st.east) {};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=green!30,drop shadow] (s12) at ([xshift=2.48em]sf12.east) {科学家};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=green!30,drop shadow] (s22) at ([xshift=2.19em]s12.east) {们};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=green!30,drop shadow] (s32) at ([xshift=2.185em]s22.east) {并不};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=green!30,drop shadow] (s42) at ([xshift=2.183em]s32.east) {知道};
 }


+{\scriptsize
+\node [anchor=west] (tau112) at ([yshift=-15.0em,xshift=1.5em]taut.east) {$\tau_0$\tiny{1.NULL}};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=red!30,drop shadow] (tau12) [fit = (tau112)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (tau212) at ([xshift=1.80em]tau12.east) {$\tau_1$};
+\node [anchor=west] (tau222) at ([yshift=-0.2em,xshift=-0.5em]tau212.north east) {\tiny{1.们}};
+\node [anchor=west] (tau232) at ([yshift=0.2em,xshift=-0.5em]tau212.south east) {\tiny{2.科学家}};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=yellow!30,drop shadow] (tau22)[fit = (tau212) (tau222) (tau232)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (tau312) at ([xshift=2.05em]tau22.east) {$\tau_2$\tiny{1.NULL}};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=red!30,drop shadow] (tau32) [fit = (tau312)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (tau412) at ([xshift=2.2em]tau32.east) {$\tau_3$\tiny{1.并不}};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=red!30,drop shadow] (tau42) [fit = (tau412)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (tau512) at ([xshift=2.2em]tau42.east) {$\tau_4$\tiny{1.知道}};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=red!30,drop shadow] (tau52) [fit = (tau512)] {};
+\end{pgfonlayer}
+}
 {
-{
-\node [anchor=north] (nd1) at ([yshift=-11em]d1.south) {$...$};
-\node [anchor=north] (nd2) at ([yshift=-11em]d2.south) {$...$};
-\node [anchor=north] (nd31) at ([yshift=-11em]d31.south) {$...$};
-\node [anchor=north] (nd32) at ([xshift=0.2em]nd31.south) {\footnotesize{${<{\tau,\pi}>}_2$}};
-\node [anchor=north] (nd4) at ([yshift=-11em]d4.south) {$...$};
-\node [anchor=north] (nd5) at ([yshift=-11em]d5.south) {$...$};
+\node [anchor=north] (d12) at ([yshift=-6.02em]sf12.south) {$...$};
+\node [anchor=north] (d22) at ([yshift=-6em]s12.south) {$...$};
+\node [anchor=north] (d312) at ([yshift=-6em]s22.south) {$...$};
+\node [anchor=north] (d322) at ([xshift=0.2em]d312.south) {\footnotesize{${<{\tau,\pi}>}_2$}};
+\node [anchor=north] (d42) at ([yshift=-6em]s32.south) {$...$};
+\node [anchor=north] (d52) at ([yshift=-6em]s42.south) {$...$};
 }

-\draw [->,thick,dashed] ([yshift=1em]ntau1.north west) -- ([yshift=1em]ntau5.north east);
-\draw [->,thick,dashed] ([yshift=-1em]ntau1.south west) -- ([yshift=-1em]ntau5.south east);
-
-
-\draw [->,thick] (ntau23.east) -- (ns2.south);
+\draw [->,thick,dashed] ([yshift=1em]tau12.north west) -- ([yshift=1em]tau52.north east);
+\draw [->,thick,dashed] ([yshift=-1em]tau12.south west) -- ([yshift=-1em]tau52.south east);
+%第一层连线
+\draw [->,thick] (tau22.north) -- (s12.south);
+\draw [->,thick] (tau42.north) -- (s32.south);
+\draw [->,thick] (tau52.north) -- (s42.south);
+\draw [->,thick] (tau232.east) -- (s22.south);
+%第二层连线
+\draw [->,thick] (d12.north) -- ([yshift=-4.48em]sf12.south);
+\draw [->,thick] (d22.north) -- ([yshift=-4.45em]s12.south);
+\draw [->,thick] (d312.north) -- ([yshift=-4.45em]s22.south);
+\draw [->,thick] (d42.north) -- ([yshift=-4.45em]s32.south);
+\draw [->,thick] (d52.north) -- ([yshift=-4.45em]s42.south);
+
+%\end{scope}

-\draw [->,thick] ([yshift=4.2em]nd2.north) -- (ns1.south);
-\draw [->,thick] ([yshift=4.2em]nd4.north) -- (ns3.south);
-\draw [->,thick] ([yshift=4.2em]nd5.north) -- (ns4.south);
-
-\draw [->,thick] (nd1.north) -- ([yshift=-16.15em]sf1.south);
-\draw [->,thick] (nd2.north) -- ([yshift=-16.15em]s1.south);
-\draw [->,thick] (nd31.north) -- ([yshift=-16.15em]s2.south);
-\draw [->,thick] (nd4.north) -- ([yshift=-16.15em]s3.south);
-\draw [->,thick] (nd5.north) -- ([yshift=-16.15em]s4.south);
-}
-
-}
 \end{tikzpicture}
 %---------------------------------------------------------------------
--- a/Book/Chapter3/Figures/figure-expression.tex
+++ b/Book/Chapter3/Figures/figure-expression.tex
@@ -12,16 +12,16 @@



-\node [anchor=west,inner sep=2pt,minimum height=2.5em] (eq1) at (0,0) {${\textrm{P}(\tau,\pi|\mathbf{t}) =  \prod_{j=0}^{l}{\textrm{P}(\varphi_j|\varphi_{1}^{j-1},\mathbf{t})} \times {\textrm{P}(\varphi_0|\varphi_{1}^{l},\mathbf{t})} \times}$};
-\node [anchor=north west,inner sep=2pt,minimum height=2.5em] (eq2) at ([xshift=-15.06em,yshift=0.0em]eq1.south east) {${\prod_{j=0}^l{\prod_{k=1}^{\varphi_j}{\textrm{P}(\tau_{jk}|\tau_{j1}^{k-1},\tau_{1}^{j-1},\varphi_{0}^{l},\mathbf{t} )}} \times}$};
-\node [anchor=north west,inner sep=2pt,minimum height=2.5em] (eq3) at ([xshift=-15.56em,yshift=0.0em]eq2.south east) {${\prod_{j=1}^l{\prod_{k=1}^{\varphi_j}{\textrm{P}(\pi_{jk}|\pi_{j1}^{k-1},\pi_{1}^{j-1},\tau_{0}^{l},\varphi_{0}^{l},\mathbf{t} )}} \times}$};
+\node [anchor=west,inner sep=2pt,minimum height=2.5em] (eq1) at (0,0) {${\textrm{P}(\tau,\pi|\mathbf{t}) =  \prod_{i=1}^{l}{\textrm{P}(\varphi_i|\varphi_{1}^{i-1},\mathbf{t})} \times {\textrm{P}(\varphi_0|\varphi_{1}^{l},\mathbf{t})} \times}$};
+\node [anchor=north west,inner sep=2pt,minimum height=2.5em] (eq2) at ([xshift=-15.06em,yshift=0.0em]eq1.south east) {${\prod_{i=0}^l{\prod_{k=1}^{\varphi_i}{\textrm{P}(\tau_{ik}|\tau_{i1}^{k-1},\tau_{1}^{i-1},\varphi_{0}^{l},\mathbf{t} )}} \times}$};
+\node [anchor=north west,inner sep=2pt,minimum height=2.5em] (eq3) at ([xshift=-15.56em,yshift=0.0em]eq2.south east) {${\prod_{i=1}^l{\prod_{k=1}^{\varphi_i}{\textrm{P}(\pi_{ik}|\pi_{i1}^{k-1},\pi_{1}^{i-1},\tau_{0}^{l},\varphi_{0}^{l},\mathbf{t} )}} \times}$};
 \node [anchor=north west,inner sep=2pt,minimum height=2.5em] (eq4) at ([xshift=-17.10em,yshift=0.0em]eq3.south east) {{${\prod_{k=1}^{\varphi_0}{\textrm{P}(\pi_{0k}|\pi_{01}^{k-1},\pi_{1}^{l},\tau_{0}^{l},\varphi_{0}^{l},\mathbf{t} )}}$}};

-\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=red!15] (part1) at ([xshift=-12.5em,yshift=0.0em]eq1.east) {{${\textrm{P}(\varphi_j|\varphi_{1}^{j-1},\mathbf{t})}$}};
-\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=blue!15] (part2) at ([xshift=-5.9em,yshift=0.0em]eq1.east) {{${\textrm{P}(\varphi_0|\varphi_{1}^{l},\mathbf{t})}$}};
-\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=green!15] (part3) at ([xshift=-10.7em,yshift=0.0em]eq2.east) {{${\textrm{P}(\tau_{jk}|\tau_{j1}^{k-1},\tau_{1}^{j-1},\varphi_{0}^{l},\mathbf{t} )}$}};
-\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=yellow!15] (part4) at ([xshift=-12.23em,yshift=0.0em]eq3.east) {{${\textrm{P}(\pi_{jk}|\pi_{j1}^{k-1},\pi_{1}^{j-1},\tau_{0}^{l},\varphi_{0}^{l},\mathbf{t} )}$}};
-\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=gray!15] (part5) at ([xshift=-10.4em,yshift=0.0em]eq4.east) {{${\textrm{P}(\pi_{0k}|\pi_{01}^{k-1},\pi_{1}^{l},\tau_{0}^{l},\varphi_{0}^{l},\mathbf{t} )}$}};
+\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=red!30] (part1) at ([xshift=-12.3em,yshift=0.0em]eq1.east) {{${\textrm{P}(\varphi_i|\varphi_{1}^{i-1},\mathbf{t})}$}};
+\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=blue!30] (part2) at ([xshift=-5.9em,yshift=0.0em]eq1.east) {{${\textrm{P}(\varphi_0|\varphi_{1}^{l},\mathbf{t})}$}};
+\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=green!30] (part3) at ([xshift=-10.5em,yshift=0.0em]eq2.east) {{${\textrm{P}(\tau_{ik}|\tau_{i1}^{k-1},\tau_{1}^{i-1},\varphi_{0}^{l},\mathbf{t} )}$}};
+\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=yellow!30] (part4) at ([xshift=-12.03em,yshift=0.0em]eq3.east) {{${\textrm{P}(\pi_{ik}|\pi_{i1}^{k-1},\pi_{1}^{i-1},\tau_{0}^{l},\varphi_{0}^{l},\mathbf{t} )}$}};
+\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=gray!30] (part5) at ([xshift=-10.4em,yshift=0.0em]eq4.east) {{${\textrm{P}(\pi_{0k}|\pi_{01}^{k-1},\pi_{1}^{l},\tau_{0}^{l},\varphi_{0}^{l},\mathbf{t} )}$}};


 \end{tikzpicture}

--- a/Book/Chapter3/Figures/figure-greedy-MT-decoding-pseudo-code.tex
+++ b/Book/Chapter3/Figures/figure-greedy-MT-decoding-pseudo-code.tex
@@ -120,7 +120,7 @@
 %% remark 5
 \begin{scope}
 {
-\node [anchor=north west,align=left] (remark5) at ([xshift=0.6em,yshift=-1.6em]remark4.south west) {\textsc{PruneForTop1}\\保留得分最高的结果};
+\node [anchor=north west,align=left] (remark5) at ([xshift=0.72em,yshift=-1.6em]remark4.south west) {\textsc{PruneForTop1}\\保留得分最高的结果};
 \node [anchor=west,draw,inner sep=1pt] (s1) at ([yshift=-0.5em,xshift=1.2em]remark5.north east){\tiny{0.234}};
 \node [anchor=north west,draw,inner sep=1pt] (s2) at ([yshift=-0.2em]s1.south west){\tiny{0.197}};
 \node [anchor=north west,draw,inner sep=1pt] (s3) at ([yshift=-0.2em]s2.south west){\tiny{0.083}};

--- a/Book/Chapter3/Figures/figure-probability_translation_process.tex
+++ b/Book/Chapter3/Figures/figure-probability_translation_process.tex
@@ -10,57 +10,70 @@
 \node [anchor=north] (tt) at ([yshift=-3em]phit.south) {$\mathbf{t}$};
 }
 {\scriptsize
-\node [anchor=west,minimum height=2.5em,minimum width=5.5em] (sf1) at ([xshift=1em]st.east) {};
-\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.5em] (s1) at ([xshift=2.3em]sf1.east) {科学家};
-\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.5em] (s2) at ([xshift=2.3em]s1.east) {们};
-\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.5em] (s3) at ([xshift=2.3em]s2.east) {并不};
-\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.5em] (s4) at ([xshift=2.3em]s3.east) {知道};
+\node [anchor=west,minimum height=2.5em,minimum width=5.0em] (sf1) at ([xshift=1em]st.east) {};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=green!30,drop shadow] (s1) at ([xshift=2.48em]sf1.east) {科学家};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=green!30,drop shadow] (s2) at ([xshift=2.19em]s1.east) {们};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=green!30,drop shadow] (s3) at ([xshift=2.185em]s2.east) {并不};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=green!30,drop shadow] (s4) at ([xshift=2.183em]s3.east) {知道};
 }


 {\scriptsize
-\node [anchor=west] (tau11) at ([xshift=1.5em]taut.east) {$\tau_0$};
-\node [anchor=west] (tau12) at ([xshift=-0.5em]tau11.east) {\tiny{1.NULL}};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (tau1) [fit = (tau11) (tau12)] {};
-
-\node [anchor=west] (tau21) at ([xshift=1.9em]tau1.east) {$\tau_1$};
-\node [anchor=west] (tau22) at ([xshift=-0.5em]tau21.north east) {\tiny{1.科学家}};
-\node [anchor=west] (tau23) at ([xshift=-0.5em]tau21.south east) {\tiny{2.们}};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (tau2)[fit = (tau21) (tau22) (tau23)] {};
-
-\node [anchor=west] (tau31) at ([xshift=2.1em]tau2.east) {$\tau_2$};
-\node [anchor=west] (tau32) at ([xshift=-0.5em]tau31.east) {\tiny{1.NULL}};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (tau3) [fit = (tau31) (tau32)] {};
-
-\node [anchor=west] (tau41) at ([xshift=2.3em]tau3.east) {$\tau_3$};
-\node [anchor=west] (tau42) at ([xshift=-0.5em]tau41.east) {\tiny{1.并不}};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (tau4) [fit = (tau41) (tau42)] {};
-
-\node [anchor=west] (tau51) at ([xshift=2.3em]tau4.east) {$\tau_4$};
-\node [anchor=west] (tau52) at ([xshift=-0.5em]tau51.east) {\tiny{1.知道}};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (tau5) [fit = (tau51) (tau52)] {};
+\node [anchor=west] (tau11) at ([xshift=1.5em]taut.east) {$\tau_0$\tiny{1.NULL}};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=red!30,drop shadow] (tau1) [fit = (tau11)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (tau21) at ([xshift=1.80em]tau1.east) {$\tau_1$};
+\node [anchor=west] (tau22) at ([yshift=-0.2em,xshift=-0.5em]tau21.north east) {\tiny{1.科学家}};
+\node [anchor=west] (tau23) at ([yshift=0.2em,xshift=-0.5em]tau21.south east) {\tiny{2.们}};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=red!30,drop shadow] (tau2)[fit = (tau21) (tau22) (tau23)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (tau31) at ([xshift=2.05em]tau2.east) {$\tau_2$\tiny{1.NULL}};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=red!30,drop shadow] (tau3) [fit = (tau31)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (tau41) at ([xshift=2.2em]tau3.east) {$\tau_3$\tiny{1.并不}};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=red!30,drop shadow] (tau4) [fit = (tau41)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (tau51) at ([xshift=2.2em]tau4.east) {$\tau_4$\tiny{1.知道}};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=red!30,drop shadow] (tau5) [fit = (tau51)] {};
+\end{pgfonlayer}
 }

-{\scriptsize
-\node [anchor=west] (phi11) at ([xshift=2.4em]phit.east) {$\phi_0$};
-\node [anchor=west] (phi12) at ([xshift=-0.5em]phi11.east) {0};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (phi1) [fit = (phi11) (phi12)] {};
-
-\node [anchor=west] (phi21) at ([xshift=3em]phi1.east) {$\phi_1$};
-\node [anchor=west] (phi22) at ([xshift=-0.5em]phi21.east) {2};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (phi2) [fit = (phi21) (phi22)] {};
-
-\node [anchor=west] (phi31) at ([xshift=3em]phi2.east) {$\phi_2$};
-\node [anchor=west] (phi32) at ([xshift=-0.5em]phi31.east) {0};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (phi3) [fit = (phi31) (phi32)] {};

-\node [anchor=west] (phi41) at ([xshift=3em]phi3.east) {$\phi_3$};
-\node [anchor=west] (phi42) at ([xshift=-0.5em]phi41.east) {1};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (phi4) [fit = (phi41) (phi42)] {};
+{\scriptsize
+\node [anchor=west] (phi11) at ([xshift=2.3em]phit.east) {$\phi_0$\ 0};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=blue!30,drop shadow] (phi1) [fit = (phi11)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (phi21) at ([xshift=2.947em]phi1.east) {$\phi_1$\ 2};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=blue!30,drop shadow] (phi2) [fit = (phi21)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (phi31) at ([xshift=2.876em]phi2.east) {$\phi_2$\ 0};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=blue!30,drop shadow] (phi3) [fit = (phi31)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (phi41) at ([xshift=2.8715em]phi3.east) {$\phi_3$\ 1};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=blue!30,drop shadow] (phi4) [fit = (phi41)] {};
+\end{pgfonlayer}
+
+\node [anchor=west] (phi51) at ([xshift=2.86925em]phi4.east) {$\phi_4$\ 1};
+\begin{pgfonlayer}{background}
+\node [rounded rectangle,draw,line width=1pt,minimum height=3.0em,minimum width=6.8em,fill=blue!30,drop shadow] (phi5) [fit = (phi51)] {};
+\end{pgfonlayer}

-\node [anchor=west] (phi51) at ([xshift=3em]phi4.east) {$\phi_4$};
-\node [anchor=west] (phi52) at ([xshift=-0.5em]phi51.east) {1};
-\node [rounded rectangle,draw,line width=1pt,minimum height=3.4em,minimum width=7.8em] (phi5) [fit = (phi51) (phi52)] {};
 }

 \draw [->,thick,dashed] ([yshift=-1.4em]st.south west) -- ([xshift=0.8em,yshift=-1em]s4.south east);
@@ -68,40 +81,38 @@
 \draw [->,thick,dashed] ([yshift=-10.3em]st.south west) -- ([xshift=0.8em,yshift=-9.9em]s4.south east);

 {\scriptsize
-\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2.5em,minimum width=5.5em] (t1) at ([yshift=-15em]sf1.south) {$t_0$};
-\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2.5em,minimum width=5.5em] (t2) at ([yshift=-15em]s1.south) {Scientists};
-\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2.5em,minimum width=5.5em] (t3) at ([yshift=-15em]s2.south) {do};
-\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2.5em,minimum width=5.5em] (t4) at ([yshift=-15em]s3.south) {not};
-\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2.5em,minimum width=5.5em] (t5) at ([yshift=-15em]s4.south) {konw};
+\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=yellow!30,drop shadow] (t1) at ([xshift=0.182em,yshift=-15em]sf1.south) {$t_0$};
+\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=yellow!30,drop shadow] (t2) at ([yshift=-15em]s1.south) {Scientists};
+\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=yellow!30,drop shadow] (t3) at ([yshift=-15em]s2.south) {do};
+\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=yellow!30,drop shadow] (t4) at ([yshift=-15em]s3.south) {not};
+\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2.5em,minimum width=5.0em,fill=yellow!30,drop shadow] (t5) at ([yshift=-15em]s4.south) {know};
 }
-
-
+%第一层连线
+\draw [->,thick] (tau2.north) -- (s1.south);
+\draw [->,thick] (tau4.north) -- (s3.south);
+\draw [->,thick] (tau5.north) -- (s4.south);
 \draw [->,thick] (tau23.east) -- (s2.south);
-
-\draw [->,thick] (t1.north) -- ([yshift=-8.8em]sf1.south);
-\draw [->,thick] (t2.north) -- ([yshift=-8.8em]s1.south);
-\draw [->,thick] (t3.north) -- ([yshift=-8.8em]s2.south);
-\draw [->,thick] (t4.north) -- ([yshift=-8.8em]s3.south);
-\draw [->,thick] (t5.north) -- ([yshift=-8.8em]s4.south);
-
-\draw [->,thick] ([yshift=4.6em]t1.north) -- ([yshift=-4.4em]sf1.south);
-\draw [->,thick] ([yshift=4.6em]t2.north) -- ([yshift=-4.4em]s1.south);
-\draw [->,thick] ([yshift=4.6em]t3.north) -- ([yshift=-4.4em]s2.south);
-\draw [->,thick] ([yshift=4.6em]t4.north) -- ([yshift=-4.4em]s3.south);
-\draw [->,thick] ([yshift=4.6em]t5.north) -- ([yshift=-4.4em]s4.south);
-
-\draw [->,thick] ([yshift=9em]t2.north) -- (s1.south);
-\draw [->,thick] ([yshift=9em]t4.north) -- (s3.south);
-\draw [->,thick] ([yshift=9em]t5.north) -- (s4.south);
+%第二层连线
+\draw [->,thick] (phi1.north) -- (tau1.south);
+\draw [->,thick] (phi2.north) -- (tau2.south);
+\draw [->,thick] (phi3.north) -- (tau3.south);
+\draw [->,thick] (phi4.north) -- (tau4.south);
+\draw [->,thick] (phi5.north) -- (tau5.south);
+%第三层连线
+\draw [->,thick] (t1.north) -- (phi1.south);
+\draw [->,thick] (t2.north) -- (phi2.south);
+\draw [->,thick] (t3.north) -- (phi3.south);
+\draw [->,thick] (t4.north) -- (phi4.south);
+\draw [->,thick] (t5.north) -- (phi5.south);


 {\scriptsize
-\node [anchor=west] (sent11) at ([xshift=1em,yshift=-2em]s4.south east) {把这些元语};
+\node [anchor=west] (sent11) at ([xshift=1em,yshift=-0.3em]s4.south east) {把这些源语};
 \node [anchor=west] (sent12) at ([yshift=-1em]sent11.west) {言单词放在};
 \node [anchor=west] (sent13) at ([yshift=-1em]sent12.west) {合适的位置};
-\node [anchor=west] (sent21) at ([yshift=-3em]sent13.west) {确定生成元};
+\node [anchor=west] (sent21) at ([yshift=-4.6em]sent13.west) {确定生成源};
 \node [anchor=west] (sent22) at ([yshift=-1em]sent21.west) {语言单词};
-\node [anchor=west] (sent31) at ([yshift=-4em]sent22.west) {确定生成元};
+\node [anchor=west] (sent31) at ([yshift=-4.6em]sent22.west) {确定生成源};
 \node [anchor=west] (sent32) at ([yshift=-1em]sent31.west) {语言单词的};
 \node [anchor=west] (sent33) at ([yshift=-1em]sent32.west) {个数};
 }

--- a/Book/Chapter3/Figures/figure-processes-SMT.tex
+++ b/Book/Chapter3/Figures/figure-processes-SMT.tex
@@ -16,7 +16,7 @@
 \end{pgfonlayer}
 }

-\node [anchor=west,ugreen] (P) at ([xshift=4em,yshift=-0.7em]corpus.east){P($t|s$)};
+\node [anchor=west,ugreen] (P) at ([xshift=4em,yshift=-0.7em]corpus.east){P($\mathbf{t}|\mathbf{s}$)};
 \node [anchor=south] (modellabel) at (P.north) {{\color{ublue} {\scriptsize \sffamily\bfseries{翻译模型}}}};

 \begin{pgfonlayer}{background}

--- a/Book/Chapter3/Figures/figure-word-alignment&probability-distribution-in-IBM-model-3.tex
+++ b/Book/Chapter3/Figures/figure-word-alignment&probability-distribution-in-IBM-model-3.tex
@@ -15,8 +15,8 @@
 \node [anchor=west] (eq2) at ([xshift=3.0em,yshift=0.0em]eq1.east) {早饭};
 \node [anchor=north] (eq3) at ([xshift=0.0em,yshift=-2.0em]eq1.south) {Have};
 \node [anchor=north] (eq4) at ([xshift=0.0em,yshift=-2.0em]eq2.south) {breakfast};
-\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$a_{1}$};
-\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},a_{1}|\mathbf{t})=0.5$};
+\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\mathbf{a}_{1}$};
+\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},\mathbf{a}_{1}|\mathbf{t})=0.5$};
 \draw [-,very thick](eq1.south) -- (eq3.north);
 \draw [-,very thick](eq2.south) -- (eq4.north);
 \node [anchor=west] (eq7) at ([xshift=13.1em,yshift=1.4em]eq2.east) {};
@@ -34,8 +34,8 @@
 \node [anchor=west] (eq2) at ([xshift=3.0em,yshift=0.0em]eq1.east) {早饭};
 \node [anchor=north] (eq3) at ([xshift=0.0em,yshift=-2.0em]eq1.south) {Have};
 \node [anchor=north] (eq4) at ([xshift=0.0em,yshift=-2.0em]eq2.south) {breakfast};
-\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$a_{2}$};
-\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},a_{2}|\mathbf{t})=0.1$};
+\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\mathbf{a}_{2}$};
+\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},\mathbf{a}_{2}|\mathbf{t})=0.1$};
 \draw [-,very thick](eq1.south) -- (eq4.north);
 \draw [-,very thick](eq2.south) -- (eq3.north);
 \end{scope}
@@ -45,8 +45,8 @@
 \node [anchor=west] (eq2) at ([xshift=3.0em,yshift=0.0em]eq1.east) {早饭};
 \node [anchor=north] (eq3) at ([xshift=0.0em,yshift=-2.0em]eq1.south) {Have};
 \node [anchor=north] (eq4) at ([xshift=0.0em,yshift=-2.0em]eq2.south) {breakfast};
-\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$a_{3}$};
-\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},a_{3}|\mathbf{t})=0.1$};
+\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\mathbf{a}_{3}$};
+\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},\mathbf{a}_{3}|\mathbf{t})=0.1$};
 \draw [-,very thick](eq1.south) -- (eq3.north);
 \draw [-,very thick](eq2.south) -- (eq3.north);
 \end{scope}
@@ -56,8 +56,8 @@
 \node [anchor=west] (eq2) at ([xshift=3.0em,yshift=0.0em]eq1.east) {早饭};
 \node [anchor=north] (eq3) at ([xshift=0.0em,yshift=-2.0em]eq1.south) {Have};
 \node [anchor=north] (eq4) at ([xshift=0.0em,yshift=-2.0em]eq2.south) {breakfast};
-\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$a_{4}$};
-\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},a_{4}|\mathbf{t})=0.1$};
+\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\mathbf{a}_{4}$};
+\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},\mathbf{a}_{4}|\mathbf{t})=0.1$};
 \draw [-,very thick](eq1.south) -- (eq4.north);
 \draw [-,very thick](eq2.south) -- (eq4.north);
 \end{scope}
@@ -67,8 +67,8 @@
 \node [anchor=west] (eq2) at ([xshift=3.0em,yshift=0.0em]eq1.east) {早饭};
 \node [anchor=north] (eq3) at ([xshift=0.0em,yshift=-2.0em]eq1.south) {Have};
 \node [anchor=north] (eq4) at ([xshift=0.0em,yshift=-2.0em]eq2.south) {breakfast};
-\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$a_{5}$};
-\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},a_{5}|\mathbf{t})=0.05$};
+\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\mathbf{a}_{5}$};
+\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},\mathbf{a}_{5}|\mathbf{t})=0.05$};
 \draw [-,very thick](eq1.south) -- (eq3.north);
 \draw [-,very thick](eq1.south) -- (eq4.north);
 \draw [-,very thick](eq2.south) -- (eq3.north);
@@ -82,8 +82,8 @@
 \node [anchor=west] (eq2) at ([xshift=3.0em,yshift=0.0em]eq1.east) {早饭};
 \node [anchor=north] (eq3) at ([xshift=0.0em,yshift=-2.0em]eq1.south) {Have};
 \node [anchor=north] (eq4) at ([xshift=0.0em,yshift=-2.0em]eq2.south) {breakfast};
-\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$a_{6}$};
-\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},a_{6}|\mathbf{t})=0.05$};
+\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\mathbf{a}_{6}$};
+\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},\mathbf{a}_{6}|\mathbf{t})=0.05$};
 \draw [-,very thick](eq1.south) -- (eq3.north);
 \draw [-,very thick](eq2.south) -- (eq4.north);
 \draw [-,very thick](eq2.south) -- (eq3.north);

--- a/Book/Chapter3/Figures/figure-word-alignment.tex
+++ b/Book/Chapter3/Figures/figure-word-alignment.tex
@@ -5,11 +5,11 @@
 {
 {\footnotesize
 \node [anchor=north west,minimum height=2em,minimum width=4em] (s11) at (0,0) {};
-\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2em,minimum width=4em] (s1) at ([xshift=2em]s11.east) {我};
-\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2em,minimum width=4em] (s2) at ([xshift=2em]s1.east) {改变};
-\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2em,minimum width=4em] (s3) at ([xshift=2em]s2.east) {主意};
-\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2em,minimum width=4em] (s4) at ([xshift=2em]s3.east) {了};
-\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2em,minimum width=4em] (s5) at ([xshift=2em]s4.east) {。};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2em,minimum width=4em,fill=green!30,drop shadow] (s1) at ([xshift=2em]s11.east) {我};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2em,minimum width=4em,fill=green!30,drop shadow] (s2) at ([xshift=2em]s1.east) {改变};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2em,minimum width=4em,fill=green!30,drop shadow] (s3) at ([xshift=2em]s2.east) {主意};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2em,minimum width=4em,fill=green!30,drop shadow] (s4) at ([xshift=2em]s3.east) {了};
+\node [rectangle,draw,anchor=west,line width=1pt,minimum height=2em,minimum width=4em,fill=green!30,drop shadow] (s5) at ([xshift=2em]s4.east) {。};

 \node [anchor=south] (nu1) at (s1.north) {1};
 \node [anchor=south] (nu2) at (s2.north) {2};
@@ -20,12 +20,12 @@

 {
 {\footnotesize
-\node [anchor=north,rectangle,draw,line width=1pt,minimum height=2em,minimum width=4em] (t1) at ([yshift=-3.5em]s11.south) {$t_0$};
-\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2em,minimum width=4em] (t2) at ([yshift=-3.5em]s1.south) {I};
-\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2em,minimum width=4em] (t3) at ([yshift=-3.5em]s2.south) {changed};
-\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2em,minimum width=4em] (t4) at ([yshift=-3.5em]s3.south) {my};
-\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2em,minimum width=4em] (t5) at ([yshift=-3.5em]s4.south) {mind};
-\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2em,minimum width=4em] (t6) at ([yshift=-3.5em]s5.south) {.};
+\node [anchor=north,rectangle,draw,line width=1pt,minimum height=2em,minimum width=4em,fill=red!30,drop shadow] (t1) at ([yshift=-3.5em]s11.south) {$t_0$};
+\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2em,minimum width=4em,fill=red!30,drop shadow] (t2) at ([yshift=-3.5em]s1.south) {I};
+\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2em,minimum width=4em,fill=red!30,drop shadow] (t3) at ([yshift=-3.5em]s2.south) {changed};
+\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2em,minimum width=4em,fill=red!30,drop shadow] (t4) at ([yshift=-3.5em]s3.south) {my};
+\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2em,minimum width=4em,fill=red!30,drop shadow] (t5) at ([yshift=-3.5em]s4.south) {mind};
+\node [rectangle,draw,anchor=north,line width=1pt,minimum height=2em,minimum width=4em,fill=red!30,drop shadow] (t6) at ([yshift=-3.5em]s5.south) {.};

 \node [anchor=north] (nd1) at (t2.south) {[1]};
 \node [anchor=north] (nd2) at (t3.south) {[2]};

--- a/Book/Chapter4/Figures/content-of-chart-in-tree-based-decoding.tex
+++ b/Book/Chapter4/Figures/content-of-chart-in-tree-based-decoding.tex
@@ -82,7 +82,7 @@
 \node[anchor=west](b8) at ([xshift=0em,yshift=-1.5em]b7.west){{VP}};
 \node[anchor=west](b9) at ([xshift=0em,yshift=-1.5em]b8.west){{N/A}};
 \node[anchor=west](b10) at ([xshift=0em,yshift=-1.5em]b9.west){{VP}};
-\node[anchor=west](b11) at ([xshift=0em,yshift=-1.5em]b10.west){{IP({\red(root)})}};
+\node[anchor=west](b11) at ([xshift=0em,yshift=-1.5em]b10.west){{IP({\red root})}};

 \node[anchor=west](y2) at ([xshift=0.2em,yshift=-1.7em]y1.west){{猫}};
 \node[anchor=west](y3) at ([xshift=0em,yshift=-1.5em]y2.west){{喜欢}};

--- a/Book/Chapter4/Figures/process-of-machine-translation-base-phrase.tex
+++ b/Book/Chapter4/Figures/process-of-machine-translation-base-phrase.tex
@@ -4,21 +4,21 @@
 \begin{tikzpicture}
 \begin{scope}

-\tikzstyle{datanode} = [minimum width=7em,minimum height=1.7em,fill=red!20,rounded corners=0.7em];
-\tikzstyle{modelnode} = [minimum width=7em,minimum height=1.7em,fill=blue!20,rounded corners=0.2em];
-\tikzstyle{decodingnode} = [minimum width=7em,minimum height=1.7em,fill=green!20,rounded corners=0.2em];
+\tikzstyle{datanode} = [minimum width=7em,minimum height=1.7em,fill=red!20,rounded corners=0.3em];
+\tikzstyle{modelnode} = [minimum width=7em,minimum height=1.7em,fill=blue!20,rounded corners=0.3em];
+\tikzstyle{decodingnode} = [minimum width=7em,minimum height=1.7em,fill=green!20,rounded corners=0.3em];

-\node [datanode,draw,anchor=north west] (bitext) at (0,0) {{ \scriptsize{训练用双语数据}}};
-\node [modelnode,draw,anchor=north] (phrase) at ([yshift=-1.5em]bitext.south) {{ \scriptsize{短语抽取及打分}}};
-\node [modelnode,draw,anchor=west] (reorder) at ([xshift=1.5em]phrase.east) {{ \scriptsize{调序建模}}};
-\node [modelnode,draw,anchor=west] (lm) at ([xshift=1.5em]reorder.east) {{ \scriptsize{语言建模}}};
-\node [datanode,draw,anchor=south] (monotext) at ([yshift=1.5em]lm.north) {{ \scriptsize{目标语单语数据}}};
+\node [datanode,anchor=north west] (bitext) at (0,0) {{ \scriptsize{训练用双语数据}}};
+\node [modelnode,anchor=north] (phrase) at ([yshift=-1.5em]bitext.south) {{ \scriptsize{短语抽取及打分}}};
+\node [modelnode,anchor=west] (reorder) at ([xshift=1.5em]phrase.east) {{ \scriptsize{调序建模}}};
+\node [modelnode,anchor=west] (lm) at ([xshift=1.5em]reorder.east) {{ \scriptsize{语言建模}}};
+\node [datanode,anchor=south] (monotext) at ([yshift=1.5em]lm.north) {{ \scriptsize{目标语单语数据}}};

-\node [datanode,draw,anchor=north] (phrasetable) at ([yshift=-1.5em]phrase.south) {{ \scriptsize{短语表}}};
-\node [datanode,draw,anchor=north] (reordertable) at ([yshift=-1.5em]reorder.south) {{ \scriptsize{调序模型}}};
-\node [datanode,draw,anchor=north] (lmtable) at ([yshift=-1.5em]lm.south) {{ \scriptsize{语言模型}}};
+\node [datanode,anchor=north] (phrasetable) at ([yshift=-1.5em]phrase.south) {{ \scriptsize{短语表}}};
+\node [datanode,anchor=north] (reordertable) at ([yshift=-1.5em]reorder.south) {{ \scriptsize{调序模型}}};
+\node [datanode,anchor=north] (lmtable) at ([yshift=-1.5em]lm.south) {{ \scriptsize{语言模型}}};

-\node [decodingnode,draw,anchor=north] (decoding) at ([yshift=-2em]reordertable.south) {{ \scriptsize{解码器}}};
+\node [decodingnode,anchor=north] (decoding) at ([yshift=-2em]reordertable.south) {{ \scriptsize{解码器}}};

 \draw [->,very thick] ([yshift=-0.1em]bitext.south) -- ([yshift=0.1em]phrase.north);
 \draw [->,very thick] (bitext.south east) -- ([yshift=0.1em]reorder.north west);

--- a/Book/Chapter4/Figures/processing-of-hierarchical-phrase-system.tex
+++ b/Book/Chapter4/Figures/processing-of-hierarchical-phrase-system.tex
@@ -4,9 +4,9 @@
 \begin{tikzpicture}
 \begin{scope}

-\tikzstyle{datanode} = [minimum width=7em,minimum height=1.7em,fill=blue!20,draw,rounded corners=0.7em];
-\tikzstyle{modelnode} = [minimum width=7em,minimum height=1.7em,fill=red!20,draw,rounded corners=0.2em];
-\tikzstyle{decodingnode} = [minimum width=7em,minimum height=1.7em,fill=green!20,draw,rounded corners=0.2em];
+\tikzstyle{datanode} = [minimum width=7em,minimum height=1.7em,fill=blue!20,rounded corners=0.3em];
+\tikzstyle{modelnode} = [minimum width=7em,minimum height=1.7em,fill=red!20,rounded corners=0.3em];
+\tikzstyle{decodingnode} = [minimum width=7em,minimum height=1.7em,fill=green!20,rounded corners=0.3em];

 \node [datanode,anchor=north west] (bitext) at (0,0) {{ \scriptsize{训练用双语数据}}};
 \node [modelnode, anchor=north west] (gi) at ([xshift=2em,yshift=-0.2em]bitext.south east) {{ \scriptsize{文法(规则)抽取}}};

--- a/Book/Chapter4/Figures/tree-fragment-to-string-mapping.tex
+++ b/Book/Chapter4/Figures/tree-fragment-to-string-mapping.tex
@@ -12,10 +12,14 @@
 \path [draw, ->, thick] ([xshift=1em]sn3.east) -- ([xshift=2.5em]sn3.east);

 \node [anchor=west] (tw1) at ([xshift=3.5em]sn3.east) {increases};
-\node [anchor=west] (tw2) at ([xshift=0.3em]tw1.east) {NN};
+\node [anchor=west,fill=red!20] (tw2) at ([xshift=0.3em]tw1.east) {NN};

 \draw[dotted,thick] ([yshift=-0.1em]sn3.south)..controls +(south:1.2) and +(south: 1.2)..([yshift=-0.1em]tw2.south);

+\begin{pgfonlayer}{background}
+\node [rectangle,inner sep=0em,fill=red!20] [fit = (sn3)] (nn1) {};
+\end{pgfonlayer}
+
 \end{scope}

 \end{tikzpicture}

--- a/Book/Chapter4/chapter4.tex
+++ b/Book/Chapter4/chapter4.tex
--- a/Book/Chapter5/Figures/fig-weather-forward.tex
+++ b/Book/Chapter5/Figures/fig-weather-forward.tex
@@ -3,46 +3,46 @@

 \node [anchor=west,minimum width=1.5em,minimum height=1.5em] (part1) at (0,0) {\footnotesize{$y$}};
 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part1-2) at ([xshift=-1.6em,yshift=-0.3em]part1.south) {\scriptsize {$\rm {shape(1)}$}};
-\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em] (part2) at ([yshift=-1.5em]part1.south) {\footnotesize {$\rm{sigmoid}$}};
+\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em,fill=orange!20] (part2) at ([yshift=-1.5em]part1.south) {\footnotesize {$\rm{sigmoid}$}};
 \draw [-,thick](part1.south)--(part2.north);

 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part2-2) at ([xshift=-1.6em,yshift=-0.3em]part2.south) {\scriptsize {$\rm{shape(1)}$}};
-\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em] (part3) at ([yshift=-1.5em]part2.south) {\footnotesize {$\rm{ADD}$}};
+\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em,fill=green!20] (part3) at ([yshift=-1.5em]part2.south) {\footnotesize {$\rm{ADD}$}};
 \draw [-,thick](part2.south)--(part3.north);

 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part3-2) at ([xshift=-1.6em,yshift=-0.3em]part3.south) {\scriptsize {$\rm {shape(1)}$}};
-\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em] (part4) at ([yshift=-1.5em]part3.south) {\footnotesize {$\rm{MUL}$}};
+\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em,fill=blue!20] (part4) at ([yshift=-1.5em]part3.south) {\footnotesize {$\rm{MUL}$}};
 \draw [-,thick](part3.south)--(part4.north);

 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part4-2) at ([xshift=-1.6em,yshift=-0.2em]part4.south) {\scriptsize {$\rm {shape(2)}$}};
 \node [anchor=north,minimum width=4.0em,minimum height=1.5em] (part5) at ([yshift=-1.4em]part4.south) {\footnotesize {$\mathbf a$}};
 \draw [-,thick](part4.south)--([yshift=-0.1em]part5.north);
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw] (part5-3) at ([xshift=0.0em,yshift=0.1em]part5.east) {\footnotesize {$\mathbf w^2$}};
-\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw] (part5-4) at ([xshift=2.0em,yshift=0.0em]part5-3.east) {\footnotesize {$\mathbf b^2$}};
+\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw,fill=red!20] (part5-3) at ([xshift=0.0em,yshift=0.1em]part5.east) {\footnotesize {$\mathbf w^2$}};
+\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw,fill=orange!40] (part5-4) at ([xshift=2.0em,yshift=0.0em]part5-3.east) {\footnotesize {$\mathbf b^2$}};
 \draw[-,thick](part4.south)--(part5-3.north);
 \draw[-,thick](part3.south)--(part5-4.north);
 \node [anchor=south,minimum width=1.5em,minimum height=1.5em] (part5-3-1) at ([xshift=1.3em,yshift=-0.45em]part5-3.north) {\scriptsize {$\rm{shape(2)}$}};
 \node [anchor=south,minimum width=1.5em,minimum height=1.5em] (part5-4-1) at ([xshift=1.3em,yshift=-0.45em]part5-4.north) {\scriptsize {$\rm{shape(1)}$}};
 %%%%%%%%%%%%%%%%%%%%%%%%%%
 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part5-2) at ([xshift=-1.6em,yshift=-0.2em]part5.south) {\scriptsize {$\rm{shape(2)}$}};
-\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em] (part6) at ([yshift=-1.4em]part5.south) {\footnotesize {$\rm{tanh}$}};
+\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em,fill=yellow!20] (part6) at ([yshift=-1.4em]part5.south) {\footnotesize {$\rm{tanh}$}};
 \draw [-,thick]([yshift=0.1em]part5.south)--(part6.north);

 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part6-2) at ([xshift=-1.6em,yshift=-0.3em]part6.south) {\scriptsize {$\rm{shape(2)}$}};
-\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em] (part7) at ([yshift=-1.5em]part6.south) {\footnotesize {$\rm{ADD}$}};
+\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em,fill=green!20] (part7) at ([yshift=-1.5em]part6.south) {\footnotesize {$\rm{ADD}$}};
 \draw [-,thick](part6.south)--(part7.north);

 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part7-2) at ([xshift=-1.6em,yshift=-0.3em]part7.south) {\scriptsize {$\rm{shape(2)}$}};
-\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em] (part8) at ([yshift=-1.5em]part7.south) {\footnotesize {$\rm{MUL}$}};
+\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em,fill=blue!20] (part8) at ([yshift=-1.5em]part7.south) {\footnotesize {$\rm{MUL}$}};
 \draw [-,thick](part7.south)--(part8.north);

 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part8-2) at ([xshift=-1.6em,yshift=-0.2em]part8.south) {\scriptsize{$\rm{shape(2)}$}};
 \node [anchor=north,minimum width=4.0em,minimum height=1.5em] (part9) at ([yshift=-1.4em]part8.south) {\footnotesize {$\mathbf x$}};
 \draw [-,thick](part8.south)--([yshift=-0.1em]part9.north);
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw] (part9-3) at ([xshift=0.0em,yshift=0.1em]part9.east) {\footnotesize {$\mathbf w’$}};
-\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw] (part9-4) at ([xshift=2.0em,yshift=0.0em]part9-3.east) {\footnotesize {$\mathbf b’$}};
+\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw,fill=red!20] (part9-3) at ([xshift=0.0em,yshift=0.1em]part9.east) {\footnotesize {$\mathbf w^1$}};
+\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw,fill=orange!40] (part9-4) at ([xshift=2.0em,yshift=0.0em]part9-3.east) {\footnotesize {$\mathbf b^1$}};
 \draw[-,thick](part8.south)--(part9-3.north);
 \draw[-,thick](part7.south)--(part9-4.north);
 \node [anchor=south,minimum width=1.5em,minimum height=1.5em] (part9-3-1) at ([xshift=1.5em,yshift=-0.45em]part9-3.north) {\scriptsize {$\rm{shape(3,2)}$}};

--- a/Book/Chapter5/Figures/fig-weather.tex
+++ b/Book/Chapter5/Figures/fig-weather.tex
@@ -2,23 +2,23 @@
 \begin{tikzpicture}
 \begin{scope}
 %左
-\node [anchor=west,draw=ublue,minimum width=2.5em] (part1-1) at (0,0) {\footnotesize{天空状况}};
-\node [anchor=north,draw=ublue,minimum width=2.5em] (part1-2) at ([yshift=-2em]part1-1.south) {\footnotesize {低空气温}};
-\node [anchor=north,draw=ublue,minimum width=2.5em] (part1-3) at ([yshift=-2em]part1-2.south) {\footnotesize {水平气压}};
-\node [anchor=north,minimum width=2.5em] (part1-4) at ([yshift=-1.0em]part1-3.south) {\footnotesize {输入层}};
-\node[anchor=south,minimum height=12em,minimum width=5.0em,draw=ublue,dotted,thick] (part1out) at ([xshift=0.0em,yshift=-8em]part1-2.north) {};
+\node [anchor=west,draw=ublue,minimum width=2.5em,fill=yellow!20] (part1-1) at (0,0) {\scriptsize{天空状况}};
+\node [anchor=north,draw=ublue,minimum width=2.5em,fill=yellow!20] (part1-2) at ([yshift=-1.7em]part1-1.south) {\scriptsize {低空气温}};
+\node [anchor=north,draw=ublue,minimum width=2.5em,fill=yellow!20] (part1-3) at ([yshift=-1.7em]part1-2.south) {\scriptsize {水平气压}};
+\node [anchor=north,minimum width=2.5em] (part1-4) at ([yshift=-0.5em]part1-3.south) {\scriptsize {输入层}};
+

 %中
-\node [circle,anchor=west,draw=ublue,minimum width=2.2em] (part2-1) at ([xshift=2.0em,yshift=1.5em]part1-2.east) {\footnotesize {温度}};
-\node [circle,anchor=west,draw=ublue,minimum width=2.2em] (part2-2) at ([xshift=2.0em,yshift=-1.5em]part1-2.east) {\footnotesize {风速}};
-\node [anchor=north,minimum width=3.0em] (part2-3) at ([xshift=0.0em,yshift=-2.42em]part2-2.south) {\footnotesize{隐藏层}};
+\node [circle,anchor=west,draw=ublue,minimum width=2.0em,fill=blue!20] (part2-1) at ([xshift=2.0em,yshift=1.5em]part1-2.east) {\scriptsize {温度}};
+\node [circle,anchor=west,draw=ublue,minimum width=2.0em,fill=blue!20] (part2-2) at ([xshift=2.0em,yshift=-1.5em]part1-2.east) {\scriptsize {风速}};
+\node [anchor=north,minimum width=3.0em] (part2-3) at ([xshift=0.0em,yshift=-1.52em]part2-2.south) {\scriptsize{隐藏层}};
 \node [anchor=north] (labela) at ([xshift=0.0em,yshift=-1em]part2-3.south) {\footnotesize {(a)}};
-\node[anchor=south,minimum height=12em,minimum width=4.0em,draw=ublue,dotted,thick] (part2out) at ([xshift=5.2em,yshift=-8em]part1-2.north) {};

 %右
-\node [anchor=west,draw=ublue,minimum width=3.0em] (part3-1) at ([xshift=6.5em,yshift=0.0em]part1-2.east) {\footnotesize {穿衣指数}};
-\node [anchor=north,minimum width=3.0em] (part3-2) at ([yshift=-4.65em]part3-1.south) {\footnotesize{输出层}};
-\node[anchor=south,minimum height=12em,minimum width=5em,draw=ublue,dotted,thick] (part3out) at ([xshift=10.5em,yshift=-8em]part1-2.north) {};
+\node [anchor=west,draw=ublue,minimum width=3.0em,fill=purple!20] (part3-1) at ([xshift=5.8em,yshift=0.0em]part1-2.east) {\scriptsize {穿衣指数}};
+\node [anchor=north,minimum width=3.0em] (part3-2) at ([yshift=-3.6em]part3-1.south) {\scriptsize{输出层}};
+\node[anchor=south,minimum height=11em,minimum width=15.0em,draw=ublue,dotted,thick] (part2out) at ([xshift=4.8em,yshift=-7em]part1-2.north) {};
+

 %连线

@@ -32,26 +32,27 @@
 \draw [->,thick,ublue](part2-2.east)--(part3-1.west);
 \end{scope}

-\begin{scope}[xshift=3in]
+\begin{scope}[xshift=2.8in]
 %左
-\node [anchor=west,draw=ublue,minimum width=1.5em,minimum height=1.5em] (part1-1) at (0,0) {\footnotesize{$x_1$}};
-\node [anchor=north,draw=ublue,minimum width=1.5em,minimum height=1.5em] (part1-2) at ([yshift=-2em]part1-1.south) {\footnotesize{$x_2$}};
-\node [anchor=north,draw=ublue,minimum width=1.5em,minimum height=1.5em] (part1-3) at ([yshift=-2em]part1-2.south) {\footnotesize{$x_3$}};
-\node [anchor=north,minimum width=3.0em] (part1-4) at ([yshift=-1.0em]part1-3.south) {\footnotesize {输入层}};
-\node[anchor=south,minimum height=12em,minimum width=3.4em,draw=ublue,dotted,thick] (part1out) at ([xshift=0.0em,yshift=-8em]part1-2.north) {};
+\node [anchor=west,draw=ublue,minimum width=1.5em,minimum height=1.5em,fill=yellow!20] (part1-1) at (0,0) {\footnotesize{$x_1$}};
+\node [anchor=north,draw=ublue,minimum width=1.5em,minimum height=1.5em,fill=yellow!20] (part1-2) at ([yshift=-1.6em]part1-1.south) {\footnotesize{$x_2$}};
+\node [anchor=north,draw=ublue,minimum width=1.5em,minimum height=1.5em,fill=yellow!20] (part1-3) at ([yshift=-1.6em]part1-2.south) {\footnotesize{$x_3$}};
+\node [anchor=north,minimum width=3.0em] (part1-4) at ([yshift=-0.5em]part1-3.south) {\scriptsize {输入层}};
+

 %中
-\node [circle,anchor=west,draw=ublue,minimum width=2.0em] (part2-1) at ([xshift=2.2em,yshift=1.5em]part1-2.east) {\footnotesize{$a_1$}};
-\node [circle,anchor=west,draw=ublue,minimum width=2.0em] (part2-2) at ([xshift=2.2em,yshift=-1.5em]part1-2.east) {\footnotesize {$a_2$}};
-\node [anchor=north,minimum width=3.0em] (part2-3) at ([xshift=0.0em,yshift=-2.79em]part2-2.south) {\footnotesize {隐藏层}};
+\node [circle,anchor=west,draw=ublue,minimum width=2.0em,fill=blue!20] (part2-1) at ([xshift=1.8em,yshift=1.5em]part1-2.east) {\footnotesize{$a_1$}};
+\node [circle,anchor=west,draw=ublue,minimum width=2.0em,fill=blue!20] (part2-2) at ([xshift=1.8em,yshift=-1.5em]part1-2.east) {\footnotesize {$a_2$}};
+\node [anchor=north,minimum width=3.0em] (part2-3) at ([xshift=0.0em,yshift=-1.9em]part2-2.south) {\scriptsize {隐藏层}};
 \node [anchor=north] (labelb) at ([xshift=1em,yshift=-1em]part2-3.south) {\footnotesize {(b)}};
-\node[anchor=south,minimum height=12em,minimum width=3.4em,draw=ublue,dotted,thick] (part2out) at ([xshift=4em,yshift=-8em]part1-2.north) {};
+

 %右
-\node [circle,anchor=west,draw=ublue,minimum width=2.0em] (part3-1) at ([xshift=6.2em,yshift=0.0em]part1-2.east) {\footnotesize{$y$}};
-\node [anchor=west,draw=ublue,minimum width=1.5em,minimum height=1.5em] (part3-2) at ([xshift=1.2em]part3-1.east) {\footnotesize {$y$}};
-\node [anchor=north,minimum width=3.0em] (part3-3) at ([xshift=1.4em,yshift=-4.3em]part3-1.south) {\footnotesize {输出层}};
-\node[anchor=south,minimum height=12em,minimum width=5.5em,draw=ublue,dotted,thick] (part3out) at ([xshift=9.2em,yshift=-8em]part1-2.north) {};
+\node [circle,anchor=west,draw=ublue,minimum width=2.0em,fill=purple!20] (part3-1) at ([xshift=5.2em,yshift=0.0em]part1-2.east) {\footnotesize{$y$}};
+\node [anchor=west,draw=ublue,minimum width=1.5em,minimum height=1.5em,fill=red!40] (part3-2) at ([xshift=1.0em]part3-1.east) {\footnotesize {$y$}};
+\node [anchor=north,minimum width=3.0em] (part3-3) at ([xshift=1.4em,yshift=-3.45em]part3-1.south) {\scriptsize {输出层}};
+\node[anchor=south,minimum height=11em,minimum width=14.0em,draw=ublue,dotted,thick] (part2out) at ([xshift=4.9em,yshift=-7em]part1-2.north) {};
+

 %连线

@@ -67,5 +68,6 @@
 \end{scope}

 \end{tikzpicture}
+%%%------------------------------------------------------------------------------------------------------------
 %%------------------------------------------------------------------------------------------------------------

--- a/Book/Chapter5/chapter5.tex
+++ b/Book/Chapter5/chapter5.tex
--- a/Book/Chapter6/Chapter6.tex
+++ b/Book/Chapter6/Chapter6.tex
--- a/Book/ChapterAppend/ChapterAppend.tex
+++ b/Book/ChapterAppend/ChapterAppend.tex
@@ -26,121 +26,96 @@
 \section{IBM模型3训练方法}
 \parinterval 模型3的参数估计与模型1和模型2采用相同的方法。这里直接给出辅助函数。
 \begin{eqnarray}
-h(t,d,n,p, \lambda,\mu, \nu, \zeta) & = &  \textrm{P}_{\theta}(\mathbf{s}|\mathbf{t})-\sum_{e}\lambda_{e}(\sum_{s}t(\mathbf{s}|\mathbf{t})-1)-\sum_{i}\mu_{iml}(\sum_{j}d(j|i,m,l)-1) \nonumber \\
-& & -\sum_{e}\nu_{e}(\sum_{\varphi}n(\varphi|e)-1)-\zeta(p^0+p^1-1)
+h(t,d,n,p, \lambda,\mu, \nu, \zeta) & = &  \textrm{P}_{\theta}(\mathbf{s}|\mathbf{t})-\sum_{t}\lambda_{t}\big(\sum_{s}t(s|t)-1\big)  \nonumber \\
+& & -\sum_{i}\mu_{iml}\big(\sum_{j}d(j|i,m,l)-1\big) \nonumber \\
+& & -\sum_{t}\nu_{t}\big(\sum_{\varphi}n(\varphi|t)-1\big)-\zeta(p^0+p^1-1)
 \label{eq:1.1}
 \end{eqnarray}
 %----------------------------------------------
 \parinterval 由于篇幅所限这里略去了推导步骤直接给出一些用于参数估计的等式。
 \begin{eqnarray}
-c(s|t,\mathbf{s},\mathbf{t}) = \sum_{\mathbf{a}}(\textrm{p}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times \sum_{i=1}^{m} (\delta(s_i,\mathbf{s}) \cdot \delta(t_{a_{i}},\mathbf{t})))
-\label{eq:1.2}
-\end{eqnarray}
-\begin{eqnarray}
-c(i|j,m,l;\mathbf{s},\mathbf{t}) = \sum_{\mathbf{a}}(\textrm{p}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times \delta(j,a_i))
-\label{eq:1.3}
-\end{eqnarray}
-\begin{eqnarray}
-c(\varphi|e;\mathbf{s},\mathbf{t}) = \sum_{\mathbf{a}}(\textrm{p}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times \sum_{j=1}^{l}\delta(\varphi,\varphi_{j})\delta(e,e_j))
+c(s|t,\mathbf{s},\mathbf{t}) & = & \sum_{\mathbf{a}}\big[\textrm{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times \sum_{j=1}^{m} (\delta(s_j,s) \cdot \delta(t_{a_{j}},t))\big] \label{eq:1.2} \\
+c(j|i,m,l;\mathbf{s},\mathbf{t}) & = & \sum_{\mathbf{a}}\big[\textrm{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times \delta(i,a_j)\big] \label{eq:1.3} \\
+c(\varphi|t;\mathbf{s},\mathbf{t}) & = & \sum_{\mathbf{a}}\big[\textrm{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times \sum_{i=1}^{l}\delta(\varphi,\varphi_{i})\delta(t,t_i)\big]
 \label{eq:1.4}
 \end{eqnarray}
+
 \begin{eqnarray}
-c(0|\mathbf{s},\mathbf{t}) = \sum_{\mathbf{a}}(\textrm{p}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t})  \times (m-2\varphi_0) )
-\label{eq:1.5}
-\end{eqnarray}
-\begin{eqnarray}
-c(1|\mathbf{s},\mathbf{t}) = \sum_{\mathbf{a}}(\textrm{p}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times \varphi_0)
-\label{eq:1.6}
+c(0|\mathbf{s},\mathbf{t}) & = & \sum_{\mathbf{a}}\big[\textrm{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t})  \times (m-2\varphi_0) \big] \label{eq:1.5} \\
+c(1|\mathbf{s},\mathbf{t}) & = & \sum_{\mathbf{a}}\big[\textrm{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times \varphi_0 \big] \label{eq:1.6}
 \end{eqnarray}
 %----------------------------------------------

-\parinterval 进一步，
+\parinterval 进一步，对于由$K$个样本组成的训练集，有：
 \begin{eqnarray}
-t(\mathbf{s}|\mathbf{t}) = \lambda_{t}^{-1} \times \sum_{k=1}^{S}c(\mathbf{s}|\mathbf{t};\mathbf{s}(k),\mathbf{t}(k))
-\label{eq:1.7}
-\end{eqnarray}
-\begin{eqnarray}
-d(i|j,m,l) = \mu_{jml}^{-1} \times \sum_{k=1}^{S}c(i|j,m,l;\mathbf{s}(k),\mathbf{t}(k))
-\label{eq:1.8}
-\end{eqnarray}
-\begin{eqnarray}
-n(\varphi|\mathbf{t}) = \nu_{t}^{-1} \times \sum_{s=1}^{S}c(\varphi |t;\mathbf{s}(k),\mathbf{t}(k))
-\label{eq:1.9}
-\end{eqnarray}
-\begin{eqnarray}
-pk = \zeta^{-1} \sum_{k=1}^{S}c(k;\mathbf{s}(k),\mathbf{t}(k))
-\label{eq:1.10}
+t(s|t) & = & \lambda_{t}^{-1} \times \sum_{k=1}^{K}c(s|t;\mathbf{s}^{[k]},\mathbf{t}^{[k]}) \label{eq:1.7} \\
+d(j|i,m,l) & = & \mu_{iml}^{-1} \times \sum_{k=1}^{K}c(j|i,m,l;\mathbf{s}^{[k]},\mathbf{t}^{[k]}) \label{eq:1.8} \\
+n(\varphi|t) & = & \nu_{t}^{-1} \times \sum_{k=1}^{K}c(\varphi |t;\mathbf{s}^{[k]},\mathbf{t}^{[k]}) \label{eq:1.9} \\
+p_x & = & \zeta^{-1} \sum_{k=1}^{K}c(x;\mathbf{s}^{[k]},\mathbf{t}^{[k]}) \label{eq:1.10}
 \end{eqnarray}
 %----------------------------------------------

-\parinterval 在模型3中，因为产出率的引入，我们并不能像在模型1和模型2中那样，在保证正确性的情况下加速参数估计的过程。这就使得每次迭代过程中，我们都不得不面对大小为$(l+1)^m$的词对齐空间。遍历所有$(l+1)^m$个词对齐所带来的高时间复杂度显然是不能被接受的。因此就要考虑是不是可以仅利用词对齐空间中的部分词对齐对这些参数进行估计。比较简单且直接的方法就是仅利用Viterbi对齐来进行参数估计。遗憾的是，在模型3中我们没有方法直接获得Viterbi对齐。这样只能采用一种折中的方法，即仅考虑那些使得$\textrm{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t})$值较高的词对齐。这里把这部分词对齐组成的集合记为S。式(\ref{eq:1.2})可以被修改为，
+\parinterval 在模型3中，因为产出率的引入，并不能像模型1和模型2那样，在保证正确性的情况下加速参数估计的过程。这就使得每次迭代过程中，都不得不面对大小为$(l+1)^m$的词对齐空间。遍历所有$(l+1)^m$个词对齐所带来的高时间复杂度显然是不能被接受的。因此就要考虑能否仅利用词对齐空间中的部分词对齐对这些参数进行估计。比较简单且直接的方法就是仅利用Viterbi对齐来进行参数估计\footnote{Viterbi词对齐可以被简单地看作搜索到的最好词对齐。}。遗憾的是，在模型3中并没有方法直接获得Viterbi对齐。这样只能采用一种折中的策略，即仅考虑那些使得$\textrm{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t})$达到较高值的词对齐。这里把这部分词对齐组成的集合记为$S$。式\ref{eq:1.2}可以被修改为：
 \begin{eqnarray}
-c(s|t,\mathbf{s},\mathbf{t}) \approx \sum_{\mathbf{a} \in \mathbf{S}}(\textrm{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times \sum_{i=1}^{m}(\delta(s_i,\mathbf{s}) \cdot \delta(t_{a_{i}},\mathbf{t})))
+c(s|t,\mathbf{s},\mathbf{t}) \approx \sum_{\mathbf{a} \in S}\big[\textrm{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times \sum_{j=1}^{m}(\delta(s_j,s) \cdot \delta(t_{a_{j}},t)) \big]
 \label{eq:1.11}
 \end{eqnarray}
 %----------------------------------------------

-\parinterval 同理可以获得式(\ref{eq:1.3})、式(\ref{eq:1.4})、式(\ref{eq:1.5})和式(\ref{eq:1.6})的修改结果。
+\parinterval 同理可以获得式\ref{eq:1.3}-\ref{eq:1.6}的修改结果。进一步，在IBM模型3中，可以如下定义$S$：

-\parinterval 在模型3中，可以如下定义\textrm{S}
 \begin{eqnarray}
-\textrm{S} = N(b^{\infty}(V(\mathbf{s}|\mathbf{t};2))) \cup (\mathop{\cup}\limits_{ij} N(b_{i \leftrightarrow j}^{\infty}(V_{i \leftrightarrow j}(\mathbf{s}|\mathbf{t},2))))
+S = N(b^{\infty}(V(\mathbf{s}|\mathbf{t};2))) \cup (\mathop{\cup}\limits_{ij} N(b_{i \leftrightarrow j}^{\infty}(V_{i \leftrightarrow j}(\mathbf{s}|\mathbf{t},2))))
 \label{eq:1.12}
 \end{eqnarray}
 %----------------------------------------------

-\parinterval 其中 $b^{\infty}(V(\mathbf{s}|\mathbf{t};2))$ 和 $b_{i \leftrightarrow j}^{\infty}(V_{i \leftrightarrow j}(\mathbf{s}|\mathbf{t},2))$ 分别是对 $V(\mathbf{s}|\mathbf{t};3)$ 和 $V_{i \leftrightarrow j}(\mathbf{s}|\mathbf{t},3)$ 的估计。在计算\textrm{S}的过程中，我们需要知道一个对齐$\bf{a}$的邻居$\bf{a}'$的概率，即如何通过$\textrm{p}_{\theta}(\mathbf{a},\mathbf{s}|\mathbf{t})$计算$\textrm{p}_{\theta}(\mathbf{a}',\mathbf{s}|\mathbf{t})$。在模型3总，如果$\bf{a}$和$\bf{a}'$区别于某个源语单词的对齐到的目标位置上（$a_j$不等于$a_{j}'$），那么
-\begin{small}
+\parinterval 为了理解这个公式，先介绍几个概念。
+
+\begin{itemize}
+\item $V(\mathbf{s}|\mathbf{t})$表示Viterbi词对齐，$V(\mathbf{s}|\mathbf{t},1)$、$V(\mathbf{s}|\mathbf{t},2)$和$V(\mathbf{s}|\mathbf{t},3)$就分别对应了模型1、2和3的Viterbi词对齐；
+\item 把那些满足第$j$个源语言单词对应第$i$个目标语言单词（$a_j=i$）的词对齐构成的集合记为$\mathbf{A}_{i \leftrightarrow j}(\mathbf{s},\mathbf{t})$。通常称这些对齐中$j$和$i$被``钉''在了一起。在$\mathbf{A}_{i \leftrightarrow j}(\mathbf{s},\mathbf{t})$中使$\textrm{P}(\mathbf{a}|\mathbf{s},\mathbf{t})$达到最大的那个词对齐被记为$V_{i \leftrightarrow j}(\mathbf{s},\mathbf{t})$；
+\item 如果两个词对齐，通过交换两个词对齐连接就能互相转化，则称它们为邻居。一个词对齐$\mathbf{a}$的所有邻居记为$N(\mathbf{a})$。
+\end{itemize}
+
+\vspace{0.3em}
+\parinterval 公式\ref{eq:1.12}中，$b^{\infty}(V(\mathbf{s}|\mathbf{t};2))$ 和 $b_{i \leftrightarrow j}^{\infty}(V_{i \leftrightarrow j}(\mathbf{s}|\mathbf{t},2))$ 分别是对 $V(\mathbf{s}|\mathbf{t};3)$ 和 $V_{i \leftrightarrow j}(\mathbf{s}|\mathbf{t},3)$ 的估计。在计算$S$的过程中，需要知道一个对齐$\mathbf{a}$的邻居$\mathbf{a}'$的概率，即通过$\textrm{P}_{\theta}(\mathbf{a},\mathbf{s}|\mathbf{t})$计算$\textrm{P}_{\theta}(\mathbf{a}',\mathbf{s}|\mathbf{t})$。在模型3中，如果$\mathbf{a}$和$\mathbf{a}'$仅区别于某个源语单词对齐到的目标位置上（$a_j \neq a_{j}'$，记$i=a_j$、$i'=a_{j}'$），那么
+
 \begin{eqnarray}
-\textrm{p}_{\theta}(\mathbf{a}',\mathbf{s}|\mathbf{t}) = \textrm{p}_{\theta}(\mathbf{a},\mathbf{s}|\mathbf{t}) \cdot \frac{\varphi_{j'}+1}{\varphi_j} \cdot \frac{n(\varphi_{j'}+1|t_{j'})}{n(\varphi_{j'}|t_{j'})} \cdot \frac{n(\varphi_{j-1}|t_{j})}{n(\varphi_{j}|t_{j})} \cdot \frac{t(s_i|t_{j'})}{t(s_{i}|t_{j})} \cdot \frac{d(i|j',m,l)}{d(i|j,m,l)}
+\textrm{P}_{\theta}(\mathbf{a}',\mathbf{s}|\mathbf{t}) & = & \textrm{P}_{\theta}(\mathbf{a},\mathbf{s}|\mathbf{t}) \cdot  \nonumber \\
+                                                                                   &     & \frac{\varphi_{i'}+1}{\varphi_i} \cdot \frac{n(\varphi_{i'}+1|t_{i'})}{n(\varphi_{i'}|t_{i'})} \cdot \frac{n(\varphi_{i}-1|t_{i})}{n(\varphi_{i}|t_{i})} \cdot \nonumber \\
+                                                                                   &     & \frac{t(s_j|t_{i'})}{t(s_{j}|t_{i})} \cdot \frac{d(j|i',m,l)}{d(j|i,m,l)}
 \label{eq:1.13}
 \end{eqnarray}
-\end{small}
 %----------------------------------------------

-\parinterval 如果$\bf{a}$和$\bf{a}'$区别于两个位置$i_1$和$i_2$的对齐上，$a_{j_{1}}=a{j_{2}}'$且$a_{j_{2}}=a{j_{1}}'$，那么
+\parinterval 如果$\mathbf{a}$和$\mathbf{a}'$区别于两个位置$j_1$和$j_2$的对齐上，$a_{j_{1}}=a_{j_{2}}'$且$a_{j_{2}}=a_{j_{1}}'$，那么
 \begin{eqnarray}
-\textrm{P}_{\theta}(\mathbf{a'},\mathbf{s}|\mathbf{t}) = \textrm{P}_{\theta}(\mathbf{a},\mathbf{s}|\mathbf{t}) \cdot \frac{t(s_{i_{2}}|t_{a_{i_{2}}})}{t(s_{i_{1}}|t_{a{i_{1}}})} \cdot \frac{d(i_{2})|a{i_{2}},m,l)}{d(i_{1}|a_{i_{1}},m,l)}
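+\textrm{P}_{\theta}(\mathbf{a}',\mathbf{s}|\mathbf{t}) & = & \textrm{P}_{\theta}(\mathbf{a},\mathbf{s}|\mathbf{t}) \cdot \frac{t(s_{j_{1}}|t_{a_{j_{2}}})}{t(s_{j_{1}}|t_{a_{j_{1}}})} \cdot \frac{t(s_{j_{2}}|t_{a_{j_{1}}})}{t(s_{j_{2}}|t_{a_{j_{2}}})} \cdot \nonumber \\
+                                                       &   & \frac{d(j_{1}|a_{j_{2}},m,l)}{d(j_{1}|a_{j_{1}},m,l)} \cdot \frac{d(j_{2}|a_{j_{1}},m,l)}{d(j_{2}|a_{j_{2}},m,l)}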
 \label{eq:1.14}
 \end{eqnarray}
 %----------------------------------------------

-\parinterval 这样每次迭代就可以仅在\textrm{S}上进行计数。相比整个词对齐空间，\textrm{S}只是一个非常小的子集，因此运算复杂度可以大大被降低。本质上说，这里定义\textrm{S}是为了用模型2的Viterbi对齐来估计模型3的Viterbi对齐。
-
-\parinterval 对于模型3的参数估计过程，实际上是建立在模型1和模型2的参数估计结果上的。这不仅是因为模型3要利用模型2的Viterbi对齐，而且还因为模型3参数的初值也要直接利用模型2的参数。从这个角度说，模型1，2，3是有序的且向前依赖的。单独的对模型3的参数进行估计是极其困难的。实际上IBM的模型4和模型5也具有这样的性质，即他们都可以利用前一个模型参数估计的结果作为自身参数的初始值。
+\parinterval 相比整个词对齐空间，$S$只是一个非常小的子集，因此运算复杂度可以被大大降低。可以看到，模型3的参数估计过程是建立在模型1和模型2的参数估计结果上的。这不仅是因为模型3要利用模型2的Viterbi对齐，而且还因为模型3参数的初值也要直接利用模型2的参数。从这个角度说，模型1、2、3是有序的且向前依赖的。单独地对模型3的参数进行估计是极其困难的。实际上IBM的模型4和模型5也具有这样的性质，即它们都可以利用前一个模型参数估计的结果作为自身参数的初始值。

 \section{IBM模型4训练方法}

-\parinterval 模型4的参数估计基本与模型3一致。需要修改的是扭曲度的估计公式，如下：
+\parinterval 模型4的参数估计基本与模型3一致。需要修改的是扭曲度的估计公式，对于目标语第$i$个cept.生成的第一个单词，可以得到（假设有$K$个训练样本）：
 \begin{eqnarray}
-c_1(\Delta_i|ca,cb;\mathbf{s},\mathbf{t}) = \sum_{\mathbf{a}}(\textrm{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times s_1(\Delta_i|ca,cb;\mathbf{a},\mathbf{s},\mathbf{t}))
+d_1(\Delta_j|ca,cb;\mathbf{s},\mathbf{t}) = \mu_{1cacb}^{-1} \times \sum_{k=1}^{K}c_1(\Delta_j|ca,cb;\mathbf{s}^{[k]},\mathbf{t}^{[k]})
 \label{eq:1.15}
 \end{eqnarray}
-\begin{small}
-\begin{eqnarray}
-s_1(\Delta_i|ca,cb;\rm{a},\mathbf{s},\mathbf{t}) = \sum_{p=1}^l (\varepsilon(\phi_p) \cdot \delta(\pi_{p1}-\odot _{[p]},\Delta_i) \cdot \delta(A(e_{p-1}),ca) \cdot \delta(B(\tau_{p1}),cb))
-\label{eq:1.16}
-\end{eqnarray}
-\end{small}
-\begin{eqnarray}
-d_1(\Delta_i|ca,cb;\mathbf{s},\mathbf{t}) = \mu_{1cacb}^{-1} \times \sum_{s=1}^{S}c(\Delta_i|ca,cb;\mathbf{s}(s),\mathbf{t}(s))
-\label{eq:1.17}
-\end{eqnarray}
-\begin{eqnarray}
-c_{>1}(\Delta_i|cb;\mathbf{s},\mathbf{t}) = \sum_{\mathbf{a}}(\textrm{p}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times s_{>1}(\Delta_i|cb;\mathbf{a},\mathbf{s},\mathbf{t}))
-\label{eq:1.18}
-\end{eqnarray}
-\begin{eqnarray}
-s_{>1}(\Delta_i|cb;\mathbf{a},\mathbf{s},\mathbf{t}) = \sum_{p=1}^l(\varepsilon(\phi_p-1)\sum_{k=2}^{\phi_p}\delta(p-\pi_{[p]k-1},\Delta_i) \cdot \delta(B(\tau_{[p]k}),cb))
-\label{eq:1.19}
-\end{eqnarray}
+
+其中，
+
 \begin{eqnarray}
-d_{>1}(\Delta_i|cb;\mathbf{s},\mathbf{t}) = \mu_{>1cb}^{-1} \times \sum_{s=1}^{S}c_{>1}(\Delta_i|cb;\mathbf{s}(s),\mathbf{t}(s))
-\label{eq:1.20}
+c_1(\Delta_j|ca,cb;\mathbf{s},\mathbf{t})           & = & \sum_{\mathbf{a}}\big[\textrm{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times s_1(\Delta_j|ca,cb;\mathbf{a},\mathbf{s},\mathbf{t})\big] \label{eq:1.16} \\
+s_1(\Delta_j|ca,cb;\mathbf{a},\mathbf{s},\mathbf{t}) & = & \sum_{i=1}^l \big[\varepsilon(\phi_i) \cdot \delta(\pi_{i1}-\odot _{i},\Delta_j) \cdot \nonumber \\
+                                                                           &     & \delta(A(t_{i-1}),ca) \cdot \delta(B(\tau_{i1}),cb) \big] \label{eq:1.17}
 \end{eqnarray}
-%----------------------------------------------

-\parinterval 其中，
+且
+
 \begin{eqnarray}
 \varepsilon(x) = \begin{cases}
 0 & x \leq 0 \\
@@ -148,63 +123,78 @@ d_{>1}(\Delta_i|cb;\mathbf{s},\mathbf{t}) = \mu_{>1cb}^{-1} \times \sum_{s=1}^{S
 \end{cases}
 \label{eq:1.21}
 \end{eqnarray}
+
+对于目标语第$i$个cept.生成的其他单词（非第一个单词），可以得到：
+
+\begin{eqnarray}
+d_{>1}(\Delta_j|cb;\mathbf{s},\mathbf{t}) = \mu_{>1cb}^{-1} \times \sum_{k=1}^{K}c_{>1}(\Delta_j|cb;\mathbf{s}^{[k]},\mathbf{t}^{[k]})
+\label{eq:1.18}
+\end{eqnarray}
+
+其中，
+
+\begin{eqnarray}
+c_{>1}(\Delta_j|cb;\mathbf{s},\mathbf{t})                  & = & \sum_{\mathbf{a}}\big[\textrm{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times s_{>1}(\Delta_j|cb;\mathbf{a},\mathbf{s},\mathbf{t}) \big] \label{eq:1.19} \\
+s_{>1}(\Delta_j|cb;\mathbf{a},\mathbf{s},\mathbf{t}) & = & \sum_{i=1}^l \big[\varepsilon(\phi_i-1)\sum_{k=2}^{\phi_i}\delta(\pi_{[i]k}-\pi_{[i]k-1},\Delta_j) \cdot \nonumber \\
+                                                                                  &    & \delta(B(\tau_{[i]k}),cb) \big] \label{eq:1.20}
+\end{eqnarray}
+
 %----------------------------------------------

-\parinterval $ca$和$cb$分别表示目标语和源语的某个词类。
+\noindent 这里，$ca$和$cb$分别表示目标语言和源语言的某个词类。模型4需要像模型3一样，通过定义一个词对齐集合$S$，使得每次迭代都在$S$上进行，进而降低运算量。模型4中$S$的定义为：

-\parinterval 模型4需要像模型3一样，通过定义一个词对齐集合\textrm{S}，使得每次迭代都在\textrm{S}上进行，进而降低运算量。模型4中\textrm{S}的定义为，
 \begin{eqnarray}
 \textrm{S} = N(\tilde{b}^{\infty}(V(\mathbf{s}|\mathbf{t};2))) \cup (\mathop{\cup}\limits_{ij} N(\tilde{b}_{i \leftrightarrow j}^{\infty}(V_{i \leftrightarrow j}(\mathbf{s}|\mathbf{t},2))))
 \label{eq:1.22}
 \end{eqnarray}
 %----------------------------------------------

-\parinterval 对于一个对齐$\mathbf{a}$，可用模型3对它的邻居进行排名，即按$\textrm{p}_{\theta}(b(\mathbf{a})|\mathbf{s},\mathbf{t};3)$排序。$\tilde{b}(\mathbf{a})$ \\ 表示这个排名表中满足$\textrm{p}_{\theta}(\mathbf{a}'|\mathbf{s},\mathbf{t};4) > \textrm{P}_{\theta}⁡(\mathbf{a}|\mathbf{s},\mathbf{t};4)$的最高排名的$\mathbf{a}'$。同理可知$\tilde{b}_{i \leftrightarrow j}^{\infty}(\mathbf{a})$ \\ 的意义。这里之所以不用模型3中采用的方法直接利用$b^{\infty}(\mathbf{a})$得到模型4中高概率的对齐，是因为模型4中，要想获得某个对齐$\mathbf{a}$的邻居$\mathbf{a}'$，必须做很大调整，比如：调整$\tau_{[j]1}$和$\odot_{[j]}$等等。这个过程要比模型3的相应过程复杂得多。因此在模型4中只能借助于模型3的中间步骤来进行估计。
+\parinterval 对于一个对齐$\mathbf{a}$，可用模型3对它的邻居进行排名，即按$\textrm{P}_{\theta}(b(\mathbf{a})|\mathbf{s},\mathbf{t};3)$排序，其中$b(\mathbf{a})$表示$\mathbf{a}$的邻居。$\tilde{b}(\mathbf{a})$表示这个排名表中满足$\textrm{P}_{\theta}(\mathbf{a}'|\mathbf{s},\mathbf{t};4) > \textrm{P}_{\theta}(\mathbf{a}|\mathbf{s},\mathbf{t};4)$的最高排名的$\mathbf{a}'$。同理可知$\tilde{b}_{i \leftrightarrow j}^{\infty}(\mathbf{a})$的意义。这里之所以不用模型3中采用的方法直接利用$b^{\infty}(\mathbf{a})$得到模型4中高概率的对齐，是因为模型4中，要想获得某个对齐$\mathbf{a}$的邻居$\mathbf{a}'$，必须做很大调整，比如：调整$\tau_{[i]1}$和$\odot_{i}$等等。这个过程要比模型3的相应过程复杂得多。因此在模型4中只能借助于模型3的中间步骤来进行参数估计。
 \setlength{\belowdisplayskip}{3pt}%调整空白大小
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{IBM模型5训练方法}
-\parinterval 模型5的参数估计过程也与模型3的过程基本一致，二者的区别在于扭曲度的估计公式。在模型5中，
+\parinterval 模型5的参数估计过程也与模型3的过程基本一致，二者的区别在于扭曲度的估计公式。在模型5中，对于目标语第$i$个cept.生成的第一个单词，可以得到（假设有$K$个训练样本）：
+
 \begin{eqnarray}
-c_1(\Delta_i|cb,v1,v2;\mathbf{s},\mathbf{t}) = \sum_{\mathbf{a}}(\textrm{P}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times s_1(\Delta_i|cb,v1,v2;\mathbf{a},\mathbf{s},\mathbf{t}))
+d_1(\Delta_j|cb;\mathbf{s},\mathbf{t}) = \mu_{1cb}^{-1} \times \sum_{k=1}^{K}c_1(\Delta_j|cb;\mathbf{s}^{[k]},\mathbf{t}^{[k]})
 \label{eq:1.23}
 \end{eqnarray}
+
+其中，
+
 \begin{eqnarray}
-s_1(\Delta_i|cb,v1,v2;\rm{a},\mathbf{s},\mathbf{t}) & = & \sum_{p=1}^l (\varepsilon(\phi_p) \cdot \delta(v_{\pi_{p1}},\Delta_i) \cdot \delta(X_{\{p-1\}},v1) \nonumber \\
-& & \cdot \delta(v_m-\phi_p+1,v2) \cdot \delta(v_{\pi_{p1}},v_{\pi_{p1-1}})
-\label{eq:1.24}
-\end{eqnarray}
-\begin{eqnarray}
-d_1(\Delta_i|cb;\mathbf{s},\mathbf{t}) = \mu_{1cb}^{-1} \times \sum_{s=1}^{S}c(\Delta_i|cb;\mathbf{f}(s),\mathbf{e}(s))
-\label{eq:1.25}
+c_1(\Delta_j|cb,v_x,v_y;\mathbf{s},\mathbf{t})                   & = & \sum_{\mathbf{a}}\Big[ \textrm{P}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times s_1(\Delta_j|cb,v_x,v_y;\mathbf{a},\mathbf{s},\mathbf{t}) \Big] \label{eq:1.24} \\
+s_1(\Delta_j|cb,v_x,v_y;\mathbf{a},\mathbf{s},\mathbf{t}) & = & \sum_{i=1}^l \Big [ \varepsilon(\phi_i) \cdot \delta(v_{\pi_{i1}},\Delta_j) \cdot \delta(v_{\odot _{i-1}},v_x) \nonumber \\
+                                                                                          &    & \cdot \delta(v_m-\phi_i+1,v_y) \cdot \delta(v_{\pi_{i1}},v_{\pi_{i1}-1} )\Big] \label{eq:1.25}
 \end{eqnarray}
+
+
+对于目标语第$i$个cept.生成的其他单词（非第一个单词），可以得到：
+
 \begin{eqnarray}
-c_{>1}(\Delta_i|cb,v;\mathbf{s},\mathbf{t}) = \sum_{\mathbf{a}}(\textrm{p}(\mathbf{f},\mathbf{s}|\mathbf{t}) \times s_{>1}(\Delta_i|cb,v;\mathbf{a},\mathbf{s},\mathbf{t}))
+d_{>1}(\Delta_j|cb,v;\mathbf{s},\mathbf{t}) = \mu_{>1cb}^{-1} \times \sum_{k=1}^{K}c_{>1}(\Delta_j|cb,v;\mathbf{s}^{[k]},\mathbf{t}^{[k]})
 \label{eq:1.26}
 \end{eqnarray}
-%\begin{small}
-\begin{eqnarray}
-s_{>1}(\Delta_i|cb,v;\mathbf{a},\mathbf{s},\mathbf{t}) & = & \sum_{p=1}^l(\varepsilon(\phi_p-1)\sum_{k=2}^{\phi_p}(\delta(v_{\pi_{pk}}-V_{\pi_{[p]k-1}},\Delta_i)  \nonumber \\
-& & \cdot \delta(B(\tau_{[p]k}) ,cb) \cdot \delta(vm-v_{\pi_{p(k-1)}}-\phi_p+k,v) \nonumber \\
-& & \cdot \delta(v_{\pi_{p1}},v_{\pi_{p1-1}})))
-\label{eq:1.27}
-\end{eqnarray}
-%\end{small}
+
+其中，
+
 \begin{eqnarray}
-d_{>1}(\Delta_i|cb,v;\mathbf{s},\mathbf{t}) = \mu_{>1cb}^{-1} \times \sum_{s=1}^{S}c_{>1}(\Delta_i|cb,v;\mathbf{f}(s),\mathbf{e}(s))
-\label{eq:1.28}
+c_{>1}(\Delta_j|cb,v;\mathbf{s},\mathbf{t})                   & =  & \sum_{\mathbf{a}}\Big[\textrm{P}(\mathbf{a},\mathbf{s}|\mathbf{t}) \times s_{>1}(\Delta_j|cb,v;\mathbf{a},\mathbf{s},\mathbf{t}) \Big] \label{eq:1.27} \\
+s_{>1}(\Delta_j|cb,v;\mathbf{a},\mathbf{s},\mathbf{t}) & = & \sum_{i=1}^l\Big[\varepsilon(\phi_i-1)\sum_{k=2}^{\phi_i} \big[\delta(v_{\pi_{ik}}-v_{\pi_{[i]k}-1},\Delta_j)  \nonumber \\
+                                                                                    &     & \cdot \delta(B(\tau_{[i]k}) ,cb) \cdot \delta(v_m-v_{\pi_{i(k-1)}}-\phi_i+k,v) \nonumber \\
+                                                                                    &     & \cdot \delta(v_{\pi_{i1}},v_{\pi_{i1}-1}) \big] \Big] \label{eq:1.28}
 \end{eqnarray}
+
 %----------------------------------------------
 \vspace{0.5em}

-\parinterval 这里$X_{\{p-1\}}$表示在位置小于$p$的非空对的目标语单词对应的源语单词的平均置位。
-
-\parinterval 从式(\ref{eq:1.24})中可以看出因子$\delta(v_{\pi_{p1}},v_{\pi_{p1-1}})$保证了，即使对齐$\mathbf{a}$不合理（一个源语位置对应多个目标语位置）也可以避免在这个不合理的对齐上计算结果。需要注意的是因子$\delta(v_{\pi_{p1}},v_{\pi_{p1-1}})$，只能保证$\mathbf{a}$中不合理的部分不产生坏的影响，而$\mathbf{a}$中其它正确的部分仍会参与迭代。
+\parinterval 从式\ref{eq:1.25}中可以看出因子$\delta(v_{\pi_{i1}},v_{\pi_{i1}-1})$保证了，即使对齐$\mathbf{a}$不合理（一个源语位置对应多个目标语位置）也可以避免在这个不合理的对齐上计算结果。需要注意的是，该因子只确保$\mathbf{a}$中不合理的部分不产生坏的影响，而$\mathbf{a}$中其他正确的部分仍会参与迭代。

-\parinterval 不过上面的参数估计过程与前面4个模型中参数估计过程并不完全一样。前面四个模型在每次迭代中，可以在给定$\mathbf{s}$、$\mathbf{t}$和一个对齐$\mathbf{a}$的情况下直接计算并更新参数。但是在模型5的参数估计过程中，如公式(\ref{eq:1.24})中，需要模拟出由$\mathbf{t}$生成$\mathbf{s}$的过程才能得到正确的结果，因为从$\mathbf{t}$、$\mathbf{s}$和$\mathbf{a}$中是不能直接得到 的正确结果的。具体说，就是要从目标语句子的第一个单词开始到最后一个单词结束，依次生成每个目标语单词对应的源语单词，每处理完一个目标语单词就要暂停，然后才能计算式(\ref{eq:1.24})中求和符号里面的内容。这也就是说即使给定了$\mathbf{s}$、$\mathbf{t}$和一个对齐$\mathbf{a}$，也不能直接在它们上计算，必须重新模拟$\mathbf{t}$到$\mathbf{s}$的生成过程。
+\parinterval 不过上面的参数估计过程与IBM前4个模型的参数估计过程并不完全一样。IBM前4个模型在每次迭代中，可以在给定$\mathbf{s}$、$\mathbf{t}$和一个对齐$\mathbf{a}$的情况下直接计算并更新参数。但是在模型5的参数估计过程中（如公式\ref{eq:1.25}），需要模拟出由$\mathbf{t}$生成$\mathbf{s}$的过程才能得到正确的结果，因为从$\mathbf{t}$、$\mathbf{s}$和$\mathbf{a}$中是不能直接得到 的正确结果的。具体说，就是要从目标语言句子的第一个单词开始到最后一个单词结束，依次生成每个目标语言单词对应的源语言单词，每处理完一个目标语言单词就要暂停，然后才能计算式\ref{eq:1.25}中求和符号里面的内容。这也就是说即使给定了$\mathbf{s}$、$\mathbf{t}$和一个对齐$\mathbf{a}$，也不能直接在它们上进行计算，必须重新模拟$\mathbf{t}$到$\mathbf{s}$的生成过程。

-\parinterval 从前面的分析可以看出，虽然模型5比模型4更精确，但是模型5过于复杂以至于给参数估计增加了巨大的计算量（对于每组$\mathbf{t}$、$\mathbf{s}$和$\mathbf{a}$都要模拟$\mathbf{t}$生成$\mathbf{s}$的翻译过程，时间复杂度成指数增加）。因此模型5并不具有很强的实际意义。
+\parinterval 从前面的分析可以看出，虽然模型5比模型4更精确，但是模型5过于复杂以至于给参数估计增加了计算量（对于每组$\mathbf{t}$、$\mathbf{s}$和$\mathbf{a}$都要模拟$\mathbf{t}$生成$\mathbf{s}$的翻译过程）。因此模型5的开发对于系统实现是一个挑战。

-\parinterval 在模型5中同样需要定义一个词对齐集合S，使得每次迭代都在\textrm{S}上进行。这里对\textrm{S}进行如下定义
+\parinterval 在模型5中同样需要定义一个词对齐集合$S$，使得每次迭代都在$S$上进行。可以对$S$进行如下定义
 \begin{eqnarray}
 \textrm{S} = N(\tilde{\tilde{b}}^{\infty}(V(\mathbf{s}|\mathbf{t};2))) \cup (\mathop{\cup}\limits_{ij} N(\tilde{\tilde{b}}_{i \leftrightarrow j}^{\infty}(V_{i \leftrightarrow j}(\mathbf{s}|\mathbf{t},2))))
 \label{eq:1.29}
@@ -212,7 +202,7 @@ d_{>1}(\Delta_i|cb,v;\mathbf{s},\mathbf{t}) = \mu_{>1cb}^{-1} \times \sum_{s=1}^
 \vspace{0.5em}

 %----------------------------------------------
-\parinterval 这里$\tilde{\tilde{b}}(\mathbf{a})$借用了模型4中$\tilde{b}(\mathbf{a})$的概念。不过$\tilde{\tilde{b}}(\mathbf{a})$表示在利用模型3进行排名的列表中满足$\textrm{p}_{\theta}(\mathbf{a}'|\mathbf{s},\mathbf{t};5)$的最高排名的词对齐。
+\parinterval 这里$\tilde{\tilde{b}}(\mathbf{a})$借用了模型4中$\tilde{b}(\mathbf{a})$的概念。不过$\tilde{\tilde{b}}(\mathbf{a})$表示在利用模型3进行排名的列表中满足$\textrm{P}_{\theta}(\mathbf{a}'|\mathbf{s},\mathbf{t};5) > \textrm{P}_{\theta}(\mathbf{a}|\mathbf{s},\mathbf{t};5)$的最高排名的词对齐。
 \end{appendices}



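Note on the ChapterAppend.tex change above: the restricted-count idea in the revised Model 3 section (eq:1.11 followed by the normalisation in eq:1.7) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the book's or any toolkit's implementation. The function names (`accumulate_counts`, `estimate_t`), the corpus layout, and the convention that `target[0]` is the empty word are all hypothetical. Each sentence pair is assumed to come with a small alignment set S given as `(a, p)` pairs, where `a[j]` is the target position aligned to source position `j` and `p` stands for P(s,a|t).

```python
from collections import defaultdict


def accumulate_counts(source, target, alignments):
    """Approximate c(s|t; s, t) using only the alignments in S (cf. eq:1.11).

    source     -- list of source words s_1..s_m
    target     -- list of target words; target[0] is assumed to be the empty word
    alignments -- iterable of (a, p): a[j] is the target index for source
                  position j, p is the weight P(s,a|t) (assumed precomputed)
    """
    counts = defaultdict(float)
    for a, p in alignments:
        for j, i in enumerate(a):
            # Each link (s_j, t_{a_j}) contributes the weight of its alignment.
            counts[(source[j], target[i])] += p
    return counts


def estimate_t(corpus):
    """Sum the counts over all K sentence pairs and normalise per target word (cf. eq:1.7)."""
    total = defaultdict(float)
    for source, target, alignments in corpus:
        for key, c in accumulate_counts(source, target, alignments).items():
            total[key] += c
    # lambda_t is the normaliser that makes sum_s t(s|t) = 1.
    lam = defaultdict(float)
    for (s, t), c in total.items():
        lam[t] += c
    return {(s, t): c / lam[t] for (s, t), c in total.items()}


# Toy usage: one sentence pair and two alignments in S.
corpus = [(["a", "b"], ["NULL", "x", "y"], [([1, 2], 0.6), ([2, 1], 0.2)])]
t_table = estimate_t(corpus)
```

Because the sum runs over S rather than all $(l+1)^m$ alignments, the cost per sentence pair is proportional to |S| times m, which is the point of the approximation.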
--- a/Book/mt-book-xelatex.bbl
+++ b/Book/mt-book-xelatex.bbl
--- a/Book/mt-book-xelatex.idx
+++ b/Book/mt-book-xelatex.idx
-\indexentry{Chapter1.1|hyperpage}{9}
-\indexentry{Chapter1.2|hyperpage}{12}
-\indexentry{Chapter1.3|hyperpage}{17}
-\indexentry{Chapter1.4|hyperpage}{18}
-\indexentry{Chapter1.4.1|hyperpage}{18}
-\indexentry{Chapter1.4.2|hyperpage}{20}
-\indexentry{Chapter1.4.3|hyperpage}{21}
-\indexentry{Chapter1.4.4|hyperpage}{22}
-\indexentry{Chapter1.4.5|hyperpage}{23}
-\indexentry{Chapter1.5|hyperpage}{24}
-\indexentry{Chapter1.5.1|hyperpage}{24}
-\indexentry{Chapter1.5.2|hyperpage}{25}
-\indexentry{Chapter1.5.2.1|hyperpage}{25}
-\indexentry{Chapter1.5.2.2|hyperpage}{27}
-\indexentry{Chapter1.5.2.3|hyperpage}{27}
-\indexentry{Chapter1.6|hyperpage}{28}
-\indexentry{Chapter1.7|hyperpage}{30}
-\indexentry{Chapter1.7.1|hyperpage}{30}
-\indexentry{Chapter1.7.1.1|hyperpage}{31}
-\indexentry{Chapter1.7.1.2|hyperpage}{32}
-\indexentry{Chapter1.7.2|hyperpage}{34}
-\indexentry{Chapter1.8|hyperpage}{36}
-\indexentry{Chapter2.1|hyperpage}{42}
-\indexentry{Chapter2.2|hyperpage}{43}
-\indexentry{Chapter2.2.1|hyperpage}{43}
-\indexentry{Chapter2.2.2|hyperpage}{45}
-\indexentry{Chapter2.2.3|hyperpage}{46}
-\indexentry{Chapter2.2.4|hyperpage}{47}
-\indexentry{Chapter2.2.5|hyperpage}{49}
-\indexentry{Chapter2.2.5.1|hyperpage}{49}
-\indexentry{Chapter2.2.5.2|hyperpage}{50}
-\indexentry{Chapter2.2.5.3|hyperpage}{50}
-\indexentry{Chapter2.3|hyperpage}{51}
-\indexentry{Chapter2.3.1|hyperpage}{52}
-\indexentry{Chapter2.3.2|hyperpage}{53}
-\indexentry{Chapter2.3.2.1|hyperpage}{53}
-\indexentry{Chapter2.3.2.2|hyperpage}{54}
-\indexentry{Chapter2.3.2.3|hyperpage}{56}
-\indexentry{Chapter2.4|hyperpage}{58}
-\indexentry{Chapter2.4.1|hyperpage}{59}
-\indexentry{Chapter2.4.2|hyperpage}{61}
-\indexentry{Chapter2.4.2.1|hyperpage}{62}
-\indexentry{Chapter2.4.2.2|hyperpage}{63}
-\indexentry{Chapter2.4.2.3|hyperpage}{64}
-\indexentry{Chapter2.5|hyperpage}{66}
-\indexentry{Chapter2.5.1|hyperpage}{66}
-\indexentry{Chapter2.5.2|hyperpage}{68}
-\indexentry{Chapter2.5.3|hyperpage}{72}
-\indexentry{Chapter2.6|hyperpage}{74}
-\indexentry{Chapter3.1|hyperpage}{79}
-\indexentry{Chapter3.2|hyperpage}{81}
-\indexentry{Chapter3.2.1|hyperpage}{81}
-\indexentry{Chapter3.2.1.1|hyperpage}{81}
-\indexentry{Chapter3.2.1.2|hyperpage}{82}
-\indexentry{Chapter3.2.1.3|hyperpage}{83}
-\indexentry{Chapter3.2.2|hyperpage}{83}
-\indexentry{Chapter3.2.3|hyperpage}{84}
-\indexentry{Chapter3.2.3.1|hyperpage}{84}
-\indexentry{Chapter3.2.3.2|hyperpage}{85}
-\indexentry{Chapter3.2.3.3|hyperpage}{86}
-\indexentry{Chapter3.2.4|hyperpage}{87}
-\indexentry{Chapter3.2.4.1|hyperpage}{87}
-\indexentry{Chapter3.2.4.2|hyperpage}{89}
-\indexentry{Chapter3.2.5|hyperpage}{90}
-\indexentry{Chapter3.3|hyperpage}{93}
-\indexentry{Chapter3.3.1|hyperpage}{93}
-\indexentry{Chapter3.3.2|hyperpage}{96}
-\indexentry{Chapter3.3.2.1|hyperpage}{97}
-\indexentry{Chapter3.3.2.2|hyperpage}{98}
-\indexentry{Chapter3.3.2.3|hyperpage}{99}
-\indexentry{Chapter3.4|hyperpage}{100}
-\indexentry{Chapter3.4.1|hyperpage}{100}
-\indexentry{Chapter3.4.2|hyperpage}{102}
-\indexentry{Chapter3.4.3|hyperpage}{103}
-\indexentry{Chapter3.4.4|hyperpage}{104}
-\indexentry{Chapter3.4.4.1|hyperpage}{104}
-\indexentry{Chapter3.4.4.2|hyperpage}{105}
-\indexentry{Chapter3.5|hyperpage}{110}
-\indexentry{Chapter3.5.1|hyperpage}{111}
-\indexentry{Chapter3.5.2|hyperpage}{113}
-\indexentry{Chapter3.5.3|hyperpage}{115}
-\indexentry{Chapter3.5.4|hyperpage}{116}
-\indexentry{Chapter3.5.5|hyperpage}{118}
-\indexentry{Chapter3.5.5|hyperpage}{120}
-\indexentry{Chapter3.6|hyperpage}{121}
-\indexentry{Chapter3.6.1|hyperpage}{121}
-\indexentry{Chapter3.6.2|hyperpage}{122}
-\indexentry{Chapter3.6.4|hyperpage}{123}
-\indexentry{Chapter3.6.5|hyperpage}{123}
-\indexentry{Chapter3.7|hyperpage}{123}
-\indexentry{Chapter4.1|hyperpage}{125}
-\indexentry{Chapter4.1.1|hyperpage}{127}
-\indexentry{Chapter4.1.2|hyperpage}{128}
-\indexentry{Chapter4.2|hyperpage}{130}
-\indexentry{Chapter4.2.1|hyperpage}{130}
-\indexentry{Chapter4.2.2|hyperpage}{133}
-\indexentry{Chapter4.2.2.1|hyperpage}{133}
-\indexentry{Chapter4.2.2.2|hyperpage}{134}
-\indexentry{Chapter4.2.2.3|hyperpage}{135}
-\indexentry{Chapter4.2.3|hyperpage}{136}
-\indexentry{Chapter4.2.3.1|hyperpage}{136}
-\indexentry{Chapter4.2.3.2|hyperpage}{137}
-\indexentry{Chapter4.2.3.3|hyperpage}{138}
-\indexentry{Chapter4.2.4|hyperpage}{140}
-\indexentry{Chapter4.2.4.1|hyperpage}{140}
-\indexentry{Chapter4.2.4.2|hyperpage}{141}
-\indexentry{Chapter4.2.4.3|hyperpage}{142}
-\indexentry{Chapter4.2.5|hyperpage}{143}
-\indexentry{Chapter4.2.6|hyperpage}{143}
-\indexentry{Chapter4.2.7|hyperpage}{147}
-\indexentry{Chapter4.2.7.1|hyperpage}{148}
-\indexentry{Chapter4.2.7.2|hyperpage}{148}
-\indexentry{Chapter4.2.7.3|hyperpage}{149}
-\indexentry{Chapter4.2.7.4|hyperpage}{150}
-\indexentry{Chapter4.3|hyperpage}{151}
-\indexentry{Chapter4.3.1|hyperpage}{154}
-\indexentry{Chapter4.3.1.1|hyperpage}{155}
-\indexentry{Chapter4.3.1.2|hyperpage}{156}
-\indexentry{Chapter4.3.1.3|hyperpage}{157}
-\indexentry{Chapter4.3.1.4|hyperpage}{158}
-\indexentry{Chapter4.3.2|hyperpage}{158}
-\indexentry{Chapter4.3.3|hyperpage}{160}
-\indexentry{Chapter4.3.4|hyperpage}{161}
-\indexentry{Chapter4.3.5|hyperpage}{164}
-\indexentry{Chapter4.4|hyperpage}{166}
-\indexentry{Chapter4.4.1|hyperpage}{169}
-\indexentry{Chapter4.4.2|hyperpage}{171}
-\indexentry{Chapter4.4.2.1|hyperpage}{172}
-\indexentry{Chapter4.4.2.2|hyperpage}{173}
-\indexentry{Chapter4.4.2.3|hyperpage}{175}
-\indexentry{Chapter4.4.3|hyperpage}{176}
-\indexentry{Chapter4.4.3.1|hyperpage}{177}
-\indexentry{Chapter4.4.3.2|hyperpage}{180}
-\indexentry{Chapter4.4.3.3|hyperpage}{181}
-\indexentry{Chapter4.4.3.4|hyperpage}{183}
-\indexentry{Chapter4.4.3.5|hyperpage}{184}
-\indexentry{Chapter4.4.4|hyperpage}{185}
-\indexentry{Chapter4.4.4.1|hyperpage}{186}
-\indexentry{Chapter4.4.4.2|hyperpage}{187}
-\indexentry{Chapter4.4.5|hyperpage}{187}
-\indexentry{Chapter4.4.5|hyperpage}{189}
-\indexentry{Chapter4.4.7|hyperpage}{193}
-\indexentry{Chapter4.4.7.1|hyperpage}{194}
-\indexentry{Chapter4.4.7.2|hyperpage}{194}
-\indexentry{Chapter4.5|hyperpage}{196}
-\indexentry{Chapter5.1|hyperpage}{202}
-\indexentry{Chapter5.1.1|hyperpage}{202}
-\indexentry{Chapter5.1.1.1|hyperpage}{202}
-\indexentry{Chapter5.1.1.2|hyperpage}{203}
-\indexentry{Chapter5.1.1.3|hyperpage}{204}
-\indexentry{Chapter5.1.2|hyperpage}{205}
-\indexentry{Chapter5.1.2.1|hyperpage}{205}
-\indexentry{Chapter5.1.2.2|hyperpage}{206}
-\indexentry{Chapter5.2|hyperpage}{206}
-\indexentry{Chapter5.2.1|hyperpage}{206}
-\indexentry{Chapter5.2.1.1|hyperpage}{207}
-\indexentry{Chapter5.2.1.2|hyperpage}{208}
-\indexentry{Chapter5.2.1.3|hyperpage}{208}
-\indexentry{Chapter5.2.1.4|hyperpage}{209}
-\indexentry{Chapter5.2.1.5|hyperpage}{210}
-\indexentry{Chapter5.2.1.6|hyperpage}{211}
-\indexentry{Chapter5.2.2|hyperpage}{212}
-\indexentry{Chapter5.2.2.1|hyperpage}{212}
-\indexentry{Chapter5.2.2.2|hyperpage}{214}
-\indexentry{Chapter5.2.2.3|hyperpage}{214}
-\indexentry{Chapter5.2.2.4|hyperpage}{215}
-\indexentry{Chapter5.2.3|hyperpage}{216}
-\indexentry{Chapter5.2.3.1|hyperpage}{216}
-\indexentry{Chapter5.2.3.2|hyperpage}{218}
-\indexentry{Chapter5.2.4|hyperpage}{218}
-\indexentry{Chapter5.3|hyperpage}{224}
-\indexentry{Chapter5.3.1|hyperpage}{224}
-\indexentry{Chapter5.3.1.1|hyperpage}{224}
-\indexentry{Chapter5.3.1.2|hyperpage}{226}
-\indexentry{Chapter5.3.1.3|hyperpage}{227}
-\indexentry{Chapter5.3.2|hyperpage}{228}
-\indexentry{Chapter5.3.3|hyperpage}{229}
-\indexentry{Chapter5.3.4|hyperpage}{233}
-\indexentry{Chapter5.3.5|hyperpage}{234}
-\indexentry{Chapter5.4|hyperpage}{235}
-\indexentry{Chapter5.4.1|hyperpage}{236}
-\indexentry{Chapter5.4.2|hyperpage}{237}
-\indexentry{Chapter5.4.2.1|hyperpage}{238}
-\indexentry{Chapter5.4.2.2|hyperpage}{240}
-\indexentry{Chapter5.4.2.3|hyperpage}{242}
-\indexentry{Chapter5.4.3|hyperpage}{245}
-\indexentry{Chapter5.4.4|hyperpage}{247}
-\indexentry{Chapter5.4.4.1|hyperpage}{247}
-\indexentry{Chapter5.4.4.2|hyperpage}{248}
-\indexentry{Chapter5.4.4.3|hyperpage}{248}
-\indexentry{Chapter5.4.5|hyperpage}{250}
-\indexentry{Chapter5.4.6|hyperpage}{251}
-\indexentry{Chapter5.4.6.1|hyperpage}{252}
-\indexentry{Chapter5.4.6.2|hyperpage}{254}
-\indexentry{Chapter5.4.6.3|hyperpage}{255}
-\indexentry{Chapter5.5|hyperpage}{257}
-\indexentry{Chapter5.5.1|hyperpage}{257}
-\indexentry{Chapter5.5.1.1|hyperpage}{258}
-\indexentry{Chapter5.5.1.2|hyperpage}{260}
-\indexentry{Chapter5.5.1.3|hyperpage}{261}
-\indexentry{Chapter5.5.1.4|hyperpage}{262}
-\indexentry{Chapter5.5.2|hyperpage}{263}
-\indexentry{Chapter5.5.2.1|hyperpage}{263}
-\indexentry{Chapter5.5.2.2|hyperpage}{263}
-\indexentry{Chapter5.5.3|hyperpage}{265}
-\indexentry{Chapter5.5.3.1|hyperpage}{265}
-\indexentry{Chapter5.5.3.2|hyperpage}{267}
-\indexentry{Chapter5.5.3.3|hyperpage}{267}
-\indexentry{Chapter5.5.3.4|hyperpage}{268}
-\indexentry{Chapter5.5.3.5|hyperpage}{269}
-\indexentry{Chapter5.6|hyperpage}{269}
-\indexentry{Chapter6.1|hyperpage}{271}
-\indexentry{Chapter6.1.1|hyperpage}{273}
-\indexentry{Chapter6.1.2|hyperpage}{275}
-\indexentry{Chapter6.1.3|hyperpage}{278}
-\indexentry{Chapter6.2|hyperpage}{280}
-\indexentry{Chapter6.2.1|hyperpage}{280}
-\indexentry{Chapter6.2.2|hyperpage}{281}
-\indexentry{Chapter6.2.3|hyperpage}{282}
-\indexentry{Chapter6.2.4|hyperpage}{283}
-\indexentry{Chapter6.3|hyperpage}{284}
-\indexentry{Chapter6.3.1|hyperpage}{286}
-\indexentry{Chapter6.3.2|hyperpage}{288}
-\indexentry{Chapter6.3.3|hyperpage}{292}
-\indexentry{Chapter6.3.3.1|hyperpage}{292}
-\indexentry{Chapter6.3.3.2|hyperpage}{292}
-\indexentry{Chapter6.3.3.3|hyperpage}{294}
-\indexentry{Chapter6.3.3.4|hyperpage}{295}
-\indexentry{Chapter6.3.3.5|hyperpage}{297}
-\indexentry{Chapter6.3.4|hyperpage}{297}
-\indexentry{Chapter6.3.4.1|hyperpage}{298}
-\indexentry{Chapter6.3.4.2|hyperpage}{299}
-\indexentry{Chapter6.3.4.3|hyperpage}{302}
-\indexentry{Chapter6.3.5|hyperpage}{304}
-\indexentry{Chapter6.3.5.1|hyperpage}{305}
-\indexentry{Chapter6.3.5.2|hyperpage}{305}
-\indexentry{Chapter6.3.5.3|hyperpage}{306}
-\indexentry{Chapter6.3.5.4|hyperpage}{306}
-\indexentry{Chapter6.3.5.5|hyperpage}{307}
-\indexentry{Chapter6.3.5.5|hyperpage}{308}
-\indexentry{Chapter6.3.6|hyperpage}{309}
-\indexentry{Chapter6.3.6.1|hyperpage}{311}
-\indexentry{Chapter6.3.6.2|hyperpage}{312}
-\indexentry{Chapter6.3.6.3|hyperpage}{313}
-\indexentry{Chapter6.3.7|hyperpage}{314}
-\indexentry{Chapter6.4|hyperpage}{316}
-\indexentry{Chapter6.4.1|hyperpage}{317}
-\indexentry{Chapter6.4.2|hyperpage}{318}
-\indexentry{Chapter6.4.3|hyperpage}{320}
-\indexentry{Chapter6.4.4|hyperpage}{322}
-\indexentry{Chapter6.4.5|hyperpage}{324}
-\indexentry{Chapter6.4.6|hyperpage}{326}
-\indexentry{Chapter6.4.7|hyperpage}{327}
-\indexentry{Chapter6.4.8|hyperpage}{328}
-\indexentry{Chapter6.4.9|hyperpage}{329}
-\indexentry{Chapter6.4.10|hyperpage}{332}
-\indexentry{Chapter6.5|hyperpage}{332}
-\indexentry{Chapter6.5.1|hyperpage}{333}
-\indexentry{Chapter6.5.2|hyperpage}{333}
-\indexentry{Chapter6.5.3|hyperpage}{333}
-\indexentry{Chapter6.5.4|hyperpage}{335}
-\indexentry{Chapter6.5.5|hyperpage}{335}
-\indexentry{Chapter6.6|hyperpage}{335}
+\indexentry{Chapter5.1|hyperpage}{10}
+\indexentry{Chapter5.1.1|hyperpage}{10}
+\indexentry{Chapter5.1.1.1|hyperpage}{10}
+\indexentry{Chapter5.1.1.2|hyperpage}{11}
+\indexentry{Chapter5.1.1.3|hyperpage}{12}
+\indexentry{Chapter5.1.2|hyperpage}{13}
+\indexentry{Chapter5.1.2.1|hyperpage}{13}
+\indexentry{Chapter5.1.2.2|hyperpage}{14}
+\indexentry{Chapter5.2|hyperpage}{14}
+\indexentry{Chapter5.2.1|hyperpage}{14}
+\indexentry{Chapter5.2.1.1|hyperpage}{15}
+\indexentry{Chapter5.2.1.2|hyperpage}{16}
+\indexentry{Chapter5.2.1.3|hyperpage}{16}
+\indexentry{Chapter5.2.1.4|hyperpage}{17}
+\indexentry{Chapter5.2.1.5|hyperpage}{18}
+\indexentry{Chapter5.2.1.6|hyperpage}{19}
+\indexentry{Chapter5.2.2|hyperpage}{20}
+\indexentry{Chapter5.2.2.1|hyperpage}{20}
+\indexentry{Chapter5.2.2.2|hyperpage}{22}
+\indexentry{Chapter5.2.2.3|hyperpage}{22}
+\indexentry{Chapter5.2.2.4|hyperpage}{23}
+\indexentry{Chapter5.2.3|hyperpage}{24}
+\indexentry{Chapter5.2.3.1|hyperpage}{24}
+\indexentry{Chapter5.2.3.2|hyperpage}{26}
+\indexentry{Chapter5.2.4|hyperpage}{26}
+\indexentry{Chapter5.3|hyperpage}{31}
+\indexentry{Chapter5.3.1|hyperpage}{32}
+\indexentry{Chapter5.3.1.1|hyperpage}{32}
+\indexentry{Chapter5.3.1.2|hyperpage}{34}
+\indexentry{Chapter5.3.1.3|hyperpage}{35}
+\indexentry{Chapter5.3.2|hyperpage}{36}
+\indexentry{Chapter5.3.3|hyperpage}{36}
+\indexentry{Chapter5.3.4|hyperpage}{40}
+\indexentry{Chapter5.3.5|hyperpage}{41}
+\indexentry{Chapter5.4|hyperpage}{42}
+\indexentry{Chapter5.4.1|hyperpage}{43}
+\indexentry{Chapter5.4.2|hyperpage}{44}
+\indexentry{Chapter5.4.2.1|hyperpage}{45}
+\indexentry{Chapter5.4.2.2|hyperpage}{47}
+\indexentry{Chapter5.4.2.3|hyperpage}{49}
+\indexentry{Chapter5.4.3|hyperpage}{52}
+\indexentry{Chapter5.4.4|hyperpage}{54}
+\indexentry{Chapter5.4.4.1|hyperpage}{54}
+\indexentry{Chapter5.4.4.2|hyperpage}{55}
+\indexentry{Chapter5.4.4.3|hyperpage}{56}
+\indexentry{Chapter5.4.5|hyperpage}{57}
+\indexentry{Chapter5.4.6|hyperpage}{58}
+\indexentry{Chapter5.4.6.1|hyperpage}{59}
+\indexentry{Chapter5.4.6.2|hyperpage}{61}
+\indexentry{Chapter5.4.6.3|hyperpage}{62}
+\indexentry{Chapter5.5|hyperpage}{63}
+\indexentry{Chapter5.5.1|hyperpage}{64}
+\indexentry{Chapter5.5.1.1|hyperpage}{65}
+\indexentry{Chapter5.5.1.2|hyperpage}{67}
+\indexentry{Chapter5.5.1.3|hyperpage}{68}
+\indexentry{Chapter5.5.1.4|hyperpage}{69}
+\indexentry{Chapter5.5.2|hyperpage}{70}
+\indexentry{Chapter5.5.2.1|hyperpage}{70}
+\indexentry{Chapter5.5.2.2|hyperpage}{70}
+\indexentry{Chapter5.5.3|hyperpage}{72}
+\indexentry{Chapter5.5.3.1|hyperpage}{72}
+\indexentry{Chapter5.5.3.2|hyperpage}{74}
+\indexentry{Chapter5.5.3.3|hyperpage}{75}
+\indexentry{Chapter5.5.3.4|hyperpage}{75}
+\indexentry{Chapter5.5.3.5|hyperpage}{76}
+\indexentry{Chapter5.6|hyperpage}{77}
--- a/Book/mt-book-xelatex.ptc
+++ b/Book/mt-book-xelatex.ptc
@@ -2,578 +2,140 @@
 \defcounter {refsection}{0}\relax 
 \select@language {english}
 \defcounter {refsection}{0}\relax 
-\contentsline {part}{\@mypartnumtocformat {I}{机器翻译基础}}{7}{part.1}
+\contentsline {part}{\@mypartnumtocformat {I}{神经机器翻译}}{7}{part.1}
 \ttl@starttoc {default@1}
 \defcounter {refsection}{0}\relax 
-\contentsline {chapter}{\numberline {1}机器翻译简介}{9}{chapter.1}
+\contentsline {chapter}{\numberline {1}人工神经网络和神经语言建模}{9}{chapter.1}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {1.1}机器翻译的概念}{9}{section.1.1}
+\contentsline {section}{\numberline {1.1}深度学习与人工神经网络}{10}{section.1.1}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {1.2}机器翻译简史}{12}{section.1.2}
+\contentsline {subsection}{\numberline {1.1.1}发展简史}{10}{subsection.1.1.1}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.2.1}人工翻译}{12}{subsection.1.2.1}
+\contentsline {subsubsection}{早期的人工神经网络和第一次寒冬}{10}{section*.2}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.2.2}机器翻译的萌芽}{13}{subsection.1.2.2}
+\contentsline {subsubsection}{神经网络的第二次高潮和第二次寒冬}{11}{section*.3}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.2.3}机器翻译的受挫}{14}{subsection.1.2.3}
+\contentsline {subsubsection}{深度学习和神经网络方法的崛起}{12}{section*.4}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.2.4}机器翻译的快速成长}{15}{subsection.1.2.4}
+\contentsline {subsection}{\numberline {1.1.2}为什么需要深度学习}{13}{subsection.1.1.2}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.2.5}机器翻译的爆发}{16}{subsection.1.2.5}
+\contentsline {subsubsection}{端到端学习和表示学习}{13}{section*.6}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {1.3}机器翻译现状}{17}{section.1.3}
+\contentsline {subsubsection}{深度学习的效果}{14}{section*.8}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {1.4}机器翻译方法}{18}{section.1.4}
+\contentsline {section}{\numberline {1.2}神经网络基础}{14}{section.1.2}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.4.1}基于规则的机器翻译}{18}{subsection.1.4.1}
+\contentsline {subsection}{\numberline {1.2.1}线性代数基础}{14}{subsection.1.2.1}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.4.2}基于实例的机器翻译}{20}{subsection.1.4.2}
+\contentsline {subsubsection}{标量、向量和矩阵}{15}{section*.10}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.4.3}统计机器翻译}{21}{subsection.1.4.3}
+\contentsline {subsubsection}{矩阵的转置}{16}{section*.11}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.4.4}神经机器翻译}{22}{subsection.1.4.4}
+\contentsline {subsubsection}{矩阵加法和数乘}{16}{section*.12}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.4.5}对比分析}{23}{subsection.1.4.5}
+\contentsline {subsubsection}{矩阵乘法和矩阵点乘}{17}{section*.13}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {1.5}翻译质量评价}{24}{section.1.5}
+\contentsline {subsubsection}{线性映射}{18}{section*.14}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.5.1}人工评价}{24}{subsection.1.5.1}
+\contentsline {subsubsection}{范数}{19}{section*.15}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.5.2}自动评价}{25}{subsection.1.5.2}
+\contentsline {subsection}{\numberline {1.2.2}人工神经元和感知机}{20}{subsection.1.2.2}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{BLEU}{25}{section*.15}
+\contentsline {subsubsection}{（一）感知机\ \raisebox {0.5mm}{------}\ 最简单的人工神经元模型}{20}{section*.18}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{TER}{27}{section*.16}
+\contentsline {subsubsection}{（二）神经元内部权重}{22}{section*.21}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于检测点的评价}{27}{section*.17}
+\contentsline {subsubsection}{（三）神经元的输入\ \raisebox {0.5mm}{------}\ 离散 vs 连续}{22}{section*.23}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {1.6}机器翻译应用}{28}{section.1.6}
+\contentsline {subsubsection}{（四）神经元内部的参数学习}{23}{section*.25}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {1.7}开源项目与评测}{30}{section.1.7}
+\contentsline {subsection}{\numberline {1.2.3}多层神经网络}{24}{subsection.1.2.3}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.7.1}开源机器翻译系统}{30}{subsection.1.7.1}
+\contentsline {subsubsection}{线性变换和激活函数}{24}{section*.27}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{统计机器翻译开源系统}{31}{section*.19}
+\contentsline {subsubsection}{单层神经网络$\rightarrow $多层神经网络}{26}{section*.34}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{神经机器翻译开源系统}{32}{section*.20}
+\contentsline {subsection}{\numberline {1.2.4}函数拟合能力}{26}{subsection.1.2.4}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.7.2}常用数据集及公开评测任务}{34}{subsection.1.7.2}
+\contentsline {section}{\numberline {1.3}神经网络的张量实现}{31}{section.1.3}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {1.8}推荐学习资源}{36}{section.1.8}
+\contentsline {subsection}{\numberline {1.3.1} 张量及其计算}{32}{subsection.1.3.1}
 \defcounter {refsection}{0}\relax 
-\contentsline {chapter}{\numberline {2}词法、语法及统计建模基础}{41}{chapter.2}
+\contentsline {subsubsection}{张量}{32}{section*.44}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {2.1}问题概述 }{42}{section.2.1}
+\contentsline {subsubsection}{张量的矩阵乘法}{34}{section*.47}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {2.2}概率论基础}{43}{section.2.2}
+\contentsline {subsubsection}{张量的单元操作}{35}{section*.49}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {2.2.1}随机变量和概率}{43}{subsection.2.2.1}
+\contentsline {subsection}{\numberline {1.3.2}张量的物理存储形式}{36}{subsection.1.3.2}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {2.2.2}联合概率、条件概率和边缘概率}{45}{subsection.2.2.2}
+\contentsline {subsection}{\numberline {1.3.3}使用开源框架实现张量计算}{36}{subsection.1.3.3}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {2.2.3}链式法则}{46}{subsection.2.2.3}
+\contentsline {subsection}{\numberline {1.3.4}神经网络中的前向传播}{40}{subsection.1.3.4}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {2.2.4}贝叶斯法则}{47}{subsection.2.2.4}
+\contentsline {subsection}{\numberline {1.3.5}神经网络实例}{41}{subsection.1.3.5}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {2.2.5}KL距离和熵}{49}{subsection.2.2.5}
+\contentsline {section}{\numberline {1.4}神经网络的参数训练}{42}{section.1.4}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{信息熵}{49}{section*.27}
+\contentsline {subsection}{\numberline {1.4.1}损失函数}{43}{subsection.1.4.1}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{KL距离}{50}{section*.29}
+\contentsline {subsection}{\numberline {1.4.2}基于梯度的参数优化}{44}{subsection.1.4.2}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{交叉熵}{50}{section*.30}
+\contentsline {subsubsection}{（一）梯度下降}{45}{section*.67}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {2.3}中文分词}{51}{section.2.3}
+\contentsline {subsubsection}{（二）梯度获取}{47}{section*.69}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {2.3.1}基于词典的分词方法}{52}{subsection.2.3.1}
+\contentsline {subsubsection}{（三）基于梯度的方法的变种和改进}{49}{section*.73}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {2.3.2}基于统计的分词方法}{53}{subsection.2.3.2}
+\contentsline {subsection}{\numberline {1.4.3}参数更新的并行化策略}{52}{subsection.1.4.3}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{统计模型的学习与推断}{53}{section*.34}
+\contentsline {subsection}{\numberline {1.4.4}梯度消失、梯度爆炸和稳定性训练}{54}{subsection.1.4.4}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{掷骰子游戏}{54}{section*.36}
+\contentsline {subsubsection}{（一）梯度消失现象及解决方法}{54}{section*.76}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{全概率分词方法}{56}{section*.40}
+\contentsline {subsubsection}{（二）梯度爆炸现象及解决方法}{55}{section*.80}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {2.4}$n$-gram语言模型 }{58}{section.2.4}
+\contentsline {subsubsection}{（三）稳定性训练}{56}{section*.81}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {2.4.1}建模}{59}{subsection.2.4.1}
+\contentsline {subsection}{\numberline {1.4.5}过拟合}{57}{subsection.1.4.5}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {2.4.2}未登录词和平滑算法}{61}{subsection.2.4.2}
+\contentsline {subsection}{\numberline {1.4.6}反向传播}{58}{subsection.1.4.6}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{加法平滑方法}{62}{section*.46}
+\contentsline {subsubsection}{（一）输出层的反向传播}{59}{section*.84}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{古德-图灵估计法}{63}{section*.48}
+\contentsline {subsubsection}{（二）隐藏层的反向传播}{61}{section*.88}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{Kneser-Ney平滑方法}{64}{section*.50}
+\contentsline {subsubsection}{（三）程序实现}{62}{section*.91}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {2.5}句法分析（短语结构分析）}{66}{section.2.5}
+\contentsline {section}{\numberline {1.5}神经语言模型}{63}{section.1.5}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {2.5.1}句子的句法树表示}{66}{subsection.2.5.1}
+\contentsline {subsection}{\numberline {1.5.1}基于神经网络的语言建模}{64}{subsection.1.5.1}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {2.5.2}上下文无关文法}{68}{subsection.2.5.2}
+\contentsline {subsubsection}{（一）基于前馈神经网络的语言模型}{65}{section*.94}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {2.5.3}规则和推导的概率}{72}{subsection.2.5.3}
+\contentsline {subsubsection}{（二）基于循环神经网络的语言模型}{67}{section*.97}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {2.6}小结及深入阅读}{74}{section.2.6}
+\contentsline {subsubsection}{（三）基于自注意力机制的语言模型}{68}{section*.99}
 \defcounter {refsection}{0}\relax 
-\contentsline {part}{\@mypartnumtocformat {II}{统计机器翻译}}{77}{part.2}
-\ttl@stoptoc {default@1}
-\ttl@starttoc {default@2}
+\contentsline {subsubsection}{（四）语言模型的评价}{69}{section*.101}
 \defcounter {refsection}{0}\relax 
-\contentsline {chapter}{\numberline {3}基于词的机器翻译模型}{79}{chapter.3}
+\contentsline {subsection}{\numberline {1.5.2}单词表示模型}{70}{subsection.1.5.2}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {3.1}什么是基于词的翻译模型}{79}{section.3.1}
+\contentsline {subsubsection}{（一）One-hot编码}{70}{section*.102}
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {3.2}构建一个简单的机器翻译系统}{81}{section.3.2}
+\contentsline {subsubsection}{（二）分布式表示}{70}{section*.104}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.2.1}如何进行翻译？}{81}{subsection.3.2.1}
+\contentsline {subsection}{\numberline {1.5.3}句子表示模型及预训练}{72}{subsection.1.5.3}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（二）机器翻译流程}{82}{section*.63}
+\contentsline {subsubsection}{（一）简单的上下文表示模型}{72}{section*.108}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（三）人工 vs. 机器}{83}{section*.65}
+\contentsline {subsubsection}{（二）ELMO模型}{74}{section*.111}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.2.2}基本框架}{83}{subsection.3.2.2}
+\contentsline {subsubsection}{（三）GPT模型}{75}{section*.113}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.2.3}单词翻译概率}{84}{subsection.3.2.3}
+\contentsline {subsubsection}{（四）BERT模型}{75}{section*.115}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（一）什么是单词翻译概率？}{84}{section*.67}
+\contentsline {subsubsection}{（五）为什么要预训练？}{76}{section*.117}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（二）如何从一个双语平行数据中学习？}{85}{section*.69}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（三）如何从大量的双语平行数据中学习？}{86}{section*.70}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.2.4}句子级翻译模型}{87}{subsection.3.2.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（一）句子级翻译的基础模型}{87}{section*.72}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（二）生成流畅的译文}{89}{section*.74}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.2.5}解码}{90}{subsection.3.2.5}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {3.3}基于词的翻译建模}{93}{section.3.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.3.1}噪声信道模型}{93}{subsection.3.3.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.3.2}统计机器翻译的三个基本问题}{96}{subsection.3.3.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{词对齐}{97}{section*.83}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于词对齐的翻译模型}{98}{section*.86}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于词对齐的翻译实例}{99}{section*.88}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {3.4}IBM模型1-2}{100}{section.3.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.4.1}IBM模型1}{100}{subsection.3.4.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.4.2}IBM模型2}{102}{subsection.3.4.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.4.3}解码及计算优化}{103}{subsection.3.4.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.4.4}训练}{104}{subsection.3.4.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（一）目标函数}{104}{section*.93}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（二）优化}{105}{section*.95}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {3.5}IBM模型3-5及隐马尔可夫模型}{110}{section.3.5}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.5.1}基于产出率的翻译模型}{111}{subsection.3.5.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.5.2}IBM 模型3}{113}{subsection.3.5.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.5.3}IBM 模型4}{115}{subsection.3.5.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.5.4} IBM 模型5}{116}{subsection.3.5.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.5.5}隐马尔可夫模型}{118}{subsection.3.5.5}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{隐马尔可夫模型}{118}{section*.107}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{词对齐模型}{119}{section*.109}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.5.6}解码和训练}{120}{subsection.3.5.6}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {3.6}问题分析}{121}{section.3.6}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.6.1}词对齐及对称化}{121}{subsection.3.6.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.6.2}Deficiency}{122}{subsection.3.6.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.6.3}句子长度}{123}{subsection.3.6.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {3.6.4}其它问题}{123}{subsection.3.6.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {3.7}小结及深入阅读}{123}{section.3.7}
-\defcounter {refsection}{0}\relax 
-\contentsline {chapter}{\numberline {4}基于短语和句法的机器翻译模型}{125}{chapter.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {4.1}翻译中的结构信息}{125}{section.4.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.1.1}更大粒度的翻译单元}{127}{subsection.4.1.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.1.2}句子的结构信息}{128}{subsection.4.1.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {4.2}基于短语的翻译模型}{130}{section.4.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.2.1}机器翻译中的短语}{130}{subsection.4.2.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.2.2}数学建模及判别式模型}{133}{subsection.4.2.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于翻译推导的建模}{133}{section*.121}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{对数线性模型}{134}{section*.122}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{搭建模型的基本流程}{135}{section*.123}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.2.3}短语抽取}{136}{subsection.4.2.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{与词对齐一致的短语}{136}{section*.126}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{获取词对齐}{137}{section*.130}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{度量双语短语质量}{138}{section*.132}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.2.4}调序}{140}{subsection.4.2.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于距离的调序}{140}{section*.136}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于方向的调序}{141}{section*.138}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于分类的调序}{142}{section*.141}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.2.5}特征}{143}{subsection.4.2.5}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.2.6}最小错误率训练}{143}{subsection.4.2.6}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.2.7}栈解码}{147}{subsection.4.2.7}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{翻译候选匹配}{148}{section*.146}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{翻译假设扩展}{148}{section*.148}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{剪枝}{149}{section*.150}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{解码中的栈结构}{150}{section*.152}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {4.3}基于层次短语的模型}{151}{section.4.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.3.1}同步上下文无关文法}{154}{subsection.4.3.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{文法定义}{155}{section*.157}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{推导}{156}{section*.158}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{胶水规则}{157}{section*.159}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{处理流程}{158}{section*.160}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.3.2}层次短语规则抽取}{158}{subsection.4.3.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.3.3}翻译模型及特征}{160}{subsection.4.3.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.3.4}CYK解码}{161}{subsection.4.3.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.3.5}立方剪枝}{164}{subsection.4.3.5}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {4.4}基于语言学句法的模型}{166}{section.4.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.4.1}基于句法的翻译模型分类}{169}{subsection.4.4.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.4.2}基于树结构的文法}{171}{subsection.4.4.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{树到树翻译规则}{172}{section*.176}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于树结构的翻译推导}{173}{section*.178}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{树到串翻译规则}{175}{section*.181}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.4.3}树到串翻译规则抽取}{176}{subsection.4.4.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{树的切割与最小规则}{177}{section*.183}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{空对齐处理}{180}{section*.189}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{组合规则}{181}{section*.191}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{SPMT规则}{183}{section*.193}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{句法树二叉化}{184}{section*.195}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.4.4}树到树翻译规则抽取}{185}{subsection.4.4.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于节点对齐的规则抽取}{186}{section*.199}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于对齐矩阵的规则抽取}{187}{section*.202}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.4.5}句法翻译模型的特征}{187}{subsection.4.4.5}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.4.6}基于超图的推导空间表示}{189}{subsection.4.4.6}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {4.4.7}基于树的解码 vs 基于串的解码}{193}{subsection.4.4.7}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于树的解码}{194}{section*.209}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于串的解码}{194}{section*.212}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {4.5}小结及深入阅读}{196}{section.4.5}
-\defcounter {refsection}{0}\relax 
-\contentsline {part}{\@mypartnumtocformat {III}{神经机器翻译}}{199}{part.3}
-\ttl@stoptoc {default@2}
-\ttl@starttoc {default@3}
-\defcounter {refsection}{0}\relax 
-\contentsline {chapter}{\numberline {5}人工神经网络和神经语言建模}{201}{chapter.5}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {5.1}深度学习与人工神经网络}{202}{section.5.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.1.1}发展简史}{202}{subsection.5.1.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（一）早期的人工神经网络和第一次寒冬}{202}{section*.214}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（二）神经网络的第二次高潮和第二次寒冬}{203}{section*.215}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（三）深度学习和神经网络的崛起}{204}{section*.216}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.1.2}为什么需要深度学习}{205}{subsection.5.1.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（一）端到端学习和表示学习}{205}{section*.218}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（二）深度学习的效果}{206}{section*.220}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {5.2}神经网络基础}{206}{section.5.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.2.1}线性代数基础}{206}{subsection.5.2.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{标量、向量和矩阵}{207}{section*.222}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{矩阵的转置}{208}{section*.223}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{矩阵加法和数乘}{208}{section*.224}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{矩阵乘法和矩阵点乘}{209}{section*.225}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{线性映射}{210}{section*.226}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{范数}{211}{section*.227}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.2.2}人工神经元和感知机}{212}{subsection.5.2.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（一）感知机\ \raisebox {0.5mm}{------}\ 最简单的人工神经元模型}{212}{section*.230}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（二）神经元内部权重}{214}{section*.233}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（三）神经元的输入\ \raisebox {0.5mm}{------}\ 离散 vs 连续}{214}{section*.235}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（四）神经元内部的参数学习}{215}{section*.237}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.2.3}多层神经网络}{216}{subsection.5.2.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{线性变换和激活函数}{216}{section*.239}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{单层神经网络$\rightarrow $多层神经网络}{218}{section*.246}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.2.4}函数拟合能力}{218}{subsection.5.2.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {5.3}神经网络的张量实现}{224}{section.5.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.3.1} 张量及其计算}{224}{subsection.5.3.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{张量}{224}{section*.256}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{张量的矩阵乘法}{226}{section*.259}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{张量的单元操作}{227}{section*.261}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.3.2}张量的物理存储形式}{228}{subsection.5.3.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.3.3}使用开源框架实现张量计算}{229}{subsection.5.3.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.3.4}神经网络中的前向传播}{233}{subsection.5.3.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.3.5}神经网络实例}{234}{subsection.5.3.5}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {5.4}神经网络的参数训练}{235}{section.5.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.4.1}损失函数}{236}{subsection.5.4.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.4.2}基于梯度的参数优化}{237}{subsection.5.4.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（一）梯度下降}{238}{section*.279}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（二）梯度获取}{240}{section*.281}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（三）基于梯度的方法的变种和改进}{242}{section*.285}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.4.3}参数更新的并行化策略}{245}{subsection.5.4.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.4.4}梯度消失、梯度爆炸和稳定性训练}{247}{subsection.5.4.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（一）梯度消失现象及解决方法}{247}{section*.288}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（二）梯度爆炸现象及解决方法}{248}{section*.292}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（三）稳定性训练}{248}{section*.293}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.4.5}过拟合}{250}{subsection.5.4.5}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.4.6}反向传播}{251}{subsection.5.4.6}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（一）输出层的反向传播}{252}{section*.296}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（二）隐藏层的反向传播}{254}{section*.300}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（三）程序实现}{255}{section*.303}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {5.5}神经语言模型}{257}{section.5.5}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.5.1}基于神经网络的语言建模}{257}{subsection.5.5.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（一）基于前馈神经网络的语言模型}{258}{section*.306}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（二）基于循环神经网络的语言模型}{260}{section*.309}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（三）基于自注意力机制的语言模型}{261}{section*.311}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（四）语言模型的评价}{262}{section*.313}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.5.2}单词表示模型}{263}{subsection.5.5.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（一）One-hot编码}{263}{section*.314}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（二）分布式表示}{263}{section*.316}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {5.5.3}句子表示模型及预训练}{265}{subsection.5.5.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（一）简单的上下文表示模型}{265}{section*.320}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（二）ELMO模型}{267}{section*.323}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（三）GPT模型}{267}{section*.325}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（四）BERT模型}{268}{section*.327}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{（五）为什么要预训练？}{269}{section*.329}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {5.6}小结及深入阅读}{269}{section.5.6}
-\defcounter {refsection}{0}\relax 
-\contentsline {chapter}{\numberline {6}神经机器翻译模型}{271}{chapter.6}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {6.1}神经机器翻译的发展简史}{271}{section.6.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.1.1}神经机器翻译的起源}{273}{subsection.6.1.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.1.2}神经机器翻译的品质 }{275}{subsection.6.1.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.1.3}神经机器翻译的优势 }{278}{subsection.6.1.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {6.2}编码器-解码器框架}{280}{section.6.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.2.1}框架结构}{280}{subsection.6.2.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.2.2}表示学习}{281}{subsection.6.2.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.2.3}简单的运行实例}{282}{subsection.6.2.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.2.4}机器翻译范式的对比}{283}{subsection.6.2.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {6.3}基于循环神经网络的翻译模型及注意力机制}{284}{section.6.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.3.1}建模}{286}{subsection.6.3.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.3.2}输入（词嵌入）及输出（Softmax）}{288}{subsection.6.3.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.3.3}循环神经网络结构}{292}{subsection.6.3.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{循环神经单元（RNN）}{292}{section*.351}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{长短时记忆网络（LSTM）}{292}{section*.352}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{门控循环单元（GRU）}{294}{section*.355}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{双向模型}{295}{section*.357}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{多层循环神经网络}{297}{section*.359}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.3.4}注意力机制}{297}{subsection.6.3.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{翻译中的注意力机制}{298}{section*.362}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{上下文向量的计算}{299}{section*.365}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{注意力机制的解读}{302}{section*.370}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.3.5}训练}{304}{subsection.6.3.5}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{损失函数}{305}{section*.373}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{长参数初始化}{305}{section*.374}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{优化策略}{306}{section*.375}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{梯度裁剪}{306}{section*.377}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{学习率策略}{307}{section*.378}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{并行训练}{308}{section*.381}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.3.6}推断}{309}{subsection.6.3.6}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{贪婪搜索}{311}{section*.385}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{束搜索}{312}{section*.388}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{长度惩罚}{313}{section*.390}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.3.7}实例-GNMT}{314}{subsection.6.3.7}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {6.4}Transformer}{316}{section.6.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.4.1}自注意力模型}{317}{subsection.6.4.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.4.2}Transformer架构}{318}{subsection.6.4.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.4.3}位置编码}{320}{subsection.6.4.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.4.4}基于点乘的注意力机制}{322}{subsection.6.4.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.4.5}掩码操作}{324}{subsection.6.4.5}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.4.6}多头注意力}{326}{subsection.6.4.6}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.4.7}残差网络和层正则化}{327}{subsection.6.4.7}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.4.8}前馈全连接网络子层}{328}{subsection.6.4.8}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.4.9}训练}{329}{subsection.6.4.9}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.4.10}推断}{332}{subsection.6.4.10}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {6.5}序列到序列问题及应用}{332}{section.6.5}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.5.1}自动问答}{333}{subsection.6.5.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.5.2}自动文摘}{333}{subsection.6.5.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.5.3}文言文翻译}{333}{subsection.6.5.3}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.5.4}对联生成}{335}{subsection.6.5.4}
-\defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {6.5.5}古诗生成}{335}{subsection.6.5.5}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {6.6}小结及深入阅读}{335}{section.6.6}
-\defcounter {refsection}{0}\relax 
-\contentsline {part}{\@mypartnumtocformat {IV}{附录}}{339}{part.4}
-\ttl@stoptoc {default@3}
-\ttl@starttoc {default@4}
-\defcounter {refsection}{0}\relax 
-\contentsline {chapter}{\numberline {A}附录A}{341}{Appendix.1.A}
-\defcounter {refsection}{0}\relax 
-\contentsline {chapter}{\numberline {B}附录B}{343}{Appendix.2.B}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {B.1}IBM模型3训练方法}{343}{section.2.B.1}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {B.2}IBM模型4训练方法}{345}{section.2.B.2}
-\defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {B.3}IBM模型5训练方法}{346}{section.2.B.3}
+\contentsline {section}{\numberline {1.6}小结及深入阅读}{77}{section.1.6}
 \contentsfinish 
--- a/Book/structure.tex
+++ b/Book/structure.tex
@@ -74,7 +74,13 @@
 %	BIBLIOGRAPHY AND INDEX
 %----------------------------------------------------------------------------------------

+<<<<<<< HEAD
 \usepackage[style=numeric,citestyle=numeric,sorting=nyt,sortcites=true,autopunct=true,babel=hyphen,hyperref=true,abbreviate=false,backref=true,backend=biber,maxcitenames=2,maxbibnames=9999]{biblatex}
+=======
+\usepackage[style=numeric,citestyle=numeric,sorting=nyt,sortcites=true,maxbibnames=4,minbibnames=3,autopunct=true,babel=hyphen,hyperref=true,abbreviate=false,backref=true,backend=biber]{biblatex}
+%maxbibnames sets the maximum number of authors shown in a bibliography entry
+%minbibnames: if the number of authors exceeds maxbibnames, only minbibnames authors are shown
+>>>>>>> ff00c6456a4d3ce5dfdb91fe6c58c7fc2b64f059
 \addbibresource{bibliography.bib} % BibTeX bibliography file
 \defbibheading{bibempty}{}


--- a/Section03-Word-Based-Models/section03.tex
+++ b/Section03-Word-Based-Models/section03.tex
@@ -3824,7 +3824,7 @@ s.t. $\forall t_y: \sum_{s_x} f(s_x|t_y) =1 $ & \\
 \begin{eqnarray}
 \frac{\partial L(f,\lambda)}{\partial f(s_u|t_v)} & = & \frac{\partial \big[ \frac{\epsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)} - \nonumber \\
 &   & \frac{\partial \big[ \sum_{t_y} \lambda_{t_y} (\sum_{s_x} f(s_x|t_y) -1) \big]}{\partial f(s_u|t_v)} \nonumber \\
- & = & \frac{\epsilon}{(l+1)^{m}} \cdot \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_{a_j}) \big]}{\partial f(s_u|t_v)} - \lambda_{t_v} \nonumber
+ & = & \frac{\epsilon}{(l+1)^{m}} \cdot \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)} - \lambda_{t_v} \nonumber
 \end{eqnarray}

 \vspace{-0.3em}