update

Commit ff00c645 authored Apr 16, 2020 by 曹润柘
--- a/.gitignore
+++ b/.gitignore
@@ -26,7 +26,7 @@ Section07-Making-A-Strong-NMT-System/section07.pdf
 Section07-Towards-Strong-NMT-Systems/section07.pdf
 Book/mt-book.run.xml
 Book/mt-book-xelatex.bcf
-Book/mt-book-xelatex.idx
+
 Book/mt-book-xelatex.run.xml
 Book/mt-book-xelatex.synctex(busy)
 Book/mt-book-xelatex.pdf
--- a/Book/Chapter5/chapter5.tex
+++ b/Book/Chapter5/chapter5.tex
@@ -86,11 +86,11 @@
 %--5.1.2 Why deep learning is needed---------------------
 \subsection{Why Deep Learning Is Needed}\index{Chapter5.1.2}

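-\parinterval Deep neural networks provide a new mechanism: learning the relationship between input and output directly, which is usually called ``end-to-end learning''. Unlike traditional methods, ``end-to-end learning'' does not require hand-defined features or heavy prior assumptions; the whole learning process is carried out by a single model. Seen from outside, the model merely establishes a mapping from input to output, and how that mapping is formed is determined entirely by the model's structure and parameters. The greatest benefit is that modeling requires neither feature engineering nor human assumptions about the hidden structure of the problem, so the model can learn more ``freely''. End-to-end learning also raises a new question\ \dash \ how should a problem be represented? This is the so-called representation learning problem. In the deep learning era, the representations of a problem's input and output are no longer regularities that humans obtain by simple summarization, but computable ``quantities'' that the computer describes for itself. Because such representations can be learned automatically, they greatly improve the computer's ability to handle complex phenomena such as language and text.
+\parinterval Deep neural networks provide a simple learning mechanism: learning the relationship between input and output directly, which is usually called {\small\bfnew{end-to-end learning}}. Unlike traditional methods, end-to-end learning does not require hand-defined features or heavy prior assumptions; the whole learning process is carried out by a single model. Seen from outside, the model merely establishes a mapping from input to output, and how that mapping is formed is determined entirely by the model's structure and parameters. The greatest benefit is that the model can learn more ``freely''. End-to-end learning also raises a new question\ \dash \ how should a problem be represented? This is the so-called {\small\bfnew{representation learning}} problem. In the deep learning era, the representations of a problem's input and output are no longer regularities that humans obtain by simple summarization, but computable ``quantities'' that the computer describes for itself, such as a real-valued vector. Because such representations can be learned automatically, they greatly improve the computer's ability to handle complex phenomena such as language and text.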
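 %--5.1.2.1 End-to-end learning and representation learning---------------------
-\subsubsection{(1) End-to-End Learning and Representation Learning}\index{Chapter5.1.2.1}
+\subsubsection{End-to-End Learning and Representation Learning}\index{Chapter5.1.2.1}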

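-\parinterval With end-to-end learning, machine learning no longer needs the tedious data preprocessing, feature selection, and dimensionality reduction of traditional feature engineering. Instead, an artificial neural network automatically extracts and combines more complex features from simple ones, which greatly improves both model capability and engineering efficiency. Take the image classification task in Figure \ref{fig:vs} as an example. In the traditional approach, image classification goes through many stages: hand-designed image features are extracted first, their dimensionality is reduced, and a classification algorithm such as an SVM then assigns the class. Compared with this multi-stage pipeline, end-to-end deep learning trains a neural network whose input is the pixel representation of the image and whose output is the class label directly.
+\parinterval With end-to-end learning, machine learning no longer needs the tedious data preprocessing, feature selection, and dimensionality reduction of traditional feature engineering. Instead, an artificial neural network automatically extracts and combines more complex features from simple ones, which greatly improves both model capability and engineering efficiency. Take the image classification task in Figure \ref{fig:vs} as an example. In the traditional approach, image classification goes through many stages: hand-designed image features are extracted first, their dimensionality is reduced, and a classification algorithm such as an SVM then assigns the class. Compared with this multi-stage, pipeline-like process, end-to-end deep learning trains only one neural network whose input is the pixel representation of the image and whose output is the class label directly.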
 %----------------------------------------------
 % Figure
    \begin{figure}
@@ -113,7 +113,7 @@
 \end {figure}
 %-------------------------------------------

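-\parinterval Traditional machine learning is mostly based on feature engineering and requires a large number of hand-defined features, whose construction often brings implicit assumptions about the problem. This approach has three problems:
+\parinterval Traditional machine learning requires a large number of hand-defined features, whose construction often brings implicit assumptions about the problem. This approach has three problems: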

 \vspace{0.5em}
 \begin{itemize}
@@ -125,18 +125,17 @@
 \end{itemize}
 \vspace{0.5em}

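-\parinterval End-to-end learning frees people from a large amount of feature extraction work. It does not need much human prior knowledge; the description of the problem is based entirely on what the neural network learns. In a sense, feature extraction is fully automatic, which means that even someone who is not an ``expert'' in a task can complete it. Moreover, because end-to-end learning does not depend on manual intervention, it in effect provides a new form of representation for the problem, such as distributed representation. In this framework, the model input can be described as a distributed real-valued vector, so the model has more dimensions with which to describe an object and avoids the discrete characterization of the world imposed by traditional symbolic systems. In natural language processing, for example, representation learning redefines what a word is and what a sentence is. As later parts of this chapter show, representation learning provides a new capability that lets the computer describe language more accurately and fully.
+\parinterval End-to-end learning frees people from a large amount of feature extraction work and needs little human prior knowledge. In a sense, feature extraction is fully automatic, which means that even someone who is not an ``expert'' in a task can develop a system for it. Moreover, end-to-end learning in effect implies a new form of representation for the problem\ \dash \ {\small\bfnew{distributed representation}}. In this framework, the model input can be described as a distributed real-valued vector, so the model has more dimensions with which to describe an object and avoids the discrete characterization of the world imposed by traditional symbolic systems. In natural language processing, for example, representation learning redefines what a word is and what a sentence is. As later parts of this chapter show, representation learning lets the computer describe language more accurately and fully.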
 %--5.1.2.2 The effectiveness of deep learning---------------------
-\subsubsection{(2) The Effectiveness of Deep Learning}\index{Chapter5.1.2.2}
+\subsubsection{The Effectiveness of Deep Learning}\index{Chapter5.1.2.2}

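-\parinterval Compared with traditional feature-engineering methods, deep learning models are more convenient and more general, and they usually perform better as well. Take language modeling as an example. The goal of language modeling is to build a model that describes how likely a word sequence is to occur. The task has a long history. Table \ref{tab1} lists the perplexity of different methods on the standard PTB dataset\footnote{Lower perplexity indicates better language modeling.}. The traditional $ n-{\rm{gram}} $ language model suffers from the curse of dimensionality and data sparsity, so its performance is not very good. Deep learning models, by introducing structures such as recurrent neural networks, produce language models that describe sequence generation better. The latest Transformer-based language model brings PPL down from the initial 178.0 to a striking 35.7, which shows how much progress deep learning has brought to this task.
+\parinterval Compared with traditional feature-engineering methods, deep learning models are more convenient and more general, and they usually perform better as well. Take language modeling as an example. The goal of language modeling is to build a model that describes how likely a word sequence is to occur (see Chapter 2). The task has a long history. Table \ref{tab1} lists the perplexity of different methods on the standard PTB dataset\footnote{Lower perplexity indicates better language modeling.}. The traditional $n$-gram language model suffers from the curse of dimensionality and data sparsity, so its performance is not very good. Deep learning models, by introducing structures such as recurrent neural networks, produce language models that describe sequence generation better. The latest Transformer-based language model brings PPL down from the initial 178.0 to a striking 35.7, which shows how much progress deep learning has brought to this task.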

 %Table 1--------------------------------------------------------------------
 \begin{table}[htp]
 \centering
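-\caption{Perplexity (PPL) of different methods on the PTB language modeling task}
+\caption{Perplexity (PPL) of different methods on the PTB language modeling task ({\red add references below!})}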
 \label{tab1}
-\small
 \begin{tabular}{l | l l l}
 \rule{0pt}{15pt}     Model & Author & Year & PPL  \\
 \hline

--- a/Book/mt-book-xelatex.idx
+++ b/Book/mt-book-xelatex.idx
+\indexentry{Chapter5.1|hyperpage}{10}
+\indexentry{Chapter5.1.1|hyperpage}{10}
+\indexentry{Chapter5.1.1.1|hyperpage}{10}
+\indexentry{Chapter5.1.1.2|hyperpage}{11}
+\indexentry{Chapter5.1.1.3|hyperpage}{12}
+\indexentry{Chapter5.1.2|hyperpage}{13}
+\indexentry{Chapter5.1.2.1|hyperpage}{13}
+\indexentry{Chapter5.1.2.2|hyperpage}{14}
+\indexentry{Chapter5.2|hyperpage}{14}
+\indexentry{Chapter5.2.1|hyperpage}{14}
+\indexentry{Chapter5.2.1.1|hyperpage}{15}
+\indexentry{Chapter5.2.1.2|hyperpage}{16}
+\indexentry{Chapter5.2.1.3|hyperpage}{16}
+\indexentry{Chapter5.2.1.4|hyperpage}{17}
+\indexentry{Chapter5.2.1.5|hyperpage}{18}
+\indexentry{Chapter5.2.1.6|hyperpage}{19}
+\indexentry{Chapter5.2.2|hyperpage}{20}
+\indexentry{Chapter5.2.2.1|hyperpage}{20}
+\indexentry{Chapter5.2.2.2|hyperpage}{22}
+\indexentry{Chapter5.2.2.3|hyperpage}{22}
+\indexentry{Chapter5.2.2.4|hyperpage}{23}
+\indexentry{Chapter5.2.3|hyperpage}{24}
+\indexentry{Chapter5.2.3.1|hyperpage}{24}
+\indexentry{Chapter5.2.3.2|hyperpage}{26}
+\indexentry{Chapter5.2.4|hyperpage}{26}
+\indexentry{Chapter5.3|hyperpage}{31}
+\indexentry{Chapter5.3.1|hyperpage}{32}
+\indexentry{Chapter5.3.1.1|hyperpage}{32}
+\indexentry{Chapter5.3.1.2|hyperpage}{34}
+\indexentry{Chapter5.3.1.3|hyperpage}{35}
+\indexentry{Chapter5.3.2|hyperpage}{36}
+\indexentry{Chapter5.3.3|hyperpage}{36}
+\indexentry{Chapter5.3.4|hyperpage}{40}
+\indexentry{Chapter5.3.5|hyperpage}{41}
+\indexentry{Chapter5.4|hyperpage}{42}
+\indexentry{Chapter5.4.1|hyperpage}{43}
+\indexentry{Chapter5.4.2|hyperpage}{44}
+\indexentry{Chapter5.4.2.1|hyperpage}{45}
+\indexentry{Chapter5.4.2.2|hyperpage}{47}
+\indexentry{Chapter5.4.2.3|hyperpage}{49}
+\indexentry{Chapter5.4.3|hyperpage}{52}
+\indexentry{Chapter5.4.4|hyperpage}{54}
+\indexentry{Chapter5.4.4.1|hyperpage}{54}
+\indexentry{Chapter5.4.4.2|hyperpage}{55}
+\indexentry{Chapter5.4.4.3|hyperpage}{56}
+\indexentry{Chapter5.4.5|hyperpage}{57}
+\indexentry{Chapter5.4.6|hyperpage}{58}
+\indexentry{Chapter5.4.6.1|hyperpage}{59}
+\indexentry{Chapter5.4.6.2|hyperpage}{61}
+\indexentry{Chapter5.4.6.3|hyperpage}{62}
+\indexentry{Chapter5.5|hyperpage}{63}
+\indexentry{Chapter5.5.1|hyperpage}{64}
+\indexentry{Chapter5.5.1.1|hyperpage}{65}
+\indexentry{Chapter5.5.1.2|hyperpage}{67}
+\indexentry{Chapter5.5.1.3|hyperpage}{68}
+\indexentry{Chapter5.5.1.4|hyperpage}{69}
+\indexentry{Chapter5.5.2|hyperpage}{70}
+\indexentry{Chapter5.5.2.1|hyperpage}{70}
+\indexentry{Chapter5.5.2.2|hyperpage}{70}
+\indexentry{Chapter5.5.3|hyperpage}{72}
+\indexentry{Chapter5.5.3.1|hyperpage}{72}
+\indexentry{Chapter5.5.3.2|hyperpage}{74}
+\indexentry{Chapter5.5.3.3|hyperpage}{75}
+\indexentry{Chapter5.5.3.4|hyperpage}{75}
+\indexentry{Chapter5.5.3.5|hyperpage}{76}
+\indexentry{Chapter5.6|hyperpage}{77}
--- a/Book/mt-book-xelatex.ptc
+++ b/Book/mt-book-xelatex.ptc
+\boolfalse {citerequest}\boolfalse {citetracker}\boolfalse {pagetracker}\boolfalse {backtracker}\relax 
+\defcounter {refsection}{0}\relax 
+\select@language {english}
+\defcounter {refsection}{0}\relax 
+\contentsline {part}{\@mypartnumtocformat {I}{Neural Machine Translation}}{7}{part.1}
+\ttl@starttoc {default@1}
+\defcounter {refsection}{0}\relax
+\contentsline {chapter}{\numberline {1}Artificial Neural Networks and Neural Language Modeling}{9}{chapter.1}
+\defcounter {refsection}{0}\relax
+\contentsline {section}{\numberline {1.1}Deep Learning and Artificial Neural Networks}{10}{section.1.1}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.1.1}A Brief History}{10}{subsection.1.1.1}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{Early Artificial Neural Networks and the First Winter}{10}{section*.2}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{The Second Boom and the Second Winter of Neural Networks}{11}{section*.3}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{The Rise of Deep Learning and Neural Network Methods}{12}{section*.4}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.1.2}Why Deep Learning Is Needed}{13}{subsection.1.1.2}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{End-to-End Learning and Representation Learning}{13}{section*.6}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{The Effectiveness of Deep Learning}{14}{section*.8}
+\defcounter {refsection}{0}\relax
+\contentsline {section}{\numberline {1.2}Neural Network Basics}{14}{section.1.2}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.2.1}Linear Algebra Basics}{14}{subsection.1.2.1}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{Scalars, Vectors, and Matrices}{15}{section*.10}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{Matrix Transpose}{16}{section*.11}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{Matrix Addition and Scalar Multiplication}{16}{section*.12}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{Matrix Multiplication and Element-wise Product}{17}{section*.13}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{Linear Mapping}{18}{section*.14}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{Norms}{19}{section*.15}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.2.2}Artificial Neurons and the Perceptron}{20}{subsection.1.2.2}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(1) The Perceptron\ \raisebox {0.5mm}{------}\ The Simplest Artificial Neuron Model}{20}{section*.18}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(2) Weights Inside a Neuron}{22}{section*.21}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(3) Neuron Inputs\ \raisebox {0.5mm}{------}\ Discrete vs. Continuous}{22}{section*.23}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(4) Parameter Learning Inside a Neuron}{23}{section*.25}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.2.3}Multi-Layer Neural Networks}{24}{subsection.1.2.3}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{Linear Transformations and Activation Functions}{24}{section*.27}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{Single-Layer Neural Networks $\rightarrow $ Multi-Layer Neural Networks}{26}{section*.34}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.2.4}Function Approximation Capability}{26}{subsection.1.2.4}
+\defcounter {refsection}{0}\relax
+\contentsline {section}{\numberline {1.3}Tensor Implementation of Neural Networks}{31}{section.1.3}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.3.1} Tensors and Their Computation}{32}{subsection.1.3.1}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{Tensors}{32}{section*.44}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{Matrix Multiplication of Tensors}{34}{section*.47}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{Element-wise Operations on Tensors}{35}{section*.49}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.3.2}Physical Storage of Tensors}{36}{subsection.1.3.2}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.3.3}Tensor Computation with Open-Source Frameworks}{36}{subsection.1.3.3}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.3.4}Forward Propagation in Neural Networks}{40}{subsection.1.3.4}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.3.5}Neural Network Examples}{41}{subsection.1.3.5}
+\defcounter {refsection}{0}\relax
+\contentsline {section}{\numberline {1.4}Parameter Training of Neural Networks}{42}{section.1.4}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.4.1}Loss Functions}{43}{subsection.1.4.1}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.4.2}Gradient-Based Parameter Optimization}{44}{subsection.1.4.2}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(1) Gradient Descent}{45}{section*.67}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(2) Computing Gradients}{47}{section*.69}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(3) Variants and Improvements of Gradient-Based Methods}{49}{section*.73}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.4.3}Parallelization Strategies for Parameter Updates}{52}{subsection.1.4.3}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.4.4}Vanishing Gradients, Exploding Gradients, and Stable Training}{54}{subsection.1.4.4}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(1) Vanishing Gradients and Remedies}{54}{section*.76}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(2) Exploding Gradients and Remedies}{55}{section*.80}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(3) Stable Training}{56}{section*.81}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.4.5}Overfitting}{57}{subsection.1.4.5}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.4.6}Backpropagation}{58}{subsection.1.4.6}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(1) Backpropagation at the Output Layer}{59}{section*.84}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(2) Backpropagation at Hidden Layers}{61}{section*.88}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(3) Implementation}{62}{section*.91}
+\defcounter {refsection}{0}\relax
+\contentsline {section}{\numberline {1.5}Neural Language Models}{63}{section.1.5}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.5.1}Neural Network Based Language Modeling}{64}{subsection.1.5.1}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(1) Feed-Forward Neural Network Language Models}{65}{section*.94}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(2) Recurrent Neural Network Language Models}{67}{section*.97}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(3) Self-Attention Language Models}{68}{section*.99}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(4) Evaluating Language Models}{69}{section*.101}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.5.2}Word Representation Models}{70}{subsection.1.5.2}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(1) One-hot Encoding}{70}{section*.102}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(2) Distributed Representation}{70}{section*.104}
+\defcounter {refsection}{0}\relax
+\contentsline {subsection}{\numberline {1.5.3}Sentence Representation Models and Pre-training}{72}{subsection.1.5.3}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(1) Simple Contextual Representation Models}{72}{section*.108}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(2) The ELMO Model}{74}{section*.111}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(3) The GPT Model}{75}{section*.113}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(4) The BERT Model}{75}{section*.115}
+\defcounter {refsection}{0}\relax
+\contentsline {subsubsection}{(5) Why Pre-train?}{76}{section*.117}
+\defcounter {refsection}{0}\relax
+\contentsline {section}{\numberline {1.6}Summary and Further Reading}{77}{section.1.6}
+\contentsfinish 
--- a/Book/mt-book.bbl
+++ b/Book/mt-book.bbl
--- a/Book/structure.tex
+++ b/Book/structure.tex
--- a/Section01-Introduction/section01.tex
+++ b/Section01-Introduction/section01.tex
+% !Mode:: "TeX:GBK"
+% !TEX encoding = GBK
+
 \def\CTeXPreproc{Created by ctex v0.2.13, don't edit!}
 \documentclass[cjk,t,compress,12pt]{beamer}
 %\documentclass{article}

--- a/Section02-Words-Trees-Probs/section02.tex
+++ b/Section02-Words-Trees-Probs/section02.tex
 % !Mode:: "TeX:GBK"
+% !TEX encoding = GBK

 \def\CTeXPreproc{Created by ctex v0.2.13, don't edit!}
 \documentclass[cjk,t,compress,12pt]{beamer}
@@ -57,14 +58,18 @@
 %\usetheme{Boadilla}
 %\usecolortheme{dolphin}

+\IfFileExists{C:/WINDOWS/win.ini}
+{\newcommand{\mycfont}{you}}
+{\newcommand{\mycfont}{gbsn}}
+

 \usefonttheme[onlylarge]{structurebold}

-\begin{CJK}{GBK}{song}
+\begin{CJK}{GBK}{\mycfont}
 \end{CJK}

 \setbeamerfont*{frametitle}{size=\large,series=\bfseries}
-\setbeamertemplate{navigation symbols}{\begin{CJK}{GBK}{hei} Chapter 2: Foundations of Lexical Analysis, Syntax, and Probabilistic Thinking \hspace*{2em} 肖桐\&朱靖波 \end{CJK} \hspace*{2em} \today \hspace*{2em} \insertframenumber{}/\inserttotalframenumber}
+\setbeamertemplate{navigation symbols}{\begin{CJK}{GBK}{\mycfont} Chapter 2: Foundations of Lexical Analysis, Syntax, and Probabilistic Thinking \hspace*{2em} 肖桐\&朱靖波 \end{CJK} \hspace*{2em} \today \hspace*{2em} \insertframenumber{}/\inserttotalframenumber}

 \setbeamertemplate{itemize items}[circle] % if you want a circle
 \setbeamertemplate{itemize subitem}[triangle] % if you want a triangle
@@ -72,7 +77,7 @@

 \begin{document}

-\begin{CJK}{GBK}{you}
+\begin{CJK}{GBK}{\mycfont}

 \title{\Large{Foundations of Lexical Analysis, Syntax, and Statistical Thinking}}
 \author{\large{\textbf{肖桐\ \ 朱靖波}}}

--- a/Section03-Word-Based-Models/section03.tex
+++ b/Section03-Word-Based-Models/section03.tex
- % !Mode:: "TeX:GBK"
+% !Mode:: "TeX:GBK"
+% !TEX encoding = GBK

 \def\CTeXPreproc{Created by ctex v0.2.13, don't edit!}
 \documentclass[cjk,t,compress,12pt]{beamer}
@@ -925,7 +926,7 @@
 \node [anchor=north west] (t1) at ([yshift=0.4em]s1.south west) {$t_1=$ Machine translation is just translation by computer};

 \node [anchor=north west] (s2) at (t1.south west) {$s_2=$ 那 人工 翻译 呢 ?};
-\node [anchor=north west] (t2) at ([yshift=0.4em]s2.south west) {$t_2=$ so , what is human translation ?};
+\node [anchor=north west] (t2) at ([yshift=0.4em]s2.south west) {$t_2=$ So , what is human translation ?};

 \end{tikzpicture}
 \end{flushleft}
@@ -936,7 +937,7 @@
 \begin{eqnarray}
 &   & \textrm{P}(\textrm{'翻译'},\textrm{'translation'}) \nonumber \\
 & = & \frac{c(\textrm{'翻译'},\textrm{'translation'};s^{[1]},t^{[1]})+c(\textrm{'翻译'},\textrm{'translation'};s^{[2]},t^{[2]})}{\sum_{x',y'} c(x',y';s^{[1]},t^{[1]}) + \sum_{x',y'} c(x',y';s^{[2]},t^{[2]})} \nonumber \\
-\visible<3->{& = & \frac{4 + 1}{|s^{[1]}| \times |t^{[1]}| + |s^{[2]}| \times |t^{[2]}|} = \frac{4 + 1}{9 \times 7 + 5 \times 7} = \frac{5}{102}} \nonumber
+\visible<3->{& = & \frac{4 + 1}{|s^{[1]}| \times |t^{[1]}| + |s^{[2]}| \times |t^{[2]}|} = \frac{4 + 1}{9 \times 7 + 5 \times 7} = \frac{5}{98}} \nonumber
 \end{eqnarray}
 }
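The corrected value on the last line of this hunk can be checked with a few lines of Python. This is only a sketch: the helper name `joint_prob` and its argument layout are assumptions made for illustration, while the counts (4 and 1 co-occurrences, sentence lengths 9 by 7 and 5 by 7) are the ones shown on the slide.

```python
from fractions import Fraction


def joint_prob(pair_counts, sent_lengths):
    """Relative-frequency estimate of P(x, y) over several sentence pairs.

    pair_counts  -- co-occurrence count c(x, y; s, t), one entry per sentence pair
    sent_lengths -- (|s|, |t|) for each sentence pair; the total number of
                    word pairs in one sentence pair is |s| * |t|
    """
    numerator = sum(pair_counts)
    denominator = sum(len_s * len_t for len_s, len_t in sent_lengths)
    return Fraction(numerator, denominator)


# Values from the slide: (4 + 1) / (9 * 7 + 5 * 7)
p = joint_prob([4, 1], [(9, 7), (5, 7)])
print(p)  # 5/98
```

The denominator is 63 + 35 = 98, which agrees with the added line and not with the removed 102.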

@@ -3823,7 +3824,7 @@ s.t. $\forall t_y: \sum_{s_x} f(s_x|t_y) =1 $ & \\
 \begin{eqnarray}
 \frac{\partial L(f,\lambda)}{\partial f(s_u|t_v)} & = & \frac{\partial \big[ \frac{\epsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)} - \nonumber \\
 &   & \frac{\partial \big[ \sum_{t_y} \lambda_{t_y} (\sum_{s_x} f(s_x|t_y) -1) \big]}{\partial f(s_u|t_v)} \nonumber \\
- & = & \frac{\epsilon}{(l+1)^{m}} \cdot \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_{a_j}) \big]}{\partial f(s_u|t_v)} - \lambda_{t_v} \nonumber
+ & = & \frac{\epsilon}{(l+1)^{m}} \cdot \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)} - \lambda_{t_v} \nonumber
 \end{eqnarray}

 \vspace{-0.3em}
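One possible next step of this derivation, given here as a hedged sketch and not as the slide's own continuation, expands the remaining partial derivative with the product rule. The indicator $\delta(\cdot,\cdot)$ (1 if its two arguments are equal, 0 otherwise) is notation assumed for this sketch, and the second line further assumes $\sum_{i=0}^{l} f(s_u|t_i) \neq 0$.

```latex
\begin{eqnarray}
\frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)}
& = & \sum\limits_{j=1}^{m} \Big[ \delta(s_j,s_u) \sum\limits_{i=0}^{l} \delta(t_i,t_v) \Big] \prod\limits_{j' \neq j} \sum\limits_{i=0}^{l} f(s_{j'}|t_i) \nonumber \\
& = & \frac{\sum\limits_{j=1}^{m} \delta(s_j,s_u) \sum\limits_{i=0}^{l} \delta(t_i,t_v)}{\sum\limits_{i=0}^{l} f(s_u|t_i)} \cdot \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \nonumber
\end{eqnarray}
```

The second line follows because $\delta(s_j,s_u)$ is nonzero only when $s_j = s_u$, so the missing factor of the product is always $\sum_{i=0}^{l} f(s_u|t_i)$.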

--- a/Section04-Phrasal-and-Syntactic-Models/section04.tex
+++ b/Section04-Phrasal-and-Syntactic-Models/section04.tex
@@ -981,7 +981,7 @@
 \begin{frame}{Phrase-Based Translation Derivation}
 \begin{beamerboxesrounded}[upper=uppercolblue,lower=lowercolblue,shadow=true]{Definition - Phrase-Based Translation Derivation}
 {\small
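-For a source and target sentence pair ($\textbf{s}, \textbf{t}$), suppose there are $l$ phrase pairs $\{(\bar{s}_i,\bar{t}_j)\}$ such that all source phrases $\{\bar{s}_i\}$ and all target phrases $\{\bar{t}_j\}$ form segmentations of $\textbf{s}$ and $\textbf{t}$ respectively. These phrase pairs $\{(\bar{s}_i,\bar{t}_j)\}$ are then said to form a \alert{phrase-based translation derivation} (derivation for short) from $\textbf{s}$ to $\textbf{t}$, written $d(\{(\bar{s}_i,\bar{t}_j)\},\textbf{s},\textbf{t})$ (abbreviated $d(\{(\bar{s}_i,\bar{t}_j)\})$ or $d$).
+For a source and target sentence pair ($\textbf{s}, \textbf{t}$) with phrase segmentations $\{\bar{s}_i\}$ and $\{\bar{t}_j\}$ respectively, suppose there is a one-to-one correspondence between $\{\bar{s}_i\}$ and $\{\bar{t}_j\}$. Let $\{\bar{a}_j\}$ denote, for each phrase in $\{\bar{t}_j\}$, the index of the source phrase it corresponds to. The phrase pairs $\{(\bar{s}_{\bar{a}_j},\bar{t}_j)\}$ are then said to form a \alert{phrase-based translation derivation} (derivation for short) from $\textbf{s}$ to $\textbf{t}$, written $d(\{(\bar{s}_{\bar{a}_j},\bar{t}_j)\},\textbf{s},\textbf{t})$ (abbreviated $d(\{(\bar{s}_{\bar{a}_j},\bar{t}_j)\})$ or $d$).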
 }
 \end{beamerboxesrounded}

@@ -1013,9 +1013,9 @@
 \path[<->, thick] (s2.south) edge (t2.north);
 \path[<->, thick] (s3.south) edge (t3.north);

-\node[anchor=south,inner sep=0pt,yshift=-0.3em] (sp1) at (s1.north) {\scriptsize{$\bar{s}_1$}};
-\node[anchor=south,inner sep=0pt,yshift=-0.3em] (sp2) at (s2.north) {\scriptsize{$\bar{s}_2$}};
-\node[anchor=south,inner sep=0pt,yshift=-0.3em] (sp3) at (s3.north) {\scriptsize{$\bar{s}_3$}};
+\node[anchor=south,inner sep=0pt,yshift=-0.3em] (sp1) at (s1.north) {\scriptsize{$\bar{s}_{a_1 = 1}$}};
+\node[anchor=south,inner sep=0pt,yshift=-0.3em] (sp2) at (s2.north) {\scriptsize{$\bar{s}_{a_2 = 2}$}};
+\node[anchor=south,inner sep=0pt,yshift=-0.3em] (sp3) at (s3.north) {\scriptsize{$\bar{s}_{a_3 = 3}$}};
 \node[anchor=north,inner sep=0pt,yshift=0.3em] (tp1) at (t1.south) {\scriptsize{$\bar{t}_1$}};
 \node[anchor=north,inner sep=0pt,yshift=0.3em] (tp2) at (t2.south) {\scriptsize{$\bar{t}_2$}};
 \node[anchor=north,inner sep=0pt,yshift=0.3em] (tp3) at (t3.south) {\scriptsize{$\bar{t}_3$}};
@@ -1027,11 +1027,11 @@
 \vspace{-1.0em}

 \begin{itemize}
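-\item $\{(\bar{s}_k,\bar{t}_k)\}$ forms a phrase-based translation derivation of $(\textbf{s},\textbf{t})$
+\item $\{(\bar{s}_{\bar{a}_j},\bar{t}_j)\}$ forms a phrase-based translation derivation of $(\textbf{s},\textbf{t})$
 \item Two questions that the model needs to describe:
    \begin{itemize}
-    \item How is $\bar{s}_k$ translated into $\bar{t}_k$?
-    \item How is the position of $\bar{t}_k$ in the target sentence determined?
+    \item How is $\bar{s}_{\bar{a}_j}$ translated into $\bar{t}_j$?
+    \item How is the translation order determined, i.e., how is $\{\bar{a}_j\}$ obtained?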
    \end{itemize}
 \end{itemize}

@@ -2527,7 +2527,7 @@ f_{\textrm{ME}}(d) = \prod_{<o,X_1,X_2> \in d} \Pr(o|X_1, X_2)
 \item Treat each sub-model as a feature, attach a weight to each one, and combine them with a log-linear model of the following form:
 \vspace{-0.8em}
 \begin{displaymath}
-\textrm{P}(d,\textbf{t}|\textbf{s}) \propto \exp(\sum_{i=1}^{M} \lambda_i \cdot h_i(d,\textbf{s},\textbf{t}))
+\textrm{P}(d,\textbf{t}|\textbf{s}) \propto \textrm{mscore}(d,\textbf{t}|\textbf{s}) = \exp(\sum_{i=1}^{M} \lambda_i \cdot h_i(d,\textbf{s},\textbf{t}))
 \end{displaymath}
 \vspace{-1.2em}
 	\begin{itemize}
@@ -2535,7 +2535,7 @@ f_{\textrm{ME}}(d) = \prod_{<o,X_1,X_2> \in d} \Pr(o|X_1, X_2)
 	\end{itemize}
 \vspace{0.8em}
 \begin{displaymath}
-\textrm{P}(d,\textbf{t}|\textbf{s}) = \prod_{(\bar{s},\bar{t}) \in d} \Pr(\bar{t}|\bar{s})^{\lambda_{1}} \times f(d)^{\lambda_{2}} \times \Pr\nolimits_{\textrm{lm}}(\mathbf{t})^{\lambda_{lm}}
+\textrm{mscore}(d,\textbf{t}|\textbf{s}) = \prod_{(\bar{s},\bar{t}) \in d} \Pr(\bar{t}|\bar{s})^{\lambda_{1}} \times f(d)^{\lambda_{2}} \times \Pr\nolimits_{\textrm{lm}}(\mathbf{t})^{\lambda_{lm}}
 \end{displaymath}
 \item More features can be introduced to improve translation quality (described next)
 \end{itemize}
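The log-linear score in this hunk can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `mscore` mirrors the slide's notation, but the list-based interface and the toy feature values are hypothetical and not taken from the slides.

```python
import math


def mscore(features, weights):
    """Unnormalized log-linear score exp(sum_i lambda_i * h_i).

    features -- feature values h_i(d, s, t) for one derivation
    weights  -- the matching feature weights lambda_i
    """
    return math.exp(sum(w * h for w, h in zip(weights, features)))


# With log-probabilities as features, the score equals a weighted product:
# exp(1.0 * log 0.5 + 0.5 * log 0.25) = 0.5 ** 1.0 * 0.25 ** 0.5 = 0.25
score = mscore([math.log(0.5), math.log(0.25)], [1.0, 0.5])
print(round(score, 6))  # 0.25
```

Because the normalizer depends only on the source sentence, comparing derivations of the same sentence by `mscore` gives the same ranking as comparing them by the normalized probability.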
@@ -2546,9 +2546,9 @@ f_{\textrm{ME}}(d) = \prod_{<o,X_1,X_2> \in d} \Pr(o|X_1, X_2)
 \begin{frame}{Features}
 % list of features
 \begin{itemize}
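-\item \textbf{Features 1-2: phrase translation probabilities}, i.e., the forward translation probability $\Pr(\bar{s}|\bar{t})$ and the reverse translation probability $\Pr(\bar{t}|\bar{s})$. These are the most important features in phrase-based statistical machine translation models.
-\item \textbf{Features 3-4: lexical translation probabilities}, i.e., the forward lexical translation probability $\Pr_{\textrm{lex}}(\bar{t}|\bar{s})$ and the reverse lexical translation probability $\Pr_{\textrm{lex}}(\bar{s}|\bar{t})$. They describe the correspondence between source-side and target-side words within a phrase pair.
-\item<2-> \textbf{Feature 5: $n$-gram language model}, i.e., $\textrm{P}_{\textrm{lm}}(\textbf{t})$. It measures the fluency of the translation and can be estimated from large-scale target-language monolingual data.
+\item \textbf{Features 1-2: phrase translation probabilities}, i.e., the forward translation probability $\log(\textrm{P}(\bar{s}|\bar{t}))$ and the reverse translation probability $\log(\textrm{P}(\bar{t}|\bar{s}))$. These are the most important features in phrase-based statistical machine translation models.
+\item \textbf{Features 3-4: lexical translation probabilities}, i.e., the forward lexical translation probability $\log(\textrm{P}_{\textrm{lex}}(\bar{t}|\bar{s}))$ and the reverse lexical translation probability $\log(\textrm{P}_{\textrm{lex}}(\bar{s}|\bar{t}))$. They describe the correspondence between source-side and target-side words within a phrase pair.
+\item<2-> \textbf{Feature 5: $n$-gram language model}, i.e., $\log(\textrm{P}_{\textrm{lm}}(\textbf{t}))$. It measures the fluency of the translation and can be estimated from large-scale target-language monolingual data.
 \item<2-> \textbf{Feature 6: translation length}, i.e., $|\textbf{t}|$. It keeps the model from favoring short translations and lets the system learn its preference for translation length automatically.
 \item<2-> \textbf{Feature 7: number of translation rules}. This feature keeps the model from building derivations out of only a few rules (translation probabilities are multiplied, so fewer factors generally give a larger product) and lets the system learn its preference for the number of rules automatically.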
 \end{itemize}
@@ -2567,9 +2567,9 @@ f_{\textrm{ME}}(d) = \prod_{<o,X_1,X_2> \in d} \Pr(o|X_1, X_2)
 \begin{center}
 \begin{tikzpicture}
 \begin{scope}[minimum height = 15pt]
-\node[anchor=west,minimum width=3em] (x1) at (0, 0) {\footnotesize{$\textrm{P}(d,\textbf{t}|\textbf{s}) = \prod_{(\bar{s},\bar{t}) \in d} score(\bar{s},\bar{t}) \times f_{\textrm{ME}}(d)^{\lambda_{ME}} \times f_{\textrm{MSD}}(d)^{\lambda_{MSD}} \times$}};
+\node[anchor=west,minimum width=3em] (x1) at (0, 0) {\footnotesize{$\textrm{mscore}(d,\textbf{t}|\textbf{s}) = \prod_{(\bar{s},\bar{t}) \in d} \textrm{pscore}(\bar{s},\bar{t}) \times f_{\textrm{ME}}(d)^{\lambda_{ME}} \times f_{\textrm{MSD}}(d)^{\lambda_{MSD}} \times$}};
 \node[anchor=north west] (x2) at ([xshift=4em,yshift=0.1em]x1.south west) {\footnotesize{$\Pr\nolimits_{\textrm{lm}}(\mathbf{t})^{\lambda_{lm}} \times \exp(\lambda_{TWB} \cdot length(\mathbf{t})) / Z(\mathbf{s})$}};
-\node[anchor=north west] (x3) at ([yshift=-1.8em]x1.south west) {\footnotesize{$score(\bar{s},\bar{t}) = \Pr(\bar{t}|\bar{s})^{\lambda_{1}} \times \Pr(\bar{s}|\bar{t})^{\lambda_{2}} \times \Pr\nolimits_{\textrm{lex}}(\bar{t}|\bar{s})^{\lambda_{3}} \times \Pr\nolimits_{\textrm{lex}}(\bar{s}|\bar{t})^{\lambda_{4}} \times$}};
+\node[anchor=north west] (x3) at ([yshift=-1.8em]x1.south west) {\footnotesize{$\textrm{pscore}(\bar{s},\bar{t}) = \Pr(\bar{t}|\bar{s})^{\lambda_{1}} \times \Pr(\bar{s}|\bar{t})^{\lambda_{2}} \times \Pr\nolimits_{\textrm{lex}}(\bar{t}|\bar{s})^{\lambda_{3}} \times \Pr\nolimits_{\textrm{lex}}(\bar{s}|\bar{t})^{\lambda_{4}} \times$}};
 \node[anchor=north west] (x4) at ([xshift=5em,yshift=0.1em]x3.south west) {\footnotesize{$\exp(\lambda_{PB}) \times \exp(\lambda_{WDB} \cdot \delta(\bar{s} \to null))$}};
 \end{scope}
 \end{tikzpicture}
@@ -2616,11 +2616,11 @@ d_{i}^{*} = \argmax_{d_{ij}} \sum_{k=1}^{M} \lambda_k \cdot h_k(d_{ij})
 \item How to obtain the optimal $\lambda^*$
 	\begin{itemize}
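 	\item The simplest method is to enumerate all possible values of $\lambda$, but this is very inefficient. It suffices to consider only the points at which the best translation changes.
-	\item For each training sample, suppose there is a 2-best list of derivations $\mathbf{d}=\{d_1,d_2\}$; the score modelscore($d$) of each derivation $d$ can be written as a function of the weight $\lambda_i$
+	\item For each training sample, suppose there is a 2-best list of derivations $\mathbf{d}=\{d_1,d_2\}$; the score score($d$) of each derivation $d$ can be written as a function of the weight $\lambda_i$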
 	\end{itemize}
 \vspace{0.2em}
 \begin{displaymath}
-\textrm{modelscore}(d) = \lambda_i \cdot h_i(d) + \sum_{k{\ne}i}^{M} \lambda_k \cdot h_k(d) = a \cdot \lambda_i + b
+\textrm{score}(d) = \lambda_i \cdot h_i(d) + \sum_{k{\ne}i}^{M} \lambda_k \cdot h_k(d) = a \cdot \lambda_i + b
 \end{displaymath}
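The linear form above suggests a simple way to find where the best derivation changes. The sketch below is illustrative only: the helper names `line_params` and `crossing_point` and the toy feature values are assumptions, not part of the slides.

```python
def line_params(features, weights, i):
    """Return (a, b) such that score(d) = a * lambda_i + b.

    a is h_i(d); b is the sum of lambda_k * h_k(d) over all k != i,
    i.e. every weight except lambda_i is held fixed.
    """
    a = features[i]
    b = sum(w * h for k, (w, h) in enumerate(zip(weights, features)) if k != i)
    return a, b


def crossing_point(line1, line2):
    """Value of lambda_i at which two derivations score the same, or None if parallel."""
    (a1, b1), (a2, b2) = line1, line2
    if a1 == a2:
        return None
    return (b2 - b1) / (a1 - a2)


weights = [0.5, 1.0]                       # lambda_0 is the weight being tuned
d1 = line_params([1.0, 2.0], weights, 0)   # score(d1) = 1.0 * lambda_0 + 2.0
d2 = line_params([3.0, 1.0], weights, 0)   # score(d2) = 3.0 * lambda_0 + 1.0
print(crossing_point(d1, d2))              # 0.5
```

Below the crossing point `d1` scores higher and above it `d2` does, so only such crossing points need to be examined when searching for the optimal value of the weight.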
 \vspace{-0.7em}
 \begin{center}
@@ -3680,17 +3680,12 @@ d = r_1 \circ r_2 \circ r_3 \circ r_4
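 % Again from David Chiang's paper
 \begin{itemize}
 \item Like the phrase-based model, the hierarchical phrase-based model is also a discriminative model - $\textrm{P}(d,\textbf{t}|\textbf{s}) = \frac{\exp(\sum_{i=1}^{M} \lambda_i \cdot h_i(d,\textbf{s},\textbf{t}))}{\sum_{d',t'}\exp(\sum_{i=1}^{M} \lambda_i \cdot h_i(d',\textbf{s},\textbf{t}'))}$. The feature weights $\{\lambda_i\}$ can be tuned with minimum error rate training, and the feature functions $\{h_i\}$ must be defined by the user.
-\item<2-> Here, all hierarchical phrase rules have the form $\langle\ \alpha, \beta, \sim\ \rangle$
+\item<2-> Here, all hierarchical phrase rules have the form $\textrm{LHS} \to \langle\ \alpha, \beta, \sim\ \rangle$
    \begin{itemize}
    \item $\alpha$ and $\beta$ are the source-side and target-side rule strings, and $\sim$ denotes the correspondence between them
-    \item In addition, define $\tau(\alpha)$ and $\tau(\beta)$ as the source-side and target-side rule sequences. For example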
-        \vspace{-0.8em}
-        \begin{eqnarray}
-        \tau(\alpha) & = & \textrm{对}\ \textrm{X}_1\ \textrm{感到}\ \textrm{X}_2 \nonumber \\
-        \tau(\beta) & = & \textrm{be}\ \textrm{X}_2\ \textrm{with}\ \textrm{X}_1 \nonumber
-        \end{eqnarray}
    \end{itemize}
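-\item<3-> \textbf{Features 1-2: phrase translation probabilities}, i.e., the forward translation probability $\textrm{P}(\tau(\alpha)|\tau(\beta))$ and the reverse translation probability $\textrm{P}(\tau(\alpha)|\tau(\beta))$. Here $\tau(\alpha)$ and $\tau(\beta)$ are both treated as phrases, so the method of the phrase-based system can be reused directly and the probabilities computed by maximum likelihood estimation.
+\item<3-> \textbf{Features 1-2: phrase translation probabilities}, i.e., the forward translation probability $\log(\textrm{P}(\alpha|\beta))$ and the reverse translation probability $\log(\textrm{P}(\beta|\alpha))$. Here $\alpha$ and $\beta$ are both treated as phrases, so the method of the phrase-based system can be reused directly and the probabilities computed by maximum likelihood estimation.
+\item<3-> \textbf{Features 3-4: lexical translation probabilities}, i.e., the forward lexical translation probability $\log(\textrm{P}_{lex}(\alpha|\beta))$ and the reverse lexical translation probability $\log(\textrm{P}_{lex}(\beta|\alpha))$. They describe the correspondence between source-side and target-side words within a phrase pair.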
 \end{itemize}
 \end{frame}

@@ -3699,11 +3694,11 @@ d = r_1 \circ r_2 \circ r_3 \circ r_4
 \begin{frame}{Features (continued)}
 % list of features
 \begin{itemize}
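-\item \textbf{Features 3-4: lexical translation probabilities}, i.e., the forward lexical translation probability $\Pr_{lex}(\bar{t}|\bar{s})$ and the reverse lexical translation probability $\Pr_{lex}(\bar{s}|\bar{t})$. They describe the correspondence between source-side and target-side words within a phrase pair.
-\item \textbf{Feature 5: $n$-gram language model}, i.e., $\textrm{P}_{\textrm{lm}}(\textbf{t})$. It measures the fluency of the translation and can be estimated from large-scale target-language monolingual data.
-\item<2-> \textbf{Feature 6: translation length}, i.e., $|\textbf{t}|$. It keeps the model from favoring short translations and lets the system learn its preference for translation length automatically.
+
+\item \textbf{Feature 5: $n$-gram language model}, i.e., $\log(\textrm{P}_{\textrm{lm}}(\textbf{t}))$. It measures the fluency of the translation and can be estimated from large-scale target-language monolingual data.
+\item \textbf{Feature 6: translation length}, i.e., $|\textbf{t}|$. It keeps the model from favoring short translations and lets the system learn its preference for translation length automatically.
 \item<2-> \textbf{Feature 7: number of translation rules}. This feature keeps the model from building derivations out of only a few rules (translation probabilities are multiplied, so fewer factors generally give a larger product) and lets the system learn its preference for the number of rules automatically.
-\item<2-> \textbf{Feature 8: number of source words translated to null}. Note that null-translation rules (or features) are sometimes called evil features: on some datasets they improve BLEU considerably but lower human evaluation scores, so they should be used with caution.
+\item<2-> \textbf{Feature 8: number of glue rules}. This feature lets the system control its preference for using glue rules.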
 \end{itemize}
 \end{frame}

@@ -3729,7 +3724,7 @@ d = r_1 \circ r_2 \circ r_3 \circ r_4
 \begin{center}
 \begin{tikzpicture}
 \node [anchor=south west,rectangle,draw=ublue,thick,inner sep=0.4em,fill=white,drop shadow] (sourceG) at (0,0) {{\color{ublue} \footnotesize{\textbf{Source-side grammar}}}};
-\node [anchor=west,rectangle,draw=ublue,thick,inner sep=0.4em,fill=white,drop shadow] (chom) at ([xshift=3.5em]sourceG.east) {{\color{ublue} \footnotesize{\textbf{Chomsky normal form}}}};
+\node [anchor=west,rectangle,draw=ublue,thick,inner sep=0.4em,fill=white,drop shadow] (chom) at ([xshift=3.5em]sourceG.east) {{\color{ublue} \footnotesize{\textbf{Chomsky normal form (CNF)}}}};
 \node [anchor=west,rectangle,draw=ublue,thick,inner sep=0.4em,fill=white,drop shadow] (targetG) at ([xshift=3.5em]chom.east) {{\color{ublue} \footnotesize{\textbf{Target-side grammar}}}};

 \draw[->,very thick] ([xshift=0.1em]sourceG.east) -- ([xshift=-0.1em]chom.west);
@@ -3745,7 +3740,7 @@ d = r_1 \circ r_2 \circ r_3 \circ r_4
 \end{tikzpicture}
 \end{center}
 \vspace{0.3em}
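-\item Because the nonterminals in the grammar are restricted, the CYK algorithm can be used directly for decoding without conversion to Chomsky normal form
+%\item Because the nonterminals in the grammar are restricted, the CYK algorithm can be used directly for decoding without conversion to Chomsky normal form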
 \end{itemize}
 \end{frame}

@@ -3756,28 +3751,29 @@ d = r_1 \circ r_2 \circ r_3 \circ r_4
 \begin{itemize}
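 \item The CYK algorithm decides whether a string conforms to a grammar by traversing different \alert{span}s
 	\begin{itemize}
-	\item Input: source string \textbf{s =} $s_1 ... s_J$ and a context-free grammar $G$
-	\item Output: whether the string conforms to the context-free grammar
+	\item Input: source string \textbf{s =} $s_1 ... s_J$ and a CNF grammar $G$
+	\item Output: whether the string conforms to $G$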
 	\end{itemize}
-%\vspace{-0.5em}
+\vspace{-0.3em}
 \begin{center}
 \begin{tikzpicture}
 \tikzstyle{alignmentnode} = [rectangle,fill=blue!30,minimum size=0.45em,text=white,inner sep=0.1pt]
 \tikzstyle{selectnode} = [rectangle,fill=green!20,minimum height=1.5em,minimum width=1.5em,inner sep=1.2pt]
 \tikzstyle{srcnode} = [anchor=south west]
 \begin{scope}[scale=0.85]
-\node[srcnode] (c1) at (0,0) {\small{\textbf{Function} CKY-Algorithm($s,G$)}};
-\node[srcnode,anchor=north west] (c21) at ([xshift=2em,yshift=0.4em]c1.south west) {\small{\textbf{foreach} ($j_1, j_2$): 1$ \leq j_1 \leq J$ and 1$ \leq j_2 \leq J$}};
-\node[srcnode,anchor=north west] (c22) at ([xshift=2em,yshift=0.4em]c21.south west) {\small{Initialize $cell[j_1,j_2 ]$}};
-\node[srcnode,anchor=north west] (c3) at ([xshift=-2em,yshift=0.4em]c22.south west) {\small{\textbf{for} $j_1$ = 1 to $J$}};
-\node[srcnode,anchor=west] (c31) at ([xshift=5em]c3.east) {\small{// beginning of span}};
-\node[srcnode,anchor=north west] (c4) at ([xshift=2em,yshift=0.4em]c3.south west) {\small{\textbf{for} $j_2$ = $j_1$ to $J$}};
-\node[srcnode,anchor=north west] (c41) at ([yshift=0.4em]c31.south west) {\small{// ending of span}};
-\node[srcnode,anchor=north west] (c5) at ([xshift=2em,yshift=0.4em]c4.south west) {\small{\textbf{for} $k$ = $j_1$ to $j_2$}};
+
+\node[srcnode] (c1) at (0,0) {\small{\textbf{Function} CYK-Algorithm($\textbf{s},G$)}};
+\node[srcnode,anchor=north west] (c21) at ([xshift=1.5em,yshift=0.4em]c1.south west) {\small{\textbf{for} $j=0$ to $ J - 1$}};
+\node[srcnode,anchor=north west] (c22) at ([xshift=1.5em,yshift=0.4em]c21.south west) {\small{$span[j,j+1 ]$.Add($A \to a \in G$)}};
+\node[srcnode,anchor=north west] (c3) at ([xshift=-1.5em,yshift=0.4em]c22.south west) {\small{\textbf{for} $l$ = 1 to $J$}};
+\node[srcnode,anchor=west] (c31) at ([xshift=6em]c3.east) {\small{// length of span}};
+\node[srcnode,anchor=north west] (c4) at ([xshift=1.5em,yshift=0.4em]c3.south west) {\small{\textbf{for} $j$ = 0 to $J-l$}};
+\node[srcnode,anchor=north west] (c41) at ([yshift=0.4em]c31.south west) {\small{// beginning of span}};
+\node[srcnode,anchor=north west] (c5) at ([xshift=1.5em,yshift=0.4em]c4.south west) {\small{\textbf{for} $k$ = $j$ to $j+l$}};
 \node[srcnode,anchor=north west] (c51) at ([yshift=0.4em]c41.south west) {\small{// partition of span}};
-\node[srcnode,anchor=north west] (c6) at ([xshift=2em,yshift=0.4em]c5.south west) {\small{$hypos$ = Compose($cell[j_1, k], cell[k, j_2]$)}};
-\node[srcnode,anchor=north west] (c7) at ([yshift=0.4em]c6.south west) {\small{$cell[j_1, j_2]$.update($hypos$)}};
-\node[srcnode,anchor=north west] (c8) at ([xshift=-6em,yshift=0.4em]c7.south west) {\small{\textbf{return} $cell[1, J]$}};
+\node[srcnode,anchor=north west] (c6) at ([xshift=1.5em,yshift=0.4em]c5.south west) {\small{$hypos$ = Compose($span[j, k], span[k, j+l]$)}};
+\node[srcnode,anchor=north west] (c7) at ([yshift=0.4em]c6.south west) {\small{$span[j, j+l]$.Update($hypos$)}};
+\node[srcnode,anchor=north west] (c8) at ([xshift=-4.5em,yshift=0.4em]c7.south west) {\small{\textbf{return} $span[0, J]$}};


 \node[srcnode] (s1) at ([yshift=-2.5em]c8.south west) {\textbf{s:}};
@@ -3789,13 +3785,13 @@ d = r_1 \circ r_2 \circ r_3 \circ r_4
 \node[srcnode] (s7) at ([xshift=1em]s6.south east) {$s_6$};
 \node[srcnode] (s8) at ([xshift=1em]s7.south east) {$s_7$};

-\node[srcnode,anchor=center] (j1) at ([yshift=-1.4em]s3.south) {$j_1$};
-\node[srcnode,anchor=center] (j2) at ([yshift=-1.4em]s7.south) {$j_2$};
+\node[srcnode,anchor=center] (j1) at ([yshift=-1.4em,xshift=-0.8em]s3.south) {$j$};
+\node[srcnode,anchor=center] (j2) at ([yshift=-1.4em,xshift=0.8em]s7.south) {$j+l$};

 \node[srcnode,anchor=center] (k) at ([xshift=1.5em,yshift=-1.5em]s4.south) {$k$};

-\draw[->,thick] ([yshift=-0.1em]j1.north)--([yshift=0.1em]s3.south);
-\draw[->,thick] ([yshift=-0.1em]j2.north)--([yshift=0.1em]s7.south);
+\draw[->,thick] ([yshift=-0.3em,xshift=-0.8em]j1.north)--([yshift=0.5em,xshift=-0.8em]j1.north);
+\draw[->,thick] ([yshift=-0.3em,xshift=0.8em]j2.north)--([yshift=0.5em,xshift=0.8em]j2.north);
 \draw[->,thick] ([yshift=-0.1em]k.north)--([xshift=1.5em,yshift=0.1em]s4.south);

 \node [rectangle,inner sep=0.3em,rounded corners=1pt,very thick,dotted,draw=ugreen] [fit = (s3) (s7)] (box1) {};
@@ -3954,10 +3950,10 @@ d = r_1 \circ r_2 \circ r_3 \circ r_4
 \begin{frame}{CYK解码（续）}
 % 看NiuTrans Manual
 \begin{itemize}
-\item CYK解码提出了一种cell的数据结构，用来记录所有可能出现的翻译假设。
+\item 实际上，在层次短语解码的时候，不能直接使用CYK算法，需要先把文法转化为乔姆斯基范式，才能进行解码。
    \begin{itemize}
-    \item<2-> 对于每个源语句子，使用短语规则表初始化它的cell
-    \item<3-> 自底向上对cell中的每个子cell进行重新组合（正向、反向）
+    \item<2-> 对于每个源语句子，使用短语规则表初始化它的span
+    \item<3-> 自底向上对span中的每个子span进行重新组合（正、反向）
    \item<4-> 计算每个推导的得分并记录下来，最终选择最优推导所对应的译文作为输出
    \end{itemize}
 \end{itemize}
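
The span-based loop in the CYK pseudocode of this commit can be sketched in Python as below. This is a minimal recognizer, not the decoder itself: it assumes the grammar is already in Chomsky normal form, it only records which non-terminals cover each span, and it does no scoring or translation. The names `cyk`, `lexical_rules`, `binary_rules` and the toy grammar are hypothetical. The split point `k` is kept strictly inside the span so that both halves are non-empty.

```python
from collections import defaultdict


def cyk(words, lexical_rules, binary_rules):
    """Fill a chart where span[(j, j + l)] holds the non-terminals covering words[j:j + l].

    lexical_rules: list of (A, a) pairs for rules A -> a
    binary_rules:  list of (A, (B, C)) pairs for rules A -> B C
    """
    J = len(words)
    span = defaultdict(set)

    # Initialise every length-1 span with the matching rules A -> a.
    for j in range(J):
        for lhs, word in lexical_rules:
            if word == words[j]:
                span[(j, j + 1)].add(lhs)

    # Build longer spans bottom-up from shorter ones.
    for l in range(2, J + 1):              # length of span
        for j in range(0, J - l + 1):      # beginning of span
            for k in range(j + 1, j + l):  # partition of span
                for lhs, (left, right) in binary_rules:
                    if left in span[(j, k)] and right in span[(k, j + l)]:
                        span[(j, j + l)].add(lhs)
    return span


# Toy CNF grammar, invented for illustration only.
lexical = [("NP", "she"), ("V", "eats"), ("NP", "fish")]
binary = [("VP", ("V", "NP")), ("S", ("NP", "VP"))]

chart = cyk(["she", "eats", "fish"], lexical, binary)
print("S" in chart[(0, 3)])  # True
```

A decoder would replace the set of non-terminals in each span with scored translation hypotheses, which corresponds to the Compose and Update steps of the pseudocode.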
@@ -4143,10 +4139,20 @@ d = r_1 \circ r_2 \circ r_3 \circ r_4
 \visible<4->{
 \node [anchor=center,selectnode] (c2) at (alig11.center) {\footnotesize{5.1}};
 \node [anchor=center,selectnode] (c3) at (alig2.center) {\footnotesize{5.5}};
-\node [anchor=center,selectnode,fill=red!20] (c4) at (alig12.center) {\footnotesize{8.2}};
 \node [anchor=center,selectnode,fill=red!20] (c5) at (alig21.center) {\footnotesize{8.5}};
 \node [anchor=center,selectnode,fill=red!20] (c6) at (alig3.center) {\footnotesize{7.7}};
 }
+
+\visible<5->{
+\node [anchor=center,selectnode] (c5) at (alig21.center) {\footnotesize{8.5}};
+\node [anchor=center,selectnode] (c6) at (alig3.center) {\footnotesize{7.7}};
+\node [anchor=center,selectnode,fill=red!20] (c7) at (alig22.center) {\footnotesize{4.2}};
+\node [anchor=center,selectnode,fill=red!20] (c8) at (alig31.center) {\footnotesize{8.2}};
+}
+
+\draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=-1.0em,yshift=-0.7em]alig4.south west);
+\draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=0.8em,yshift=1.0em]alig13.north east);
+
 \end{scope}

 \end{tikzpicture}
@@ -4492,15 +4498,15 @@ d = r_1 \circ r_2 \circ r_3 \circ r_4
 术语 & 说明 \\ \hline
 翻译规则 & 翻译的最小单元(或步骤) \\ \hline
 推导 & 由一系列规则组成的分析或翻译过程，推导可以 \\
-     & 被看做是规则的序列 \\ \hline
+     & 被看作是规则的序列 \\ \hline
 规则表 & 翻译规则的存储表示形式，可以高效进行查询\\ \hline
 层次短语模型 & 基于同步上下文无关文法的翻译模型，非终结符\\
-             & 只有S和X两种，规则和文法并不需要符合语言学\\
+             & 只有S和X两种，文法并不需要符合语言学\\
             & 句法约束\\ \hline
 树到串模型 & 一类翻译模型，它使用源语语言学句法树，因此\\
-           & 翻译可以被看做从一棵句法树到词串的转换\\ \hline
+           & 翻译可以被看作是从一棵句法树到词串的转换\\ \hline
 串到树模型 & 一类翻译模型，它使用目标语语言学句法树，因\\
-           & 此翻译可以被看做从词串到句法树的转换\\
+           & 此翻译可以被看作是从词串到句法树的转换\\
 \end{tabular}
 \end{center}
 }
@@ -4517,7 +4523,7 @@ d = r_1 \circ r_2 \circ r_3 \circ r_4
 \begin{tabular}{l | l}
 术语 & 说明 \\ \hline
 树到树模型 & 一类翻译模型，它同时使用源语和目标语语言学\\
-           & 句法树，因此此翻译可以被看做从句法树到句法\\
+           & 句法树，因此翻译可以被看作是从句法树到句法
           & 树的转换 \\ \hline
 基于句法 & 使用语言学句法 \\ \hline
 基于树 & (源语言)使用树结构(大多指句法树)\\ \hline
@@ -4741,10 +4747,10 @@ $x$表示叶子非终结符(可替换的变量)，显然这是调序规则
 \vspace{-1em}
 \begin{eqnarray}
 \langle\ \textrm{VP}, \textrm{VP}\ \rangle & \to & \langle\ \textrm{VP(PP}_{\alert{1}}\ \textrm{VP(VV(表示) NN}_{\alert{2}})), \nonumber \\
-& & \ \ \textrm{VP(VBZ(was) VP(VBZ}_{\alert{2}}\ \textrm{PP}_{\alert{1}}))\ \rangle \nonumber
+& & \ \ \textrm{VP(VBZ(was) VP(VBN}_{\alert{2}}\ \textrm{PP}_{\alert{1}}))\ \rangle \nonumber
 \end{eqnarray}

-其中变量的对应关系用下标数字表示，比如：$\textrm{PP}_1 \leftrightarrow \textrm{PP}_1$，$\textrm{NN}_2 \leftrightarrow \textrm{VBZ}_2$
+其中变量的对应关系用下标数字表示，比如：$\textrm{PP}_1 \leftrightarrow \textrm{PP}_1$，$\textrm{NN}_2 \leftrightarrow \textrm{VBN}_2$

 \item<2-> 在这个规则的树结构中，每个叶子非终结符本质上定义了一个变量，这个节点也被称作边缘节点(frontier node)。边缘节点可以被其它树结构替换，组合为更大的树结构，这个操作被称作组合(composition) 或树替换

@@ -4789,7 +4795,7 @@ $x$表示叶子非终结符(可替换的变量)，显然这是调序规则
 \begin{itemize}
 \item 规则的推导描述双语句子同步生成(和分析)的过程
    \begin{itemize}
-    \item \raisebox{0.3em}{\tikz{\draw[*-*] (0,0)--(0.5,0);}}表示对边缘阶段(变量)的替换操作
+    \item \raisebox{0.3em}{\tikz{\draw[*-*] (0,0)--(0.5,0);}}表示对变量的替换操作
    \end{itemize}
 \end{itemize}

@@ -4979,7 +4985,7 @@ $x$表示叶子非终结符(可替换的变量)，显然这是调序规则
 \alpha_h & = & \textrm{VP} \nonumber \\
 \beta_h & = & \textrm{VP}\ (=\alpha_h) \nonumber \\
 \alpha_r & = & \textrm{VP(VV(提高) NN:}x) \nonumber \\
-\beta_r & = & \textrm{increases\ NN:}x \nonumber \\
+\beta_r & = & \textrm{VP(increases\ NN:}x) \nonumber \\
 \sim & = & \{1-1\} \nonumber
 \end{eqnarray}
 }
@@ -5130,7 +5136,7 @@ $\textrm{VP(VV(提高) NN}_1) \to \textrm{increases\ NN}_1$ \\

 \begin{beamerboxesrounded}[upper=uppercolblue,lower=lowercolblue,shadow=true]{定义 - Span}
 {\small
-源语树节点$n$的Span是它所对应到到目标语的第一个单词和最后一个单词所构成的索引范围
+对于一个源语言句法树节点，它的Span是这个节点所对应到目标语的第一个单词和最后一个单词所构成的索引范围
 }
 \end{beamerboxesrounded}

@@ -5205,7 +5211,7 @@ $\textrm{VP(VV(提高) NN}_1) \to \textrm{increases\ NN}_1$ \\

 \begin{beamerboxesrounded}[upper=uppercolblue,lower=lowercolblue,shadow=true]{定义 - Complement Span}
 {\small
-源语树节点$n$的Complement Span是除了它的祖先和子孙阶段外的其它节点Span的并集
+对于一个源语言句法树节点，它的Complement Span是除了它的祖先和子孙节点外的其它节点Span的并集
 }
 \end{beamerboxesrounded}

@@ -5272,7 +5278,7 @@ $\textrm{VP(VV(提高) NN}_1) \to \textrm{increases\ NN}_1$ \\

 \begin{beamerboxesrounded}[upper=uppercolblue,lower=lowercolblue,shadow=true]{定义 - 可信节点(Admissible Node)}
 {\small
-对于源语树节点$n$，如果他的Span和Complement Span不相交，节点$n$就是一个可信节点，否则是一个不可信节点
+对于源语言树节点$n$，如果它的Span和Complement Span不相交，节点$n$就是一个可信节点，否则是一个不可信节点
 }
 \end{beamerboxesrounded}
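
The three definitions above (Span, Complement Span, admissible node) can be combined into a small check, sketched here in Python. Several things are assumptions rather than part of the slides: the `Node` class and its `aligned` field are a hypothetical input format, node labels are assumed to be unique, and unaligned nodes are simply reported as not admissible. The sketch also stores each Span as the set of aligned target indices and widens it to a closed index range only when testing for overlap, which is how the GHKM frontier test is usually stated.

```python
class Node:
    def __init__(self, label, children=(), aligned=()):
        self.label = label            # assumed unique, used as a dictionary key
        self.children = list(children)
        self.aligned = set(aligned)   # target indices aligned to this leaf


def aligned_indices(node):
    """Target indices aligned to any source word under this node."""
    if not node.children:
        return set(node.aligned)
    indices = set()
    for child in node.children:
        indices |= aligned_indices(child)
    return indices


def descendants(node):
    found = []
    for child in node.children:
        found.append(child)
        found.extend(descendants(child))
    return found


def admissible_nodes(root):
    """Map each node label to True if its Span and Complement Span do not intersect."""
    all_nodes = [root] + descendants(root)
    result = {}

    def visit(node, ancestors):
        # Ancestors, descendants and the node itself are excluded from the complement.
        related = {id(n) for n in ancestors + descendants(node) + [node]}
        complement = set()
        for other in all_nodes:
            if id(other) not in related:
                complement |= aligned_indices(other)
        own = aligned_indices(node)
        if own:
            closed = set(range(min(own), max(own) + 1))  # first to last target word
            result[node.label] = not (closed & complement)
        else:
            result[node.label] = False
        for child in node.children:
            visit(child, ancestors + [node])

    visit(root, [])
    return result


# Toy tree S(X(a b) c) with a->0, b->2, c->1 on the target side.
a, b, c = Node("a", aligned=[0]), Node("b", aligned=[2]), Node("c", aligned=[1])
tree = Node("S", [Node("X", [a, b]), c])
flags = admissible_nodes(tree)
print(flags)
```

In this toy tree the node X covers target positions 0 to 2, but position 1 belongs to c, which lies outside X, so X is the only node that is not admissible.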

@@ -6309,13 +6315,13 @@ NP-BAR(NN$_1$ NP-BAR$_2$) $\to$ NN$_1$ NP-BAR$_2$
 \vspace{-1.3em}
 \begin{eqnarray}
 \langle\ \textrm{VP}, \textrm{VP}\ \rangle & \to & \langle\ \textrm{VP(PP}_{1}\ \textrm{VP(VV(表示) NN}_{2})), \nonumber \\
-& & \ \ \textrm{VP(VBZ(was) VP(VBZ}_{2}\ \textrm{PP}_{1}))\ \rangle \nonumber
+& & \ \ \textrm{VP(VBZ(was) VP(VBN}_{2}\ \textrm{PP}_{1}))\ \rangle \nonumber
 \end{eqnarray}
 表示为\alert{树片段到树片段}的映射形式\\
 \vspace{-1.3em}
 \begin{eqnarray}
 & & \textrm{VP(PP}_{1}\ \textrm{VP(VV(表示) NN}_{2})) \nonumber \\
-& \to & \textrm{VP(VBZ(was) VP(VBZ}_{2}\ \textrm{PP}_{1})) \nonumber
+& \to & \textrm{VP(VBZ(was) VP(VBN}_{2}\ \textrm{PP}_{1})) \nonumber
 \end{eqnarray}

 \item<2-> 可以通过扩展GHKM方法进行树到树规则抽取
@@ -6709,9 +6715,9 @@ NP-BAR(NN$_1$ NP-BAR$_2$) $\to$ NN$_1$ NP-BAR$_2$
 %%%  翻译特征(续)
 \begin{frame}{特征(续)}
 \begin{itemize}
-\item \textbf{特征1-2： 短语翻译概率}，即正向翻译概率$\textrm{P}(\tau(\beta_r)|\tau(\alpha_r))$和反向翻译概率$\textrm{P}(\tau(\alpha_r)|\tau(\beta_r))$。这里，$\tau(\alpha_r)$和$\tau(\beta_r)$ 都被看做短语，因此可以直接复用短语系统的方法进行计算。
-\item \textbf{特征3-4： 词汇翻译概率}，即$\textrm{P}_{\textrm{lex}}(\tau(\beta_r)|\tau(\alpha_r))$和$\textrm{P}_{\textrm{lex}}(\tau(\alpha_r)|\tau(\beta_r))$。可以用短语系统中的词汇翻译概率描述源语和目标语单词对应的情况。
-\item<2-> \textbf{特征5： $n$-gram语言模型}，即$\textrm{P}_{\textrm{lm}}(\textbf{t})$。度量译文的流畅度，可以使用大规模目标语单语数据得到。
+\item \textbf{特征1-2： 短语翻译概率}，即正向翻译概率$\log(\textrm{P}(\tau(\beta_r)|\tau(\alpha_r)))$和反向翻译概率$\log(\textrm{P}(\tau(\alpha_r)|\tau(\beta_r)))$。这里，$\tau(\alpha_r)$ 和$\tau(\beta_r)$ 都被看作短语，因此可以直接复用短语系统的方法进行计算。
+\item \textbf{特征3-4： 词汇翻译概率}，即$\log(\textrm{P}_{\textrm{lex}}(\tau(\beta_r)|\tau(\alpha_r)))$和$\log(\textrm{P}_{\textrm{lex}}(\tau(\alpha_r)|\tau(\beta_r)))$。可以用短语系统中的词汇翻译概率描述源语和目标语单词对应的情况。
+\item<2-> \textbf{特征5： $n$-gram语言模型}，即$\log(\textrm{P}_{\textrm{lm}}(\textbf{t}))$。度量译文的流畅度，可以使用大规模目标语单语数据得到。
 \item<2-> \textbf{特征6：译文长度}，即$|\textbf{t}|$。避免模型倾向于短译文，同时让系统自动学习对译文长度的偏好。
 \item<2-> \textbf{特征7：翻译规则数量}。这个特征是为了避免模型仅仅使用少量规则构成翻译推导(因为翻译概率相乘，因子少结果一般会大一些)，同时让系统自动学习对使用规则数量的偏好。
 \end{itemize}
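
A minimal sketch of how such features could be combined into one derivation score, assuming the usual log-linear form (a weighted sum of feature values). The feature names, values and weights below are invented for illustration; real weights would come from tuning. Because the probability features are already in log space, multiplying probabilities becomes adding feature values, and the length and rule-count features enter as plain counts.

```python
import math


def derivation_score(features, weights):
    """Weighted sum of feature values (log-linear model score)."""
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical feature values for one derivation.
features = {
    "log_p_fwd": math.log(0.5),   # log P(target | source)
    "log_p_lm": math.log(0.25),   # log language model probability
    "length": 4.0,                # translation length |t|
    "rule_count": 2.0,            # number of rules in the derivation
}

# Hypothetical weights; in a real system these are learned on tuning data.
weights = {"log_p_fwd": 1.0, "log_p_lm": 0.5, "length": 0.1, "rule_count": -0.2}

score = derivation_score(features, weights)
print(round(score, 4))  # -1.3863
```

With these particular numbers the length bonus and the rule-count penalty cancel out, so the score equals log(0.5) + 0.5 * log(0.25) = log(0.25).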
@@ -6722,7 +6728,7 @@ NP-BAR(NN$_1$ NP-BAR$_2$) $\to$ NN$_1$ NP-BAR$_2$
 \begin{frame}{特征(续2)}
 \begin{itemize}
 \item \textbf{特征8：源语言被翻译为空的单词数量}。注意，空翻译规则(或特征)有时也被称作evil feature，这类特征在一些数据集上对BLEU有很好的提升作用，但是会造成人工评价的下降，因此需要谨慎使用。
-\item<2-> \textbf{特征9： 翻译规则生成概率}，即$\textrm{P}_{\textrm{rule}}(\alpha_r,\beta_r,\sim|\alpha_h,\beta_h)$。这个特征可以被看做是生成翻译推导的概率。
+\item<2-> \textbf{特征9： 翻译规则生成概率}，即$\log(\textrm{P}_{\textrm{rule}}(\alpha_r,\beta_r,\sim|\alpha_h,\beta_h))$。这个特征可以被看作是生成翻译推导的概率。
 \item<2-> \textbf{特征10：组合规则的数量}。学习使用组合规则(或最小规则)的偏好。
 \item<2-> \textbf{特征11：词汇化规则的数量}。学习使用含有终结符规则的偏好。
 \item<2-> \textbf{特征12：低频规则的数量}。学习使用训练数据中出现频次低于3的规则的偏好。低频规则大多并不可靠，这个特征本质上也是为了区分不同质量规则。
@@ -6757,7 +6763,7 @@ NP-BAR(NN$_1$ NP-BAR$_2$) $\to$ NN$_1$ NP-BAR$_2$
 %%%  基于树的解码 vs 基于串的解码
 \begin{frame}{基于树的解码 vs 基于串的解码}
 \begin{itemize}
-\item 前面的公式本质上描述了一种基于串的解码，即对输入的源语言句子通过句法模型进行翻译，得到译文串。不过，搜索所有的推导导致巨大的解码空间。对于树到串和树到树翻译来说，源语言句法树是可见的，因此可以使用另一种解码方法 - 基于树的解码，即把输出入的源语句法树翻译为目标语串\\
+\item 前面的公式本质上描述了一种基于串的解码，即对输入的源语言句子通过句法模型进行翻译，得到译文串。不过，搜索所有的推导会导致巨大的解码空间。对于树到串和树到树翻译来说，源语言句法树是可见的，因此可以使用另一种解码方法 - 基于树的解码，即把输入的源语句法树翻译为目标语串\\
 \end{itemize}

 \centering
@@ -7121,7 +7127,7 @@ NP-BAR(NN$_1$ NP-BAR$_2$) $\to$ NN$_1$ NP-BAR$_2$
 \item 不同于基于树的解码，\alert{基于串的解码}方法并不要求输入句法树，它直接对输入词串进行翻译，最终得到译文。
    \begin{itemize}
    \item 这种方法适用于树到串、串到树、树到树等多种模型
-    \item 本质上，由于并不受固定输入的句法树约束，基于串的解码可以探索更多潜在的树结构，这也增大了搜索空间(相比基于串的解码)，因此该方法更有可能找到高质量翻译结果
+    \item 本质上，由于并不受固定输入的句法树约束，基于串的解码可以探索更多潜在的树结构，这也增大了搜索空间(相比基于树的解码)，因此该方法更有可能找到高质量翻译结果
    \end{itemize}
 \item<2-> 在基于串的方法中，句法结构被看作是翻译的隐含变量，而非线性的输入和输出。比如，层次短语翻译解码就是一种典型的基于串的解码方法，所有的翻译推导在翻译过程里动态生成，但是并不要求输入或者输出这些推导所对应的层次结构
 \end{itemize}