合并分支 'caorunzhe' 到 'master'

Caorunzhe 查看合并请求 !285

合并分支 'caorunzhe' 到 'master'
Caorunzhe 查看合并请求 !285
db6488c4 · 曹润柘 · 728ae6a4 · 7f08f351 · db6488c4 · db6488c4
Commit db6488c4 authored Sep 28, 2020 by 曹润柘
--- a/Chapter2/Figures/figure-full-probability-word-segmentation-3.tex
+++ b/Chapter2/Figures/figure-full-probability-word-segmentation-3.tex
@@ -17,7 +17,7 @@



-\node [anchor=north west] (label11) at ([xshift=18.0em,yshift=1.63em]label1.south west) {更多数据-总词数:100K $\sim$ 1M};
+\node [anchor=north west] (label11) at ([xshift=18.0em,yshift=1.63em]label1.south west) {更多数据-总词数:1百万个词};
 \node [anchor=north west] (p12) at (label11.south west) {$\funp{P}(\textrm{很})=0.000010$};
 \node [anchor=north west] (p22) at (p12.south west) {$\funp{P}(\textrm{。})=0.001812$};
 \node [anchor=north west] (p32) at (p22.south west) {$\funp{P}(\textrm{确实})=0.000001$};

--- a/Chapter2/chapter2.tex
+++ b/Chapter2/chapter2.tex
@@ -43,7 +43,7 @@
 \subsection{随机变量和概率}
 \parinterval 在自然界中，很多{\small\bfnew{事件}}\index{事件}（Event）\index{Event}是否会发生是不确定的。例如，明天会下雨、掷一枚硬币是正面朝上、扔一个骰子的点数是1等。这些事件可能会发生也可能不会发生。通过大量的重复试验，能发现具有某种规律性的事件叫做{\small\sffamily\bfseries{随机事件}}\index{随机事件}。

-\parinterval {\small\sffamily\bfseries{随机变量}}\index{随机变量}（Random Variable）\index{Random Variable}是对随机事件发生可能状态的描述，是随机事件的数量表征。设$\Omega = \{ \omega \}$为一个随机试验的样本空间，$X=X(\omega)$就是定义在样本空间$\Omega$上的单值实数函数，即$X=X(\omega)$为随机变量，记为$X$。随机变量是一种能随机选取数值的变量，常用大写的英语字母或希腊字母表示，其取值通常用小写字母来表示。例如，用$A$ 表示一个随机变量，用$a$表示变量$A$的一个取值。根据随机变量可以选取的值的某些性质，可以将其划分为离散变量和连续变量。
+\parinterval {\small\sffamily\bfseries{随机变量}}\index{随机变量}（Random Variable）\index{Random Variable}是对随机事件发生可能状态的描述，是随机事件的数量表征。设$\varOmega = \{ \omega \}$为一个随机试验的样本空间，$X=X(\omega)$就是定义在样本空间$\varOmega$上的单值实数函数，即$X=X(\omega)$为随机变量，记为$X$。随机变量是一种能随机选取数值的变量，常用大写的英语字母或希腊字母表示，其取值通常用小写字母来表示。例如，用$A$ 表示一个随机变量，用$a$表示变量$A$的一个取值。根据随机变量可以选取的值的某些性质，可以将其划分为离散变量和连续变量。

 \parinterval 离散变量是在其取值区间内可以被一一列举、总数有限并且可计算的数值变量。例如，用随机变量$X$代表某次投骰子出现的点数，点数只可能取1$\sim$6这6个整数，$X$就是一个离散变量。

@@ -546,7 +546,7 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \label{eq:2-26}
 \end{eqnarray}

-\parinterval 显然，这个结果是不合理的。因为即使语料中没有 “确实”和“现在”两个词连续出现，这种搭配也是客观存在的。这时简单地用极大似然估计得到概率却是0，导致整个句子出现的概率为0。 更常见的问题是那些根本没有出现在词表中的词，称为{\small\sffamily\bfseries{未登录词}}\index{未登录词}（Out-of-vocabulary Word，OOV Word）\index{Out-of-vocabulary Word，OOV Word}，比如一些生僻词，可能模型训练阶段从来没有看到过，这时模型仍然会给出0 概率。图\ref{fig:2-11}展示了一个真实语料库中词语出现频次的分布，可以看到绝大多数词都是低频词。
+\parinterval 显然，这个结果是不合理的。因为即使语料中没有 “确实”和“现在”两个词连续出现，这种搭配也是客观存在的。这时简单地用极大似然估计得到概率却是0，导致整个句子出现的概率为0。 更常见的问题是那些根本没有出现在词表中的词，称为{\small\sffamily\bfseries{未登录词}}\index{未登录词}（Out-Of-Vocabulary Word，OOV Word）\index{Out-Of-Vocabulary Word，OOV Word}，比如一些生僻词，可能模型训练阶段从来没有看到过，这时模型仍然会给出0概率。图\ref{fig:2-11}展示了一个真实语料库中词语出现频次的分布，可以看到绝大多数词都是低频词。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -735,7 +735,7 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\
 \end{array}\right.
 \label{eq:2-41}
 \end{eqnarray}
-\noindent 其中catcount$(\cdot)$表示的是单词$w_i$作为n-gram中第n个词时$w_{i-n+1} \ldots w_i$的种类数目。
+\noindent 其中catcount$(\cdot)$表示的是单词$w_i$作为$n$-gram中第$n$个词时$w_{i-n+1} \ldots w_i$的种类数目。

 \parinterval Kneser-Ney平滑是很多语言模型工具的基础\upcite{heafield2011kenlm,stolcke2002srilm}。还有很多以此为基础衍生出来的算法，感兴趣的读者可以通过参考文献自行了解\upcite{parsing2009speech,ney1994structuring,chen1999empirical}。

@@ -1046,7 +1046,7 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\
 \vspace{0.5em}
 \item 本章更多地关注了语言模型的基本问题和求解思路，但是基于$n$-gram的方法并不是语言建模的唯一方法。从现在自然语言处理的前沿看，端到端的深度学习方法在很多任务中都取得了领先的性能。语言模型同样可以使用这些方法\upcite{jing2019a}，而且在近些年取得了巨大成功。例如，最早提出的前馈神经语言模型\upcite{bengio2003a}和后来的基于循环单元的语言模型\upcite{mikolov2010recurrent}、基于长短期记忆单元的语言模型\upcite{sundermeyer2012lstm}以及现在非常流行的Transformer\upcite{vaswani2017attention}。 关于神经语言模型的内容，会在{\chapternine}进行进一步介绍。
 \vspace{0.5em}
-\item 最后，本章结合语言模型的序列生成任务对搜索技术进行了介绍。类似地，机器翻译任务也需要从大量的翻译候选中快速寻找最优译文。因此在机器翻译任务中也使用了搜索方法，这个过程通常被称作{\small\bfnew{解码}}\index{解码}（Decoding）\index{Decoding}。例如，有研究者在基于词的翻译模型中尝试使用启发式搜索\upcite{DBLP:conf/acl/OchUN01,DBLP:conf/acl/WangW97,tillmann1997a}以及贪婪搜索方法\upcite{germann2001fast}\upcite{germann2003greedy}，也有研究者探索基于短语的栈解码方法\upcite{Koehn2007Moses,DBLP:conf/amta/Koehn04}。此外，解码方法还包括有限状态机解码\upcite{bangalore2001a}\upcite{DBLP:journals/mt/BangaloreR02}以及基于语言学约束的解码\upcite{venugopal2007an,zollmann2007the,liu2006tree,galley2006scalable,chiang2005a}。相关内容将在{\chaptereight}和{\chapterfourteen}进行介绍。
+\item 最后，本章结合语言模型的序列生成任务对搜索技术进行了介绍。类似地，机器翻译任务也需要从大量的翻译候选中快速寻找最优译文。因此在机器翻译任务中也使用了搜索方法，这个过程通常被称作解码。例如，有研究者在基于词的翻译模型中尝试使用启发式搜索\upcite{DBLP:conf/acl/OchUN01,DBLP:conf/acl/WangW97,tillmann1997a}以及贪婪搜索方法\upcite{germann2001fast}\upcite{germann2003greedy}，也有研究者探索基于短语的栈解码方法\upcite{Koehn2007Moses,DBLP:conf/amta/Koehn04}。此外，解码方法还包括有限状态机解码\upcite{bangalore2001a}\upcite{DBLP:journals/mt/BangaloreR02}以及基于语言学约束的解码\upcite{venugopal2007an,zollmann2007the,liu2006tree,galley2006scalable,chiang2005a}。相关内容将在{\chaptereight}和{\chapterfourteen}进行介绍。
 \vspace{0.5em}
 \end{itemize}
 \end{adjustwidth}
--- a/Chapter5/Figures/figure-a-more-detailed-explanation-of-formula-3.40.tex
+++ b/Chapter5/Figures/figure-a-more-detailed-explanation-of-formula-3.40.tex
@@ -46,7 +46,7 @@
 {
 \draw[decorate,thick,decoration={brace,amplitude=5pt,mirror}] ([yshift=-0.2em]eq5.south west) -- ([yshift=-0.2em]eq6.south east) node [pos=0.4,below,xshift=-0.0em,yshift=-0.3em] (expcount1) {\footnotesize{{``$t_v$翻译为$s_u$''这个事件}}};
 \node [anchor=north west] (expcount2) at ([yshift=0.5em]expcount1.south west) {\footnotesize{{出现次数的期望的估计}}};
-\node [anchor=north west] (expcount3) at ([yshift=0.5em]expcount2.south west) {\footnotesize{{称之为期望频次}}（Expected Count）};
+\node [anchor=north west] (expcount3) at ([yshift=0.5em]expcount2.south west) {\footnotesize{{称之为期望频次}}};
 }

 \end{tikzpicture}

--- a/Chapter5/Figures/figure-calculation-of-the-expected-frequency-2.tex
+++ b/Chapter5/Figures/figure-calculation-of-the-expected-frequency-2.tex
 \qquad
 \begin{tabular}{cccc}
-\multicolumn{1}{c|}{$x_i$} & c($x_i$) & P($x_i$) & $c(x_i)\cdot$P($x_i$) \\ \hline
+\multicolumn{1}{c|}{$x_i$} & c($x_i$) & ($\funp{P}x_i$) & $c(x_i)\cdot$($\funp{P}x_i$) \\ \hline
 \multicolumn{1}{c|}{$x_1$} & 2        & 0.1      & 0.2                   \\
 \multicolumn{1}{c|}{$x_2$} & 1        & 0.3      & 0.3                   \\
 \multicolumn{1}{c|}{$x_3$} & 5        & 0.2      & 1.0                   \\ \hline

--- a/Chapter5/Figures/figure-em-algorithm-flow-chart.tex
+++ b/Chapter5/Figures/figure-em-algorithm-flow-chart.tex
@@ -9,15 +9,15 @@
 \node [anchor=north west] (line2) at ([yshift=-0.3em]line1.south west) {输入: 平行语料${(\seq{s}^{[1]},\seq{t}^{[1]}),...,(\seq{s}^{[K]},\seq{t}^{[K]})}$};
 \node [anchor=north west] (line3) at ([yshift=-0.1em]line2.south west) {输出: 参数$f(\cdot|\cdot)$的最优值};
 \node [anchor=north west] (line4) at ([yshift=-0.1em]line3.south west) {1: \textbf{Function} \textsc{EM}($\{(\seq{s}^{[1]},\seq{t}^{[1]}),...,(\seq{s}^{[K]},\seq{t}^{[K]})\}$) };
-\node [anchor=north west] (line5) at ([yshift=-0.1em]line4.south west) {2: \ \ Initialize $f(\cdot|\cdot)$ \hspace{5em} $\rhd$ 比如给$f(\cdot|\cdot)$一个均匀分布};
-\node [anchor=north west] (line6) at ([yshift=-0.1em]line5.south west) {3: \ \ Loop until $f(\cdot|\cdot)$ converges};
-\node [anchor=north west] (line7) at ([yshift=-0.1em]line6.south west) {4: \ \ \ \ \textbf{foreach} $k = 1$ to $K$ \textbf{do}};
-\node [anchor=north west] (line8) at ([yshift=-0.1em]line7.south west) {5: \ \ \ \ \ \ \ \footnotesize{$c_{\mathbb{E}}(\seq{s}_u|\seq{t}_v;\seq{s}^{[k]},\seq{t}^{[k]}) = \sum\limits_{j=1}^{|\seq{s}^{[k]}|} \delta(s_j,s_u) \sum\limits_{i=0}^{|\seq{t}^{[k]}|} \delta(t_i,t_v) \cdot \frac{f(s_u|t_v)}{\sum_{i=0}^{l}f(s_u|t_i)}$}\normalsize{}};
-\node [anchor=north west] (line9) at ([yshift=-0.1em]line8.south west) {6: \ \ \ \ \textbf{foreach} $t_v$ appears at least one of $\{\seq{t}^{[1]},...,\seq{t}^{[K]}\}$ \textbf{do}};
-\node [anchor=north west] (line10) at ([yshift=-0.1em]line9.south west) {7: \ \ \ \ \ \ \ $\lambda_{t_v}^{'} = \sum_{s_u} \sum_{k=1}^{K} c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]})$};
-\node [anchor=north west] (line11) at ([yshift=-0.1em]line10.south west) {8: \ \ \ \ \ \ \ \textbf{foreach} $s_u$ appears at least one of $\{\seq{s}^{[1]},...,\seq{s}^{[K]}\}$ \textbf{do}};
-\node [anchor=north west] (line12) at ([yshift=-0.1em]line11.south west) {9: \ \ \ \ \ \ \ \ \ $f(s_u|t_v) = \sum_{k=1}^{K} c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]}) \cdot (\lambda_{t_v}^{'})^{-1}$};
-\node [anchor=north west] (line13) at ([yshift=-0.1em]line12.south west) {10: \ \textbf{return} $f(\cdot|\cdot)$};
+\node [anchor=north west] (line5) at ([yshift=-0.1em]line4.south west) {2: \quad Initialize $f(\cdot|\cdot)$ \hspace{3em} $\rhd$ 比如给$f(\cdot|\cdot)$一个均匀分布};
+\node [anchor=north west] (line6) at ([yshift=-0.1em]line5.south west) {3: \quad Loop until $f(\cdot|\cdot)$ converges};
+\node [anchor=north west] (line7) at ([yshift=-0.1em]line6.south west) {4: \quad \quad  \textbf{foreach} $k = 1$ to $K$ \textbf{do}};
+\node [anchor=north west] (line8) at ([yshift=-0.1em]line7.south west) {5: \quad \quad \quad \footnotesize{$c_{\mathbb{E}}(\seq{s}_u|\seq{t}_v;\seq{s}^{[k]},\seq{t}^{[k]}) = \sum\limits_{j=1}^{|\seq{s}^{[k]}|} \delta(s_j,s_u) \sum\limits_{i=0}^{|\seq{t}^{[k]}|} \delta(t_i,t_v) \cdot \frac{f(s_u|t_v)}{\sum_{i=0}^{l}f(s_u|t_i)}$}\normalsize{}};
+\node [anchor=north west] (line9) at ([yshift=-0.1em]line8.south west) {6: \quad \quad \textbf{foreach} $t_v$ appears at least one of $\{\seq{t}^{[1]},...,\seq{t}^{[K]}\}$ \textbf{do}};
+\node [anchor=north west] (line10) at ([yshift=-0.1em]line9.south west) {7: \quad \quad \quad $\lambda_{t_v}^{'} = \sum_{s_u} \sum_{k=1}^{K} c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]})$};
+\node [anchor=north west] (line11) at ([yshift=-0.1em]line10.south west) {8: \quad \quad \quad \textbf{foreach} $s_u$ appears at least one of $\{\seq{s}^{[1]},...,\seq{s}^{[K]}\}$ \textbf{do}};
+\node [anchor=north west] (line12) at ([yshift=-0.1em]line11.south west) {9: \quad \quad \quad \quad $f(s_u|t_v) = \sum_{k=1}^{K} c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]}) \cdot (\lambda_{t_v}^{'})^{-1}$};
+\node [anchor=north west] (line13) at ([yshift=-0.1em]line12.south west) {10: \textbf{return} $f(\cdot|\cdot)$};

 \begin{pgfonlayer}{background}
 {

--- a/Chapter5/Figures/figure-greedy-mt-decoding-process-1.tex
+++ b/Chapter5/Figures/figure-greedy-mt-decoding-process-1.tex
@@ -19,7 +19,7 @@
 \node [anchor=west] (s4) at ([xshift=2.5em]s3.east) {{感到}};
 \node [anchor=west] (s5) at ([xshift=2.5em]s4.east) {{满意}};

-\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子(已经分词)}}};
+\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子（已经分词）}}};
 {
 \draw [->,very thick,ublue] ([yshift=0.2em]s1.south) -- ([yshift=-0.8em]s1.south) node [pos=0.5,right] (pi1) {\tiny{$\pi$(1)}};
 \draw [->,very thick,ublue] ([yshift=0.2em]s2.south) -- ([yshift=-0.8em]s2.south) node [pos=0.5,right] (pi2) {\tiny{$\pi$(2)}};
@@ -102,7 +102,7 @@
 \node [anchor=west] (hlabel) at ([yshift=-2.5em]jlabel.west) {\scriptsize{$i = 1, j = 1$}};
 }
 {\tiny
-\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\mathbf{s},\mathbf{t})$};
+\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\seq{s},\seq{t})$};
 \node [anchor=west] (translabel) at (glabel.east) {翻译结果};
 \draw [-] (glabel.north east) -- ([yshift=-1.5in]glabel.north east);
 \draw [-] (glabel.south west) -- ([xshift=3.5in]glabel.south west);
@@ -124,7 +124,7 @@
 \node [anchor=west] (s4) at ([xshift=2.5em]s3.east) {{感到}};
 \node [anchor=west] (s5) at ([xshift=2.5em]s4.east) {{满意}};

-\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子(已经分词)}}};
+\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子（已经分词）}}};

 {
 \draw [->,very thick,ublue] ([yshift=0.2em]s1.south) -- ([yshift=-0.8em]s1.south) node [pos=0.5,right] (pi1) {\tiny{$\pi$(1)}};
@@ -227,7 +227,7 @@
 }

 {\tiny%下面的表格
-\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\mathbf{s},\mathbf{t})$};
+\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\seq{s},\seq{t})$};
 \node [anchor=west] (translabel) at (glabel.east) {翻译结果};
 \draw [-] (glabel.north east) -- ([yshift=-1.5in]glabel.north east);
 \draw [-] (glabel.south west) -- ([xshift=3.5in]glabel.south west);

--- a/Chapter5/Figures/figure-greedy-mt-decoding-process-3.tex
+++ b/Chapter5/Figures/figure-greedy-mt-decoding-process-3.tex
@@ -14,7 +14,7 @@
 \node [anchor=west] (s4) at ([xshift=2.5em]s3.east) {{感到}};
 \node [anchor=west] (s5) at ([xshift=2.5em]s4.east) {{满意}};

-\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子(已经分词)}}};
+\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子（已经分词）}}};

 {
 \draw [->,very thick,ublue] ([yshift=0.2em]s1.south) -- ([yshift=-0.8em]s1.south) node [pos=0.5,right] (pi1) {\tiny{$\pi$(1)}};
@@ -119,7 +119,7 @@
 }

 {\tiny%下面的表格
-\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\mathbf{s},\mathbf{t})$};
+\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\seq{s},\seq{t})$};
 \node [anchor=west] (translabel) at (glabel.east) {翻译结果};
 \draw [-] (glabel.north east) -- ([yshift=-2.0in]glabel.north east);
 \draw [-] (glabel.south west) -- ([xshift=3.5in]glabel.south west);
@@ -179,7 +179,7 @@
 \node [anchor=west] (s4) at ([xshift=2.5em]s3.east) {{感到}};
 \node [anchor=west] (s5) at ([xshift=2.5em]s4.east) {{满意}};

-\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子(已经分词)}}};
+\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子（已经分词）}}};

 {
 \draw [->,very thick,ublue] ([yshift=0.2em]s1.south) -- ([yshift=-0.8em]s1.south) node [pos=0.5,right] (pi1) {\tiny{$\pi$(1)}};
@@ -277,7 +277,7 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 {\tiny%下面的表格
-\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\mathbf{s},\mathbf{t})$};
+\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\seq{s},\seq{t})$};
 \node [anchor=west] (translabel) at (glabel.east) {翻译结果};
 \draw [-] (glabel.north east) -- ([yshift=-2.0in]glabel.north east);
 \draw [-] (glabel.south west) -- ([xshift=3.5in]glabel.south west);

--- a/Chapter5/Figures/figure-human-translation.tex
+++ b/Chapter5/Figures/figure-human-translation.tex
@@ -10,7 +10,7 @@
 \node [anchor=west] (s3) at ([xshift=2em]s2.east) {{你}};
 \node [anchor=west] (s4) at ([xshift=2em]s3.east) {{感到}};
 \node [anchor=west] (s5) at ([xshift=2em]s4.east) {{满意}};
-\node [anchor=south west] (sentlabel) at ([yshift=-0.5em]s1.north west) {\scriptsize{\sffamily\bfseries{待翻译句子(已经分词):}}};
+\node [anchor=south west] (sentlabel) at ([yshift=-0.5em]s1.north west) {\scriptsize{\sffamily\bfseries{待翻译句子（已经分词）:}}};

 {
 \draw [->,very thick,ublue] (s1.south) -- ([yshift=-0.7em]s1.south);

--- a/Chapter5/Figures/figure-process-of-machine-translation.tex
+++ b/Chapter5/Figures/figure-process-of-machine-translation.tex
@@ -184,14 +184,14 @@
 \draw[decorate,thick,decoration={brace,amplitude=5pt}] ([yshift=8em,xshift=2.0em]t53.south east) -- ([xshift=2.0em]t53.south east) node [pos=0.5,right,xshift=0.5em,yshift=2.0em] (label2) {\footnotesize{{从双语数}}};
 \node [anchor=north west] (label2part2) at ([yshift=0.3em]label2.south west) {\footnotesize{{据中自动}}};
 \node [anchor=north west] (label2part3) at ([yshift=0.3em]label2part2.south west) {\footnotesize{{学习词典}}};
-\node [anchor=north west] (label2part4) at ([yshift=0.3em]label2part3.south west) {\footnotesize{{(训练)}}};
+\node [anchor=north west] (label2part4) at ([yshift=0.3em]label2part3.south west) {\footnotesize{{（训练）}}};
 }

 {
 \draw[decorate,thick,decoration={brace,amplitude=5pt}] ([yshift=-1.0em,xshift=6.2em]t53.south west) -- ([yshift=-10.5em,xshift=6.2em]t53.south west) node [pos=0.5,right,xshift=0.5em,yshift=2.0em] (label3) {\footnotesize{{利用概率}}};
 \node [anchor=north west] (label3part2) at ([yshift=0.3em]label3.south west) {\footnotesize{{化的词典}}};
 \node [anchor=north west] (label3part3) at ([yshift=0.3em]label3part2.south west) {\footnotesize{{进行翻译}}};
-\node [anchor=north west] (label3part4) at ([yshift=0.3em]label3part3.south west) {\footnotesize{{(解码)}}};
+\node [anchor=north west] (label3part4) at ([yshift=0.3em]label3part3.south west) {\footnotesize{{（解码）}}};
 }
 \end{scope}


--- a/Chapter5/chapter5.tex
+++ b/Chapter5/chapter5.tex
@@ -138,7 +138,7 @@ IBM模型由Peter F. Brown等人于上世纪九十年代初提出\upcite{DBLP:jo

 \parinterval 对于第二个问题，尽管机器能够找到很多译文选择路径，但它并不知道哪些路径是好的。说地再直白一些，简单地枚举路径实际上就是一个体力活，没有太多的智能。因此计算机还需要再聪明一些，运用它的能够“掌握”的知识判断翻译结果的好与坏。这一步是最具挑战的，当然也有很多思路。在统计机器翻译中，这个问题被定义为：设计一种统计模型，它可以给每个译文一个可能性，而这个可能性越高表明译文越接近人工翻译。

-\parinterval 如图\ref{fig:5-4}所示，每个单词翻译候选的右侧黑色框里的数字就是单词的翻译概率，使用这些单词的翻译概率，可以得到整句译文的概率（用符号P表示）。这样，就用概率化的模型描述了每个翻译候选的可能性。基于这些翻译候选的可能性，机器翻译系统可以对所有的翻译路径进行打分，比如，图\ref{fig:5-4}中第一条路径的分数为0.042，第二条是0.006，以此类推。最后，系统可以选择分数最高的路径作为源语言句子的最终译文。
+\parinterval 如图\ref{fig:5-4}所示，每个单词翻译候选的右侧黑色框里的数字就是单词的翻译概率，使用这些单词的翻译概率，可以得到整句译文的概率（用符号$\funp{P}$表示）。这样，就用概率化的模型描述了每个翻译候选的可能性。基于这些翻译候选的可能性，机器翻译系统可以对所有的翻译路径进行打分，比如，图\ref{fig:5-4}中第一条路径的分数为0.042，第二条是0.006，以此类推。最后，系统可以选择分数最高的路径作为源语言句子的最终译文。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -279,7 +279,7 @@ $\seq{t}$ = machine\; \underline{translation}\; is\; a\; process\; of\; generati
 \label{eq:5-4}
 \end{eqnarray}

-\parinterval 与公式\ref{eq:5-1}相比，公式\ref{eq:5-4}的分子、分母都多了一项累加符号$\sum_{k=1}^{K} \cdot$，它表示遍历语料库中所有的句对。换句话说，当计算词的共现次数时，需要对每个句对上的计数结果进行累加。从统计学习的角度，使用更大规模的数据进行参数估计可以提高结果的可靠性。计算单词的翻译概率也是一样，在小规模的数据上看，很多翻译现象的特征并不突出，但是当使用的数据量增加到一定程度，翻译的规律会很明显的体现出来。
+\parinterval 与公式\eqref{eq:5-1}相比，公式\eqref{eq:5-4}的分子、分母都多了一项累加符号$\sum_{k=1}^{K} \cdot$，它表示遍历语料库中所有的句对。换句话说，当计算词的共现次数时，需要对每个句对上的计数结果进行累加。从统计学习的角度，使用更大规模的数据进行参数估计可以提高结果的可靠性。计算单词的翻译概率也是一样，在小规模的数据上看，很多翻译现象的特征并不突出，但是当使用的数据量增加到一定程度，翻译的规律会很明显的体现出来。

 \parinterval 举个例子，实例\ref{eg:5-2}展示了一个由两个句对构成的平行语料库。

@@ -306,7 +306,7 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?
 \label{eq:5-5}
 \end{eqnarray}
 }
-\parinterval 公式\ref{eq:5-5}所展示的计算过程很简单，分子是两个句对中“翻译”和“translation”共现次数的累计，分母是两个句对的源语言单词和目标语言单词的组合数的累加。显然，这个方法也很容易推广到处理更多句子的情况。
+\parinterval 公式\eqref{eq:5-5}所展示的计算过程很简单，分子是两个句对中“翻译”和“translation”共现次数的累计，分母是两个句对的源语言单词和目标语言单词的组合数的累加。显然，这个方法也很容易推广到处理更多句子的情况。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -323,14 +323,14 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?

 \subsubsection{1. 基础模型}

-\parinterval 计算句子级翻译概率并不简单。因为自然语言非常灵活，任何数据无法覆盖足够多的句子，因此，无法像公式\ref{eq:5-4}一样直接用简单计数的方式对句子的翻译概率进行估计。这里，采用一个退而求其次的方法：找到一个函数$g(\seq{s},\seq{t})\ge 0$来模拟翻译概率对译文可能性进行估计。可以定义一个新的函数$g(\seq{s},\seq{t})$，令其满足：给定$\seq{s}$，翻译结果$\seq{t}$出现的可能性越大，$g(\seq{s},\seq{t})$的值越大；$\seq{t}$出现的可能性越小，$g(\seq{s},\seq{t})$的值越小。换句话说，$g(\seq{s},\seq{t})$和翻译概率$\funp{P}(\seq{t}|\seq{s})$呈正相关。如果存在这样的函数$g(\seq{s},\seq{t}
+\parinterval 计算句子级翻译概率并不简单。因为自然语言非常灵活，任何数据无法覆盖足够多的句子，因此，无法像公式\eqref{eq:5-4}一样直接用简单计数的方式对句子的翻译概率进行估计。这里，采用一个退而求其次的方法：找到一个函数$g(\seq{s},\seq{t})\ge 0$来模拟翻译概率对译文可能性进行估计。可以定义一个新的函数$g(\seq{s},\seq{t})$，令其满足：给定$\seq{s}$，翻译结果$\seq{t}$出现的可能性越大，$g(\seq{s},\seq{t})$的值越大；$\seq{t}$出现的可能性越小，$g(\seq{s},\seq{t})$的值越小。换句话说，$g(\seq{s},\seq{t})$和翻译概率$\funp{P}(\seq{t}|\seq{s})$呈正相关。如果存在这样的函数$g(\seq{s},\seq{t}
 )$，可以利用$g(\seq{s},\seq{t})$近似表示$\funp{P}(\seq{t}|\seq{s})$，如下：
 \begin{eqnarray}
 \funp{P}(\seq{t}|\seq{s})  \equiv  \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t}'}g(\seq{s},\seq{t}')}
 \label{eq:5-6}
 \end{eqnarray}

-\parinterval 公式\ref{eq:5-6}相当于在函数$g(\cdot)$上做了归一化，这样等式右端的结果具有一些概率的属性，比如，$0 \le \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t'}}g(\seq{s},\seq{t'})} \le 1$。具体来说，对于源语言句子$\seq{s}$，枚举其所有的翻译结果，并把所对应的函数$g(\cdot)$相加作为分母，而分子是某个翻译结果$\seq{t}$所对应的$g(\cdot)$的值。
+\parinterval 公式\eqref{eq:5-6}相当于在函数$g(\cdot)$上做了归一化，这样等式右端的结果具有一些概率的属性，比如，$0 \le \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t'}}g(\seq{s},\seq{t'})} \le 1$。具体来说，对于源语言句子$\seq{s}$，枚举其所有的翻译结果，并把所对应的函数$g(\cdot)$相加作为分母，而分子是某个翻译结果$\seq{t}$所对应的$g(\cdot)$的值。

 \parinterval 上述过程初步建立了句子级翻译模型，并没有直接求$\funp{P}(\seq{t}|\seq{s})$，而是把问题转化为对$g(\cdot)$的设计和计算上。但是，面临着两个新的问题：

@@ -338,7 +338,7 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?
 \vspace{0.5em}
 \item 如何定义函数$g(\seq{s},\seq{t})$？即，在知道单词翻译概率的前提下，如何计算$g(\seq{s},\seq{t})$；
 \vspace{0.5em}
-\item 公式\ref{eq:5-6}中分母$\sum_{seq{t'}}g(\seq{s},{\seq{t}'})$需要累加所有翻译结果的$g(\seq{s},{\seq{t}'})$，但枚举所有${\seq{t}'}$是不现实的。
+\item 公式\eqref{eq:5-6}中分母$\sum_{seq{t'}}g(\seq{s},{\seq{t}'})$需要累加所有翻译结果的$g(\seq{s},{\seq{t}'})$，但枚举所有${\seq{t}'}$是不现实的。
 \vspace{0.5em}
 \end{itemize}

@@ -381,7 +381,7 @@ g(\seq{s},\seq{t}) = \prod_{(j,i)\in \widehat{A}}\funp{P}(s_j,t_i)

 \subsubsection{2. 生成流畅的译文}

-\parinterval 公式\ref{eq:5-7}定义的$g(\seq{s},\seq{t})$存在的问题是没有考虑词序信息。这里用一个简单的例子说明这个问题。如图\ref{fig:5-8}所示，源语言句子“我 对 你 感到 满意”有两个翻译结果，第一个翻译结果是“I am satisfied with you”，第二个是“I with you am satisfied”。虽然这两个译文包含的目标语单词是一样的，但词序存在很大差异。比如，它们都选择了“satisfied”作为源语单词“满意”的译文，但是在第一个翻译结果中“satisfied”处于第3个位置，而第二个结果中处于最后的位置。显然第一个翻译结果更符合英语的表达习惯，翻译的质量更高。遗憾的是，对于有明显差异的两个译文，公式\ref{eq:5-7}计算得到的函数$g(\cdot)$的值却是一样的。
+\parinterval 公式\eqref{eq:5-7}定义的$g(\seq{s},\seq{t})$存在的问题是没有考虑词序信息。这里用一个简单的例子说明这个问题。如图\ref{fig:5-8}所示，源语言句子“我 对 你 感到 满意”有两个翻译结果，第一个翻译结果是“I am satisfied with you”，第二个是“I with you am satisfied”。虽然这两个译文包含的目标语单词是一样的，但词序存在很大差异。比如，它们都选择了“satisfied”作为源语单词“满意”的译文，但是在第一个翻译结果中“satisfied”处于第3个位置，而第二个结果中处于最后的位置。显然第一个翻译结果更符合英语的表达习惯，翻译的质量更高。遗憾的是，对于有明显差异的两个译文，公式\eqref{eq:5-7}计算得到的函数$g(\cdot)$的值却是一样的。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -403,13 +403,13 @@ g(\seq{s},\seq{t}) = \prod_{(j,i)\in \widehat{A}}\funp{P}(s_j,t_i)

 \noindent  其中，$\seq{t}=t_1...t_l$表示由$l$个单词组成的句子，$\funp{P}_{\textrm{lm}}(\seq{t})$表示语言模型给句子$\seq{t}$的打分。具体而言，$\funp{P}_{\textrm{lm}}(\seq{t})$被定义为$\funp{P}(t_i|t_{i-1})(i=1,2,...,l)$的连乘\footnote{为了确保数学表达的准确性，本书中定义$\funp{P}(t_1|t_0) \equiv \funp{P}(t_1)$}，其中$\funp{P}(t_i|t_{i-1})(i=1,2,...,l)$表示前面一个单词为$t_{i-1}$时，当前单词为$t_i$的概率。语言模型的训练方法可以参看{\chaptertwo}相关内容。

-\parinterval 回到建模问题上来。既然语言模型可以帮助系统度量每个译文的流畅度，那么可以使用它对翻译进行打分。一种简单的方法是把语言模型$\funp{P}_{\textrm{lm}}{(\seq{t})}$ 和公式\ref{eq:5-7}中的$g(\seq{s},\seq{t})$相乘，这样就得到了一个新的$g(\seq{s},\seq{t})$，它同时考虑了翻译准确性（$\prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)}$）和流畅度（$\funp{P}_{\textrm{lm}}(\seq{t})$）:
+\parinterval 回到建模问题上来。既然语言模型可以帮助系统度量每个译文的流畅度，那么可以使用它对翻译进行打分。一种简单的方法是把语言模型$\funp{P}_{\textrm{lm}}{(\seq{t})}$ 和公式\eqref{eq:5-7}中的$g(\seq{s},\seq{t})$相乘，这样就得到了一个新的$g(\seq{s},\seq{t})$，它同时考虑了翻译准确性（$\prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)}$）和流畅度（$\funp{P}_{\textrm{lm}}(\seq{t})$）:
 \begin{eqnarray}
 g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times  \funp{P}_{\textrm{lm}}(\seq{t})
 \label{eq:5-10}
 \end{eqnarray}

-\parinterval 如图\ref{fig:5-9}所示，语言模型$\funp{P}_{\textrm{lm}}(\seq{t})$分别给$\seq{t}^{'}$和$\seq{t}^{”}$赋予0.0107和0.0009的概率，这表明句子$\seq{t}^{'}$更符合英文的表达，这与期望是相吻合的。它们再分别乘以$\prod_{j,i \in \widehat{A}}{\funp{P}(s_j},t_i)$的值，就得到公式\ref{eq:5-10}定义的函数$g(\cdot)$的值。显然句子$\seq{t}^{'}$的分数更高。至此，完成了对函数$g(\seq{s},\seq{t})$的一个简单定义，把它带入公式\ref{eq:5-6}就得到了同时考虑准确性和流畅性的句子级统计翻译模型。
+\parinterval 如图\ref{fig:5-9}所示，语言模型$\funp{P}_{\textrm{lm}}(\seq{t})$分别给$\seq{t}^{'}$和$\seq{t}^{”}$赋予0.0107和0.0009的概率，这表明句子$\seq{t}^{'}$更符合英文的表达，这与期望是相吻合的。它们再分别乘以$\prod_{j,i \in \widehat{A}}{\funp{P}(s_j},t_i)$的值，就得到公式\eqref{eq:5-10}定义的函数$g(\cdot)$的值。显然句子$\seq{t}^{'}$的分数更高。至此，完成了对函数$g(\seq{s},\seq{t})$的一个简单定义，把它带入公式\eqref{eq:5-6}就得到了同时考虑准确性和流畅性的句子级统计翻译模型。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -434,19 +434,19 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \label{eq:5-11}
 \end{eqnarray}

-\noindent  其中$\argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})$表示找到使$\funp{P}(\seq{t}|\seq{s})$达到最大时的译文$\seq{t}$。结合上一小节中关于$\funp{P}(\seq{t}|\seq{s})$的定义，把公式\ref{eq:5-6}带入公式\ref{eq:5-11}得到：
+\noindent  其中$\argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})$表示找到使$\funp{P}(\seq{t}|\seq{s})$达到最大时的译文$\seq{t}$。结合上一小节中关于$\funp{P}(\seq{t}|\seq{s})$的定义，把公式\eqref{eq:5-6}带入公式\eqref{eq:5-11}得到：
 \begin{eqnarray}
 \widehat{\seq{t}}=\argmax_{\seq{t}}\frac{g(\seq{s},\seq{t})}{\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}
 \label{eq:5-12}
 \end{eqnarray}

-\parinterval 在公式\ref{eq:5-12}中，可以发现${\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}$是一个关于$\seq{s}$的函数，当给定源语句$\seq{s}$时，它是一个常数，而且$g(\cdot) \ge 0$，因此${\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}$不影响对$\widehat{\seq{t}}$的求解，也不需要计算。基于此，公式\ref{eq:5-12}可以被化简为：
+\parinterval 在公式\eqref{eq:5-12}中，可以发现${\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}$是一个关于$\seq{s}$的函数，当给定源语句$\seq{s}$时，它是一个常数，而且$g(\cdot) \ge 0$，因此${\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}$不影响对$\widehat{\seq{t}}$的求解，也不需要计算。基于此，公式\eqref{eq:5-12}可以被化简为：
 \begin{eqnarray}
 \widehat{\seq{t}}=\argmax_{\seq{t}}g(\seq{s},\seq{t})
 \label{eq:5-13}
 \end{eqnarray}

-\parinterval 公式\ref{eq:5-13}定义了解码的目标，剩下的问题是实现$\argmax$，以快速准确地找到最佳译文$\widehat{\seq{t}}$。但是，简单遍历所有可能的译文并计算$g(\seq{s},\seq{t})$ 的值是不可行的，因为所有潜在译文构成的搜索空间是十分巨大的。为了理解机器翻译的搜索空间的规模，假设源语言句子$\seq{s}$有$m$个词，每个词有$n$个可能的翻译候选。如果从左到右一步步翻译每个源语言单词，那么简单的顺序翻译会有$n^m$种组合。如果进一步考虑目标语单词的任意调序，每一种对翻译候选进行选择的结果又会对应$m!$种不同的排序。因此，源语句子$\seq{s}$至少有$n^m \cdot m!$ 个不同的译文。
+\parinterval 公式\eqref{eq:5-13}定义了解码的目标，剩下的问题是实现$\argmax$，以快速准确地找到最佳译文$\widehat{\seq{t}}$。但是，简单遍历所有可能的译文并计算$g(\seq{s},\seq{t})$ 的值是不可行的，因为所有潜在译文构成的搜索空间是十分巨大的。为了理解机器翻译的搜索空间的规模，假设源语言句子$\seq{s}$有$m$个词，每个词有$n$个可能的翻译候选。如果从左到右一步步翻译每个源语言单词，那么简单的顺序翻译会有$n^m$种组合。如果进一步考虑目标语单词的任意调序，每一种对翻译候选进行选择的结果又会对应$m!$种不同的排序。因此，源语句子$\seq{s}$至少有$n^m \cdot m!$ 个不同的译文。

 \parinterval $n^{m}\cdot m!$是什么样的概念呢？如表\ref{tab:5-2}所示，当$m$和$n$分别为2和10时，译文只有200个，不算多。但是当$m$和$n$分别为20和10时，即源语言句子的长度20，每个词有10个候选译文，系统会面对$2.4329 \times 10^{38}$个不同的译文，这几乎是不可计算的。

@@ -497,7 +497,6 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
    \centering
 \subfigure{\input{./Chapter5/Figures/figure-greedy-mt-decoding-process-1}}
 \subfigure{\input{./Chapter5/Figures/figure-greedy-mt-decoding-process-3}}
-%\setlength{\belowcaptionskip}{2.0em}
    \caption{贪婪的机器翻译解码过程实例}
    \label{fig:5-11}
 \end{figure}
@@ -547,14 +546,14 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \label{eq:5-14}
 \end{eqnarray}

-\parinterval 公式\ref{eq:5-14}的核心内容之一是定义$\funp{P}(\seq{t}|\seq{s})$。在IBM模型中，可以使用贝叶斯准则对$\funp{P}(\seq{t}|\seq{s})$进行如下变换：
+\parinterval 公式\eqref{eq:5-14}的核心内容之一是定义$\funp{P}(\seq{t}|\seq{s})$。在IBM模型中，可以使用贝叶斯准则对$\funp{P}(\seq{t}|\seq{s})$进行如下变换：
 \begin{eqnarray}
 \funp{P}(\seq{t}|\seq{s}) & = &\frac{\funp{P}(\seq{s},\seq{t})}{\funp{P}(\seq{s})} \nonumber \\
                       & = & \frac{\funp{P}(\seq{s}|\seq{t})\funp{P}(\seq{t})}{\funp{P}(\seq{s})}
 \label{eq:5-15}
 \end{eqnarray}

-\parinterval 公式\ref{eq:5-15}把$\seq{s}$到$\seq{t}$的翻译概率转化为$\frac{\funp{P}(\seq{s}|\seq{t})\textrm{P(t)}}{\funp{P}(\seq{s})}$，它包括三个部分：
+\parinterval 公式\eqref{eq:5-15}把$\seq{s}$到$\seq{t}$的翻译概率转化为$\frac{\funp{P}(\seq{s}|\seq{t})\textrm{P(t)}}{\funp{P}(\seq{s})}$，它包括三个部分：

 \begin{itemize}
 \vspace{0.5em}
@@ -574,7 +573,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \label{eq:5-16}
 \end{eqnarray}

-\parinterval 公式\ref{eq:5-16}展示了IBM模型最基础的建模方式，它把模型分解为两项：（反向）翻译模型$\funp{P}(\seq{s}|\seq{t})$和语言模型$\funp{P}(\seq{t})$。一个很自然的问题是：直接用$\funp{P}(\seq{t}|\seq{s})$定义翻译问题不就可以了吗，为什么要用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型？从理论上来说，正向翻译模型$\funp{P}(\seq{t}|\seq{s})$和反向翻译模型$\funp{P}(\seq{s}|\seq{t})$的数学建模可以是一样的，因为我们只需要在建模的过程中把两个语言调换即可。使用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型的意义在于引入了语言模型，它可以很好地对译文的流畅度进行评价，确保结果是通顺的目标语言句子。
+\parinterval 公式\eqref{eq:5-16}展示了IBM模型最基础的建模方式，它把模型分解为两项：（反向）翻译模型$\funp{P}(\seq{s}|\seq{t})$和语言模型$\funp{P}(\seq{t})$。一个很自然的问题是：直接用$\funp{P}(\seq{t}|\seq{s})$定义翻译问题不就可以了吗，为什么要用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型？从理论上来说，正向翻译模型$\funp{P}(\seq{t}|\seq{s})$和反向翻译模型$\funp{P}(\seq{s}|\seq{t})$的数学建模可以是一样的，因为我们只需要在建模的过程中把两个语言调换即可。使用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型的意义在于引入了语言模型，它可以很好地对译文的流畅度进行评价，确保结果是通顺的目标语言句子。

 \parinterval 可以回忆一下\ref{sec:sentence-level-translation}节中讨论的问题，如果只使用翻译模型可能会造成一个局面：译文的单词都和源语言单词对应的很好，但是由于语序的问题，读起来却不像人说的话。从这个角度说，引入语言模型是十分必要的。这个问题在Brown等人的论文中也有讨论\upcite{DBLP:journals/coling/BrownPPM94}，他们提到单纯使用$\funp{P}(\seq{s}|\seq{t})$会把概率分配给一些翻译对应比较好但是不合法的目标语句子，而且这部分概率可能会很大，影响模型的决策。这也正体现了IBM模型的创新之处，作者用数学技巧把$\funp{P}(\seq{t})$引入进来，保证了系统的输出是通顺的译文。语言模型也被广泛使用在语音识别等领域以保证结果的流畅性，甚至应用的历史比机器翻译要长得多，这里的方法也有借鉴相关工作的味道。

@@ -586,7 +585,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \section{统计机器翻译的三个基本问题}

-\parinterval 公式\ref{eq:5-16}给出了统计机器翻译的数学描述。为了实现这个过程，面临着三个基本问题：
+\parinterval 公式\eqref{eq:5-16}给出了统计机器翻译的数学描述。为了实现这个过程，面临着三个基本问题：

 \begin{itemize}
 \vspace{0.5em}
@@ -598,13 +597,13 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \vspace{0.5em}
 \end{itemize}

-\parinterval 为了理解以上的问题，可以先回忆一下\ref{sec:sentence-level-translation}小节中的公式\ref{eq:5-10}，即$g(\seq{s},\seq{t})$函数的定义，它用于评估一个译文的好与坏。如图\ref{fig:5-14}所示，$g(\seq{s},\seq{t})$函数与公式\ref{eq:5-16}的建模方式非常一致，即$g(\seq{s},\seq{t})$函数中红色部分描述译文$\seq{t}$的可能性大小，对应翻译模型$\funp{P}(\seq{s}|\seq{t})$；蓝色部分描述译文的平滑或流畅程度，对应语言模型$\funp{P}(\seq{t})$。尽管这种对应并不十分严格的，但也可以看出在处理机器翻译问题上，很多想法的本质是一样的。
+\parinterval 为了理解以上的问题，可以先回忆一下\ref{sec:sentence-level-translation}小节中的公式\eqref{eq:5-10}，即$g(\seq{s},\seq{t})$函数的定义，它用于评估一个译文的好与坏。如图\ref{fig:5-14}所示，$g(\seq{s},\seq{t})$函数与公式\eqref{eq:5-16}的建模方式非常一致，即$g(\seq{s},\seq{t})$函数中红色部分描述译文$\seq{t}$的可能性大小，对应翻译模型$\funp{P}(\seq{s}|\seq{t})$；蓝色部分描述译文的平滑或流畅程度，对应语言模型$\funp{P}(\seq{t})$。尽管这种对应并不十分严格的，但也可以看出在处理机器翻译问题上，很多想法的本质是一样的。

 %----------------------------------------------
 \begin{figure}[htp]
    \centering
 \input{./Chapter5/Figures/figure-correspondence-between-ibm-model&formula-1.13}
-    \caption{IBM模型与公式\ref{eq:5-10}的对应关系}
+    \caption{IBM模型与公式\eqref{eq:5-10}的对应关系}
    \label{fig:5-14}
 \end{figure}
 %----------------------------------------------
@@ -661,9 +660,9 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \label{eq:5-17}
 \end{eqnarray}

-\parinterval 公式\ref{eq:5-17}使用了简单的全概率公式把$\funp{P}(\seq{s}|\seq{t})$进行展开。通过访问$\seq{s}$和$\seq{t}$之间所有可能的词对齐$\seq{a}$，并把对应的对齐概率进行求和，得到了$\seq{t}$到$\seq{s}$的翻译概率。这里，可以把词对齐看作翻译的隐含变量，这样从$\seq{t}$到$\seq{s}$的生成就变为从$\seq{t}$同时生成$\seq{s}$和隐含变量$\seq{a}$的问题。引入隐含变量是生成式模型常用的手段，通过使用隐含变量，可以把较为困难的端到端学习问题转化为分步学习问题。
+\parinterval 公式\eqref{eq:5-17}使用了简单的全概率公式把$\funp{P}(\seq{s}|\seq{t})$进行展开。通过访问$\seq{s}$和$\seq{t}$之间所有可能的词对齐$\seq{a}$，并把对应的对齐概率进行求和，得到了$\seq{t}$到$\seq{s}$的翻译概率。这里，可以把词对齐看作翻译的隐含变量，这样从$\seq{t}$到$\seq{s}$的生成就变为从$\seq{t}$同时生成$\seq{s}$和隐含变量$\seq{a}$的问题。引入隐含变量是生成式模型常用的手段，通过使用隐含变量，可以把较为困难的端到端学习问题转化为分步学习问题。

-\parinterval 举个例子说明公式\ref{eq:5-17}的实际意义。如图\ref{fig:5-17}所示，可以把从“谢谢\ 你”到“thank you”的翻译分解为9种可能的词对齐。因为源语言句子$\seq{s}$有2个词，目标语言句子$\seq{t}$加上空标记$t_0$共3个词，因此每个源语言单词有3个可能对齐的位置，整个句子共有$3\times3=9$种可能的词对齐。
+\parinterval 举个例子说明公式\eqref{eq:5-17}的实际意义。如图\ref{fig:5-17}所示，可以把从“谢谢\ 你”到“thank you”的翻译分解为9种可能的词对齐。因为源语言句子$\seq{s}$有2个词，目标语言句子$\seq{t}$加上空标记$t_0$共3个词，因此每个源语言单词有3个可能对齐的位置，整个句子共有$3\times3=9$种可能的词对齐。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -680,7 +679,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \label{eq:5-18}
 \end{eqnarray}

-\noindent  其中$s_j$和$a_j$分别表示第$j$个源语言单词及第$j$个源语言单词对齐到的目标位置，\seq{s}${{}_1^{j-1}}$表示前$j-1$个源语言单词（即\seq{s}${}_1^{j-1}=s_1...s_{j-1}$），\seq{a}${}_1^{j-1}$表示前$j-1$个源语言的词对齐（即\seq{a}${}_1^{j-1}=a_1...a_{j-1}$），$m$表示源语句子的长度。公式\ref{eq:5-18}将$\funp{P}(\seq{s},\seq{a}|\seq{t})$分解为四个部分，具体含义如下：
+\noindent  其中$s_j$和$a_j$分别表示第$j$个源语言单词及第$j$个源语言单词对齐到的目标位置，\seq{s}${{}_1^{j-1}}$表示前$j-1$个源语言单词（即\seq{s}${}_1^{j-1}=s_1...s_{j-1}$），\seq{a}${}_1^{j-1}$表示前$j-1$个源语言的词对齐（即\seq{a}${}_1^{j-1}=a_1...a_{j-1}$），$m$表示源语句子的长度。公式\eqref{eq:5-18}将$\funp{P}(\seq{s},\seq{a}|\seq{t})$分解为四个部分，具体含义如下：

 \begin{itemize}
 \vspace{0.5em}
@@ -695,7 +694,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \end{itemize}
 \parinterval 换句话说，当求$\funp{P}(\seq{s},\seq{a}|\seq{t})$时，首先根据译文$\seq{t}$确定源语言句子$\seq{s}$的长度$m$；当知道源语言句子有多少个单词后，循环$m$次，依次生成第1个到第$m$个源语言单词；当生成第$j$个源语言单词时，要先确定它是由哪个目标语译文单词生成的，即确定生成的源语言单词对应的译文单词的位置；当知道了目标语译文单词的位置，就能确定第$j$个位置的源语言单词。

-\parinterval 需要注意的是公式\ref{eq:5-18}定义的模型并没有做任何化简和假设，也就是说公式的左右两端是严格相等的。在后面的内容中会看到，这种将一个整体进行拆分的方法可以有助于分步骤化简并处理问题。
+\parinterval 需要注意的是公式\eqref{eq:5-18}定义的模型并没有做任何化简和假设，也就是说公式的左右两端是严格相等的。在后面的内容中会看到，这种将一个整体进行拆分的方法可以有助于分步骤化简并处理问题。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -703,7 +702,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \subsection{基于词对齐的翻译实例}

-\parinterval 用前面图\ref{fig:5-16}中例子来对公式\ref{eq:5-18}进行说明。例子中，源语言句子“在\ \ 桌子\ \ 上”目标语译文“on the table”之间的词对齐为$\seq{a}=\{\textrm{1-0, 2-3, 3-1}\}$。 公式\ref{eq:5-18}的计算过程如下：
+\parinterval 用前面图\ref{fig:5-16}中例子来对公式\eqref{eq:5-18}进行说明。例子中，源语言句子“在\ \ 桌子\ \ 上”目标语译文“on the table”之间的词对齐为$\seq{a}=\{\textrm{1-0, 2-3, 3-1}\}$。 公式\eqref{eq:5-18}的计算过程如下：

 \begin{itemize}
 \vspace{0.5em}
@@ -734,13 +733,13 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \sectionnewpage
 \section{IBM模型1}
-\parinterval 公式\ref{eq:5-17}和公式\ref{eq:5-18}把翻译问题定义为对译文和词对齐同时进行生成的问题。其中有两个问题：
+\parinterval 公式\eqref{eq:5-17}和公式\eqref{eq:5-18}把翻译问题定义为对译文和词对齐同时进行生成的问题。其中有两个问题：

 \begin{itemize}
 \vspace{0.3em}
-\item 首先，公式\ref{eq:5-17}的右端（$ \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$）要求对所有的词对齐概率进行求和，但是词对齐的数量随着句子长度是呈指数增长，如何遍历所有的对齐$\seq{a}$？
+\item 首先，公式\eqref{eq:5-17}的右端（$ \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$）要求对所有的词对齐概率进行求和，但是词对齐的数量随着句子长度是呈指数增长，如何遍历所有的对齐$\seq{a}$？
 \vspace{0.3em}
-\item 其次，公式\ref{eq:5-18}虽然对词对齐的问题进行了描述，但是模型中的很多参数仍然很复杂，如何计算$\funp{P}(m|\seq{t})$、$\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})$ 和$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})$？
+\item 其次，公式\eqref{eq:5-18}虽然对词对齐的问题进行了描述，但是模型中的很多参数仍然很复杂，如何计算$\funp{P}(m|\seq{t})$、$\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})$ 和$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})$？
 \vspace{0.3em}
 \end{itemize}

@@ -751,7 +750,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 %----------------------------------------------------------------------------------------
 \vspace{-0.5em}
 \subsection{IBM模型1}
-\parinterval IBM模型1对公式\ref{eq:5-18}中的三项进行了简化。具体方法如下：
+\parinterval IBM模型1对公式\eqref{eq:5-18}中的三项进行了简化。具体方法如下：

 \begin{itemize}
 \item 假设$\funp{P}(m|\seq{t})$为常数$\varepsilon$，即源语言句子长度的生成概率服从均匀分布，如下：
@@ -771,10 +770,10 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \label{eq:5-22}
 \end{eqnarray}

-用一个简单的例子对公式\ref{eq:5-22}进行说明。比如，在图\ref{fig:5-18}中，“桌子”对齐到“table”，可被描述为$f(s_2 |t_{a_2})=f(\textrm{“桌子”}|\textrm{“table”})$，表示给定“table”翻译为“桌子”的概率。通常，$f(s_2 |t_{a_2})$被认为是一种概率词典，它反应了两种语言词汇一级的对应关系。
+用一个简单的例子对公式\eqref{eq:5-22}进行说明。比如，在图\ref{fig:5-18}中，“桌子”对齐到“table”，可被描述为$f(s_2 |t_{a_2})=f(\textrm{“桌子”}|\textrm{“table”})$，表示给定“table”翻译为“桌子”的概率。通常，$f(s_2 |t_{a_2})$被认为是一种概率词典，它反应了两种语言词汇一级的对应关系。
 \end{itemize}

-\parinterval 将上述三个假设和公式\ref{eq:5-18}代入公式\ref{eq:5-17}中，得到$\funp{P}(\seq{s}|\seq{t})$的表达式：
+\parinterval 将上述三个假设和公式\eqref{eq:5-18}代入公式\eqref{eq:5-17}中，得到$\funp{P}(\seq{s}|\seq{t})$的表达式：
 \begin{eqnarray}
 \funp{P}(\seq{s}|\seq{t}) & = &  \sum_{\seq{a}}{\funp{P}(\seq{s},\seq{a}|\seq{t})} \nonumber \\
                        & = &  \sum_{\seq{a}}{\funp{P}(m|\seq{t})}\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})\funp{P}(s_j |a_1^j,s_1^{j-1},m,\seq{t})} \nonumber \\
@@ -792,19 +791,19 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \end{figure}
 %----------------------------------------------

-\parinterval 在公式\ref{eq:5-23}中，需要遍历所有的词对齐，即$ \sum_{\seq{a}}{\cdot}$。但这种表示不够直观，因此可以把这个过程重新表示为如下形式：
+\parinterval 在公式\eqref{eq:5-23}中，需要遍历所有的词对齐，即$ \sum_{\seq{a}}{\cdot}$。但这种表示不够直观，因此可以把这个过程重新表示为如下形式：
 \begin{eqnarray}
 \funp{P}(\seq{s}|\seq{t})={\sum_{a_1=0}^{l}\cdots}{\sum_{a_m=0}^{l}\frac{\varepsilon}{(l+1)^m}}{\prod_{j=1}^{m}f(s_j|t_{a_j})}
 \label{eq:5-24}
 \end{eqnarray}

-\parinterval 公式\ref{eq:5-24}分为两个主要部分。第一部分：遍历所有的对齐$\seq{a}$。其中$\seq{a}$由$\{a_1,...,a_m\}$\\ 组成，每个$a_j\in \{a_1,...,a_m\}$从译文的开始位置$(0)$循环到截止位置$(l)$。如图\ref{fig:5-19}表示的例子，描述的是源语单词$s_3$从译文的开始$t_0$遍历到结尾$t_3$，即$a_3$的取值范围。第二部分: 对于每个$\seq{a}$累加对齐概率$\funp{P}(\seq{s},a| \seq{t})=\frac{\varepsilon}{(l+1)^m}{\prod_{j=1}^{m}f(s_j|t_{a_j})}$。
+\parinterval 公式\eqref{eq:5-24}分为两个主要部分。第一部分：遍历所有的对齐$\seq{a}$。其中$\seq{a}$由$\{a_1,...,a_m\}$\\ 组成，每个$a_j\in \{a_1,...,a_m\}$从译文的开始位置$(0)$循环到截止位置$(l)$。如图\ref{fig:5-19}表示的例子，描述的是源语单词$s_3$从译文的开始$t_0$遍历到结尾$t_3$，即$a_3$的取值范围。第二部分: 对于每个$\seq{a}$累加对齐概率$\funp{P}(\seq{s},a| \seq{t})=\frac{\varepsilon}{(l+1)^m}{\prod_{j=1}^{m}f(s_j|t_{a_j})}$。

 %----------------------------------------------
 \begin{figure}[htp]
    \centering
 \input{./Chapter5/Figures/figure-formula-3.25-part-1-example}
-    \caption{公式{\ref{eq:5-24}}第一部分实例}
+    \caption{公式{\eqref{eq:5-24}}第一部分实例}
    \label{fig:5-19}
 \end{figure}
 %----------------------------------------------
@@ -817,7 +816,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \subsection{解码及计算优化}\label{decoding&computational-optimization}

-\parinterval 如果模型参数给定，可以使用IBM模型1对新的句子进行翻译。比如，可以使用\ref{sec:simple-decoding}节描述的解码方法搜索最优译文。在搜索过程中，只需要通过公式\ref{eq:5-24}计算每个译文候选的IBM模型翻译概率。但是，公式\ref{eq:5-24}的高计算复杂度导致这些模型很难直接使用。以IBM模型1为例，这里把公式\ref{eq:5-24}重写为：
+\parinterval 如果模型参数给定，可以使用IBM模型1对新的句子进行翻译。比如，可以使用\ref{sec:simple-decoding}节描述的解码方法搜索最优译文。在搜索过程中，只需要通过公式\eqref{eq:5-24}计算每个译文候选的IBM模型翻译概率。但是，公式\eqref{eq:5-24}的高计算复杂度导致这些模型很难直接使用。以IBM模型1为例，这里把公式\eqref{eq:5-24}重写为：
 \begin{eqnarray}
 \funp{P}(\seq{s}| \seq{t}) = \frac{\varepsilon}{(l+1)^{m}} \underbrace{\sum\limits_{a_1=0}^{l} ... \sum\limits_{a_m=0}^{l}}_{(l+1)^m\textrm{次循环}} \underbrace{\prod\limits_{j=1}^{m} f(s_j|t_{a_j})}_{m\textrm{次循环}}
 \label{eq:5-27}
@@ -829,7 +828,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \label{eq:5-28}
 \end{eqnarray}

-\noindent  公式\ref{eq:5-28}的技巧在于把若干个乘积的加法（等式左手端）转化为若干加法结果的乘积（等式右手端），这样省去了多次循环，把$O((l+1)^m m)$的计算复杂度降为$O((l+1)m)$。此外，公式\ref{eq:5-28}相比公式\ref{eq:5-27}的另一个优点在于，公式\ref{eq:5-28}中乘法的数量更少，因为现代计算机中乘法运算的代价要高于加法，因此公式\ref{eq:5-28}的计算机实现效率更高。图\ref{fig:5-21} 对这个过程进行了进一步解释。
+\noindent  公式\eqref{eq:5-28}的技巧在于把若干个乘积的加法（等式左手端）转化为若干加法结果的乘积（等式右手端），这样省去了多次循环，把$O((l+1)^m m)$的计算复杂度降为$O((l+1)m)$。此外，公式\eqref{eq:5-28}相比公式\eqref{eq:5-27}的另一个优点在于，公式\eqref{eq:5-28}中乘法的数量更少，因为现代计算机中乘法运算的代价要高于加法，因此公式\eqref{eq:5-28}的计算机实现效率更高。图\ref{fig:5-21} 对这个过程进行了进一步解释。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -840,13 +839,13 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \end{figure}
 %----------------------------------------------

-\parinterval 接着，利用公式\ref{eq:5-28}的方式，可以把公式\ref{eq:5-24}重写表示为：
+\parinterval 接着，利用公式\eqref{eq:5-28}的方式，可以把公式\eqref{eq:5-24}重写表示为：
 \begin{eqnarray}
 \textrm{IBM模型1：\ \ \ \ } \funp{P}(\seq{s}| \seq{t}) & = & \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \label{eq:5-64}
 \label{eq:5-29}
 \end{eqnarray}

-公式\ref{eq:5-64}是IBM模型1的最终表达式，在解码和训练中可以被直接使用。
+公式\eqref{eq:5-64}是IBM模型1的最终表达式，在解码和训练中可以被直接使用。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -881,9 +880,9 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \noindent 其中，$\argmax_{\theta}$表示求最优参数的过程（或优化过程）。

-\parinterval 公式\ref{eq:5-30}实际上也是一种基于极大似然的模型训练方法。这里，可以把$\funp{P}_{\theta}(\seq{s}|\seq{t})$看作是模型对数据描述的一个似然函数，记作$L(\seq{s},\seq{t};\theta)$。也就是，优化目标是对似然函数的优化：$\{\widehat{\theta}\}=\{\argmax_{\theta \in \Theta}L(\seq{s},\seq{t};\theta)\}$，其中\{$\widehat{\theta}$\} 表示可能有多个结果，$\Theta$表示参数空间。
+\parinterval 公式\eqref{eq:5-30}实际上也是一种基于极大似然的模型训练方法。这里，可以把$\funp{P}_{\theta}(\seq{s}|\seq{t})$看作是模型对数据描述的一个似然函数，记作$L(\seq{s},\seq{t};\theta)$。也就是，优化目标是对似然函数的优化：$\{\widehat{\theta}\}=\{\argmax_{\theta \in \Theta}L(\seq{s},\seq{t};\theta)\}$，其中\{$\widehat{\theta}$\} 表示可能有多个结果，$\Theta$表示参数空间。

-\parinterval 回到IBM模型的优化问题上。以IBM模型1为例，优化的目标是最大化翻译概率$\funp{P}(\seq{s}| \seq{t})$。使用公式\ref{eq:5-64} ，可以把这个目标表述为：
+\parinterval 回到IBM模型的优化问题上。以IBM模型1为例，优化的目标是最大化翻译概率$\funp{P}(\seq{s}| \seq{t})$。使用公式\eqref{eq:5-64} ，可以把这个目标表述为：
 \begin{eqnarray}
 &                    & \textrm{max}\Big(\frac{\varepsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}{f({s_j|t_i})}\Big) \nonumber \\
 & \textrm{s.t.} & \textrm{任意单词} t_{y}:\;\sum_{s_x}{f(s_x|t_y)}=1 \nonumber
@@ -968,7 +967,7 @@ f(s_u|t_v) = \lambda_{t_v}^{-1} \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}
 \label{eq:5-39}
 \end{eqnarray}

-\parinterval  可以看出，这不是一个计算$f(s_u|t_v)$的解析式，因为等式右端仍含有$f(s_u|t_v)$。不过它蕴含着一种非常经典的方法\ $\dash$\ {\small\sffamily\bfseries{期望最大化}}\index{期望最大化}（Expectation Maximization）\index{Expectation Maximization}方法，简称EM方法（或算法）。使用EM方法可以利用上式迭代地计算$f(s_u|t_v)$，使其最终收敛到最优值。EM方法的思想是：用当前的参数，求似然函数的期望，之后最大化这个期望同时得到新的一组参数的值。对于IBM模型来说，其迭代过程就是反复使用公式\ref{eq:5-39}，具体如图\ref{fig:5-24}所示。
+\parinterval  可以看出，这不是一个计算$f(s_u|t_v)$的解析式，因为等式右端仍含有$f(s_u|t_v)$。不过它蕴含着一种非常经典的方法\ $\dash$\ {\small\sffamily\bfseries{期望最大化}}\index{期望最大化}（Expectation Maximization）\index{Expectation Maximization}方法，简称EM方法（或算法）。使用EM方法可以利用上式迭代地计算$f(s_u|t_v)$，使其最终收敛到最优值。EM方法的思想是：用当前的参数，求似然函数的期望，之后最大化这个期望同时得到新的一组参数的值。对于IBM模型来说，其迭代过程就是反复使用公式\eqref{eq:5-39}，具体如图\ref{fig:5-24}所示。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -979,13 +978,13 @@ f(s_u|t_v) = \lambda_{t_v}^{-1} \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}
 \end{figure}
 %----------------------------------------------

-\parinterval 为了化简$f(s_u|t_v)$的计算，在此对公式\ref{eq:5-39}进行了重新组织，见图\ref{fig:5-25}。其中，红色部分表示翻译概率P$(\seq{s}|\seq{t})$；蓝色部分表示$(s_u,t_v)$ 在句对$(\seq{s},\seq{t})$中配对的总次数，即“$t_v$翻译为$s_u$”在所有对齐中出现的次数；绿色部分表示$f(s_u|t_v)$对于所有的$t_i$的相对值，即“$t_v$翻译为$s_u$”在所有对齐中出现的相对概率；蓝色与绿色部分相乘表示“$t_v$翻译为$s_u$”这个事件出现次数的期望的估计，称之为{\small\sffamily\bfseries{期望频次}}\index{期望频次}(Expected Count)\index{Expected Count}。
+\parinterval 为了化简$f(s_u|t_v)$的计算，在此对公式\eqref{eq:5-39}进行了重新组织，见图\ref{fig:5-25}。其中，红色部分表示翻译概率P$(\seq{s}|\seq{t})$；蓝色部分表示$(s_u,t_v)$ 在句对$(\seq{s},\seq{t})$中配对的总次数，即“$t_v$翻译为$s_u$”在所有对齐中出现的次数；绿色部分表示$f(s_u|t_v)$对于所有的$t_i$的相对值，即“$t_v$翻译为$s_u$”在所有对齐中出现的相对概率；蓝色与绿色部分相乘表示“$t_v$翻译为$s_u$”这个事件出现次数的期望的估计，称之为{\small\sffamily\bfseries{期望频次}}\index{期望频次}(Expected Count)\index{Expected Count}。
 \vspace{-0.3em}
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
 \input{./Chapter5/Figures/figure-a-more-detailed-explanation-of-formula-3.40}
-   \caption{公式\ref{eq:5-39}的解释}
+   \caption{公式\eqref{eq:5-39}的解释}
   \label{fig:5-25}
 \end{figure}
 %----------------------------------------------
@@ -996,7 +995,7 @@ f(s_u|t_v) = \lambda_{t_v}^{-1} \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}
 c_{\mathbb{E}}(X)=\sum_i c(x_i) \cdot \funp{P}(x_i)
 \end{equation}

-\noindent 其中$c(x_i)$表示$X$取$x_i$时出现的次数，P$(x_i)$表示$X=x_i$出现的概率。图\ref{fig:5-26}展示了事件$X$的期望频次的详细计算过程。其中$x_1$、$x_2$和$x_3$分别表示事件$X$出现2次、1次和5次的情况。
+\noindent 其中$c(x_i)$表示$X$取$x_i$时出现的次数，$\funp{P}(x_i)$表示$X=x_i$出现的概率。图\ref{fig:5-26}展示了事件$X$的期望频次的详细计算过程。其中$x_1$、$x_2$和$x_3$分别表示事件$X$出现2次、1次和5次的情况。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -1098,7 +1097,7 @@ c_{\mathbb{E}}(s_u|t_v)=\sum\limits_{k=1}^{K}  c_{\mathbb{E}}(s_u|t_v;s^{[k]},t^
 \item 随着词对齐概念的不断深入，也有很多词对齐方面的工作并不依赖IBM模型。比如，可以直接使用判别式模型利用分类器解决词对齐问题\upcite{ittycheriah2005maximum}；使用带参数控制的动态规划方法来提高词对齐准确率\upcite{DBLP:conf/naacl/GaleC91}；甚至可以把对齐的思想用于短语和句法结构的双语对应\upcite{xiao2013unsupervised}；无监督的对称词对齐方法，正向和反向模型联合训练，结合数据的相似性\upcite{DBLP:conf/naacl/LiangTK06}；除了GIZA++，研究人员也开发了很多优秀的自动对齐工具，比如，FastAlign\upcite{DBLP:conf/naacl/DyerCS13}、Berkeley Word Aligner\upcite{taskar2005a}等，这些工具现在也有很广泛的应用。

 \vspace{0.5em}
-\item 一种较为通用的词对齐评价标准是{\bfnew{对齐错误率}}(Alignment Error Rate, AER)\upcite{DBLP:journals/coling/FraserM07}。在此基础之上也可以对词对齐评价方法进行改进，以提高对齐质量与机器翻译评价得分BLEU的相关性\upcite{DBLP:conf/acl/DeNeroK07,paul2007all,黄书剑2009一种错误敏感的词对齐评价方法}。也有工作通过统计机器翻译系统性能的提升来评价对齐质量\upcite{DBLP:journals/coling/FraserM07}。不过，在相当长的时间内，词对齐质量对机器翻译系统的影响究竟如何并没有统一的结论。有些时候，词对齐的错误率下降了，但是机器翻译系统的译文品质没有带来性能提升。但是，这个问题比较复杂，需要进一步的论证。不过，可以肯定的是，词对齐可以帮助人们分析机器翻译的行为。甚至在最新的神经机器翻译中，如何在神经网络模型中寻求两种语言单词之间的对应关系也是对模型进行解释的有效手段之一\upcite{DBLP:journals/corr/FengLLZ16}。
+\item 一种较为通用的词对齐评价标准是{\bfnew{对齐错误率}}（Alignment Error Rate, AER）\upcite{DBLP:journals/coling/FraserM07}。在此基础之上也可以对词对齐评价方法进行改进，以提高对齐质量与机器翻译评价得分BLEU的相关性\upcite{DBLP:conf/acl/DeNeroK07,paul2007all,黄书剑2009一种错误敏感的词对齐评价方法}。也有工作通过统计机器翻译系统性能的提升来评价对齐质量\upcite{DBLP:journals/coling/FraserM07}。不过，在相当长的时间内，词对齐质量对机器翻译系统的影响究竟如何并没有统一的结论。有些时候，词对齐的错误率下降了，但是机器翻译系统的译文品质没有带来性能提升。但是，这个问题比较复杂，需要进一步的论证。不过，可以肯定的是，词对齐可以帮助人们分析机器翻译的行为。甚至在最新的神经机器翻译中，如何在神经网络模型中寻求两种语言单词之间的对应关系也是对模型进行解释的有效手段之一\upcite{DBLP:journals/corr/FengLLZ16}。

 \vspace{0.5em}
 \item 基于单词的翻译模型的解码问题也是早期研究者所关注的。比较经典的方法的是贪婪方法\upcite{germann2003greedy}。也有研究者对不同的解码方法进行了对比\upcite{germann2001fast}，并给出了一些加速解码的思路。随后，也有工作进一步对这些方法进行改进\upcite{DBLP:conf/coling/UdupaFM04,DBLP:conf/naacl/RiedelC09}。实际上，基于单词的模型的解码是一个NP完全问题\upcite{knight1999decoding}，这也是为什么机器翻译的解码十分困难的原因。关于翻译模型解码算法的时间复杂度也有很多讨论\upcite{DBLP:conf/eacl/UdupaM06,DBLP:conf/emnlp/LeuschMN08,DBLP:journals/mt/FlemingKN15}。

--- a/Chapter6/chapter6.tex
+++ b/Chapter6/chapter6.tex
@@ -21,9 +21,9 @@
 %	CHAPTER 6
 %----------------------------------------------------------------------------------------

-\chapter{基于扭曲度和繁衍率的模型}
+\chapter{基于扭曲度和繁衍率的翻译模型}

-{\chapterfive}展示了一种基于单词的翻译模型。这种模型的形式非常简单，而且其隐含的词对齐信息具有较好的可解释性。不过，语言翻译的复杂性远远超出人们的想象。有两方面挑战\ \dash\ 如何对`` 调序''问题进行建模以及如何对``一对多翻译''问题进行建模。调序是翻译问题中所特有的现象，比如，汉语到日语的翻译中，需要对谓词进行调序。另一方面，一个单词在另一种语言中可能会被翻译为多个连续的词，比如，汉语`` 联合国''翻译到英语会对应三个单词``The United Nations''。这种现象也被称作一对多翻译，它与句子长度预测有着密切的联系。
+{\chapterfive}展示了一种基于单词的翻译模型。这种模型的形式非常简单，而且其隐含的词对齐信息具有较好的可解释性。不过，语言翻译的复杂性远远超出人们的想象。有两方面挑战\ \dash\ 如何对“ 调序”问题进行建模以及如何对“一对多翻译”问题进行建模。调序是翻译问题中所特有的现象，比如，汉语到日语的翻译中，需要对谓词进行调序。另一方面，一个单词在另一种语言中可能会被翻译为多个连续的词，比如，汉语“ 联合国”翻译到英语会对应三个单词“The United Nations”。这种现象也被称作一对多翻译，它与句子长度预测有着密切的联系。

 无论是调序还是一对多翻译，简单的翻译模型（如IBM模型1）都无法对其进行很好的处理。因此，需要考虑对这两个问题单独进行建模。本章将会对机器翻译中两个常用的概念进行介绍\ \dash\ 扭曲度（Distortion）和繁衍率（Fertility）。它们可以被看作是对调序和一对多翻译现象的一种统计描述。基于此，本章会进一步介绍基于扭曲度和繁衍率的翻译模型，建立相对完整的基于单词的统计建模体系。相关的技术和概念在后续章节也会被进一步应用。

@@ -53,7 +53,7 @@

 \parinterval 在对调序问题进行建模的方法中，最基本的是使用调序距离方法。这里，可以假设完全进行顺序翻译时，调序的“代价”是最低的。当调序出现时，可以用调序相对于顺序翻译产生的位置偏移来度量调序的程度，也被称为调序距离。图\ref{fig:6-2}展示了翻译时两种语言中词的对齐矩阵。比如，在图\ref{fig:6-2}(a)中，系统需要跳过“对”和“你”来翻译“感到”和“满意”，之后再回过头翻译“对”和“你”，这样就完成了对单词的调序。这时可以简单地把需要跳过的单词数看作一种距离。

-\parinterval 可以看到，调序距离实际上是在度量目标语言词序相对于源语言词序的一种扭曲程度。因此，也常常把这种调序距离称作{\small\sffamily\bfseries{扭曲度}}（Distortion）。调序距离越大对应的扭曲度也越大。比如，可以明显看出图\ref{fig:6-2}（b）中调序的扭曲度要比图\ref{fig:6-2}（a）中调序的扭曲度大，因此\ref{fig:6-2}（b）实例的调序代价也更大。
+\parinterval 可以看到，调序距离实际上是在度量目标语言词序相对于源语言词序的一种扭曲程度。因此，也常常把这种调序距离称作{\small\sffamily\bfseries{扭曲度}}（Distortion）。调序距离越大对应的扭曲度也越大。比如，可以明显看出图\ref{fig:6-2}(b)中调序的扭曲度要比图\ref{fig:6-2}(a)中调序的扭曲度大，因此\ref{fig:6-2}(b)实例的调序代价也更大。

 \parinterval 在机器翻译中使用扭曲度进行翻译建模是一种十分自然的想法。接下来，会介绍两个基于扭曲度的翻译模型，分别是IBM模型2和隐马尔可夫模型。不同于IBM模型1，它们利用了单词的位置信息定义了扭曲度，并将扭曲度融入翻译模型中，使得对翻译问题的建模更加合理。

@@ -78,7 +78,7 @@
 \label{eq:6-1}
 \end{eqnarray}

-\parinterval 这里还用{\chapterthree}中的例子（图\ref{fig:6-3}）来进行说明。在IBM模型1中，``桌子''对齐到目标语言四个位置的概率是一样的。但在IBM模型2中，``桌子''对齐到``table''被形式化为$a(a_j |j,m,l)=a(3|2,3,3)$，意思是对于源语言位置2（$j=2$）的词，如果它的源语言和目标语言都是3个词（$l=3,m=3$），对齐到目标语言位置3（$a_j=3$）的概率是多少？因为$a(a_j|j,m,l)$也是模型需要学习的参数，因此``桌子''对齐到不同目标语言单词的概率也是不一样的。理想的情况下，通过$a(a_j|j,m,l)$，``桌子''对齐到``table''应该得到更高的概率。
+\parinterval 这里还用{\chapterthree}中的例子（图\ref{fig:6-3}）来进行说明。在IBM模型1中，“桌子”对齐到目标语言四个位置的概率是一样的。但在IBM模型2中，“桌子”对齐到“table”被形式化为$a(a_j |j,m,l)=a(3|2,3,3)$，意思是对于源语言位置2（$j=2$）的词，如果它的源语言和目标语言都是3个词（$l=3,m=3$），对齐到目标语言位置3（$a_j=3$）的概率是多少？因为$a(a_j|j,m,l)$也是模型需要学习的参数，因此“桌子”对齐到不同目标语言单词的概率也是不一样的。理想的情况下，通过$a(a_j|j,m,l)$，“桌子”对齐到“table”应该得到更高的概率。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -136,7 +136,7 @@
 \label{eq:6-6}
 \end{eqnarray}

-\parinterval 这里用图\ref{fig:6-4}的例子对公式进行说明。在IBM模型1-2中，单词的对齐都是与单词所在的绝对位置有关。但在HMM词对齐模型中，``你''对齐到``you''被形式化为$\funp{P}(a_{j}|a_{j-1},l)= P(5|4,5)$，意思是对于源语言位置$3(j=3)$上的单词，如果它的译文是第5个目标语言单词，上一个对齐位置是$4(a_{2}=4)$，对齐到目标语言位置$5(a_{j}=5)$的概率是多少？理想的情况下，通过$\funp{P}(a_{j}|a_{j-1},l)$，``你''对齐到``you''应该得到更高的概率，并且由于源语言单词``对''和``你''距离很近，因此其对应的对齐位置``with''和``you''的距离也应该很近。
+\parinterval 这里用图\ref{fig:6-4}的例子对公式进行说明。在IBM模型1-2中，单词的对齐都是与单词所在的绝对位置有关。但在HMM词对齐模型中，“你”对齐到“you”被形式化为$\funp{P}(a_{j}|a_{j-1},l)= P(5|4,5)$，意思是对于源语言位置$3(j=3)$上的单词，如果它的译文是第5个目标语言单词，上一个对齐位置是$4(a_{2}=4)$，对齐到目标语言位置$5(a_{j}=5)$的概率是多少？理想的情况下，通过$\funp{P}(a_{j}|a_{j-1},l)$，“你”对齐到“you”应该得到更高的概率，并且由于源语言单词“对”和“你”距离很近，因此其对应的对齐位置“with”和“you”的距离也应该很近。

 \parinterval 把公式$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t}) \equiv f(s_j|t_{a_j})$和\eqref{eq:6-6}重新带入公式$\funp{P}(\seq{s},\seq{a}|\seq{t})=\funp{P}(m|\seq{t})$\\$\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})}$和$\funp{P}(\seq{s}|\seq{t})= \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$,可得HMM词对齐模型的数学描述：
 \begin{eqnarray}
@@ -179,11 +179,11 @@

 \begin{itemize}
 \vspace{0.3em}
-\item 首先，对于每个英语单词$t_i$决定它的产出率$\varphi_{i}$。比如``Scientists''的产出率是2，可表示为${\varphi}_{1}=2$。这表明它会生成2个汉语单词；
+\item 首先，对于每个英语单词$t_i$决定它的产出率$\varphi_{i}$。比如“Scientists”的产出率是2，可表示为${\varphi}_{1}=2$。这表明它会生成2个汉语单词；
 \vspace{0.3em}
-\item 其次，确定英语句子中每个单词生成的汉语单词列表。比如``Scientists''生成``科学家''和``们''两个汉语单词，可表示为${\tau}_1=\{{\tau}_{11}=\textrm{``科学家''},{\tau}_{12}=\textrm{``们''}\}$。 这里用特殊的空标记NULL表示翻译对空的情况；
+\item 其次，确定英语句子中每个单词生成的汉语单词列表。比如“Scientists”生成“科学家”和“们”两个汉语单词，可表示为${\tau}_1=\{{\tau}_{11}=\textrm{“科学家”},{\tau}_{12}=\textrm{“们”}\}$。 这里用特殊的空标记NULL表示翻译对空的情况；
 \vspace{0.3em}
-\item 最后，把生成的所有汉语单词放在合适的位置。比如``科学家''和``们''分别放在$\seq{s}$的位置1和位置2。可以用符号$\pi$记录生成的单词在源语言句子$\seq{s}$中的位置。比如``Scientists'' 生成的汉语单词在$\seq{s}$ 中的位置表示为${\pi}_{1}=\{{\pi}_{11}=1,{\pi}_{12}=2\}$。
+\item 最后，把生成的所有汉语单词放在合适的位置。比如“科学家”和“们”分别放在$\seq{s}$的位置1和位置2。可以用符号$\pi$记录生成的单词在源语言句子$\seq{s}$中的位置。比如“Scientists” 生成的汉语单词在$\seq{s}$ 中的位置表示为${\pi}_{1}=\{{\pi}_{11}=1,{\pi}_{12}=2\}$。
 \vspace{0.3em}
 \end{itemize}

@@ -200,7 +200,7 @@

 \parinterval 可以看出，一组$\tau$和$\pi$（记为$<\tau,\pi>$）可以决定一个对齐$\seq{a}$和一个源语句子$\seq{s}$。

-\noindent 相反的，一个对齐$\seq{a}$和一个源语句子$\seq{s}$可以对应多组$<\tau,\pi>$。如图\ref{fig:6-6}所示，不同的$<\tau,\pi>$对应同一个源语言句子和词对齐。它们的区别在于目标语单词``Scientists''生成的源语言单词``科学家''和`` 们''的顺序不同。这里把不同的$<\tau,\pi>$对应到的相同的源语句子$\seq{s}$和对齐$\seq{a}$记为$<\seq{s},\seq{a}>$。因此计算$\funp{P}(\seq{s},\seq{a}| \seq{t})$时需要把每个可能结果的概率加起来，如下：
+\noindent 相反的，一个对齐$\seq{a}$和一个源语句子$\seq{s}$可以对应多组$<\tau,\pi>$。如图\ref{fig:6-6}所示，不同的$<\tau,\pi>$对应同一个源语言句子和词对齐。它们的区别在于目标语单词“Scientists”生成的源语言单词“科学家”和“ 们”的顺序不同。这里把不同的$<\tau,\pi>$对应到的相同的源语句子$\seq{s}$和对齐$\seq{a}$记为$<\seq{s},\seq{a}>$。因此计算$\funp{P}(\seq{s},\seq{a}| \seq{t})$时需要把每个可能结果的概率加起来，如下：
 \begin{equation}
 \funp{P}(\seq{s},\seq{a}| \seq{t})=\sum_{{<\tau,\pi>}\in{<\seq{s},\seq{a}>}}{\funp{P}(\tau,\pi|\seq{t}) }
 \label{eq:6-9}
@@ -281,7 +281,7 @@
 \label{eq:6-15}
 \end{eqnarray}
 }
-\parinterval 而上面提到的$t_0$所对应的这些空位置是如何生成的呢？即如何确定哪些位置是要放置空对齐的源语言单词。在IBM模型3中，假设在所有的非空对齐源语言单词都被生成出来后（共$\varphi_1+\varphi_2+\cdots {\varphi}_l$个非空对源语单词），这些单词后面都以$p_1$概率随机地产生一个``槽''用来放置空对齐单词。这样，${\varphi}_0$就服从了一个二项分布。于是得到
+\parinterval 而上面提到的$t_0$所对应的这些空位置是如何生成的呢？即如何确定哪些位置是要放置空对齐的源语言单词。在IBM模型3中，假设在所有的非空对齐源语言单词都被生成出来后（共$\varphi_1+\varphi_2+\cdots {\varphi}_l$个非空对源语单词），这些单词后面都以$p_1$概率随机地产生一个“槽”用来放置空对齐单词。这样，${\varphi}_0$就服从了一个二项分布。于是得到
 {
 \begin{eqnarray}
 \funp{P}(\varphi_0|\seq{t})=\big(\begin{array}{c}
@@ -318,9 +318,9 @@ p_0+p_1                            & = & 1 \label{eq:6-21}

 \subsection{IBM 模型4}

-\parinterval IBM模型3仍然存在问题，比如，它不能很好地处理一个目标语言单词生成多个源语言单词的情况。这个问题在模型1和模型2中也存在。如果一个目标语言单词对应多个源语言单词，往往这些源语言单词构成短语或搭配。但是模型1-3把这些源语言单词看成独立的单元，而实际上它们是一个整体。这就造成了在模型1-3中这些源语言单词可能会``分散''开。为了解决这个问题，模型4对模型3进行了进一步修正。
+\parinterval IBM模型3仍然存在问题，比如，它不能很好地处理一个目标语言单词生成多个源语言单词的情况。这个问题在模型1和模型2中也存在。如果一个目标语言单词对应多个源语言单词，往往这些源语言单词构成短语或搭配。但是模型1-3把这些源语言单词看成独立的单元，而实际上它们是一个整体。这就造成了在模型1-3中这些源语言单词可能会“分散”开。为了解决这个问题，模型4对模型3进行了进一步修正。

-\parinterval 为了更清楚地阐述，这里引入新的术语\ \dash \ {\small\bfnew{概念单元}}\index{概念单元}或{\small\bfnew{概念}}\index{概念}（Concept）\index{Concept}。词对齐可以被看作概念之间的对应。这里的概念是指具有独立语法或语义功能的一组单词。依照Brown等人的表示方法\upcite{DBLP:journals/coling/BrownPPM94}，可以把概念记为cept.。每个句子都可以被表示成一系列的cept.。这里要注意的是，源语言句子中的cept.数量不一定等于目标句子中的cept.数量。因为有些cept. 可以为空，因此可以把那些空对的单词看作空cept.。比如，在图\ref{fig:6-8}的实例中，``了''就对应一个空cept.。
+\parinterval 为了更清楚地阐述，这里引入新的术语\ \dash \ {\small\bfnew{概念单元}}\index{概念单元}或{\small\bfnew{概念}}\index{概念}（Concept）\index{Concept}。词对齐可以被看作概念之间的对应。这里的概念是指具有独立语法或语义功能的一组单词。依照Brown等人的表示方法\upcite{DBLP:journals/coling/BrownPPM94}，可以把概念记为cept.。每个句子都可以被表示成一系列的cept.。这里要注意的是，源语言句子中的cept.数量不一定等于目标句子中的cept.数量。因为有些cept. 可以为空，因此可以把那些空对的单词看作空cept.。比如，在图\ref{fig:6-8}的实例中，“了”就对应一个空cept.。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -331,9 +331,9 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \end{figure}
 %----------------------------------------------

-\parinterval 在IBM模型的词对齐框架下，目标语的cept.只能是那些非空对齐的目标语单词，而且每个cept.只能由一个目标语言单词组成（通常把这类由一个单词组成的cept.称为独立单词cept.）。这里用$[i]$表示第$i$ 个独立单词cept.在目标语言句子中的位置。换句话说，$[i]$表示第$i$个非空对的目标语单词的位置。比如在本例中``mind''在$\seq{t}$中的位置表示为$[3]$。
+\parinterval 在IBM模型的词对齐框架下，目标语的cept.只能是那些非空对齐的目标语单词，而且每个cept.只能由一个目标语言单词组成（通常把这类由一个单词组成的cept.称为独立单词cept.）。这里用$[i]$表示第$i$ 个独立单词cept.在目标语言句子中的位置。换句话说，$[i]$表示第$i$个非空对的目标语单词的位置。比如在本例中“mind”在$\seq{t}$中的位置表示为$[3]$。

-\parinterval 另外，可以用$\odot_{i}$表示位置为$[i]$的目标语言单词对应的那些源语言单词位置的平均值，如果这个平均值不是整数则对它向上取整。比如在本例中，目标语句中第4个cept. （``.''）对应在源语言句子中的第5个单词。可表示为${\odot}_{4}=5$。
+\parinterval 另外，可以用$\odot_{i}$表示位置为$[i]$的目标语言单词对应的那些源语言单词位置的平均值，如果这个平均值不是整数则对它向上取整。比如在本例中，目标语句中第4个cept. （“.”）对应在源语言句子中的第5个单词。可表示为${\odot}_{4}=5$。

 \parinterval 利用这些新引进的概念，模型4对模型3的扭曲度进行了修改。主要是把扭曲度分解为两类参数。对于$[i]$对应的源语言单词列表($\tau_{[i]}$)中的第一个单词($\tau_{[i]1}$），它的扭曲度用如下公式计算：
 \begin{equation}
@@ -360,7 +360,7 @@ p_0+p_1                            & = & 1 \label{eq:6-21}

 \subsection{ IBM 模型5}

-\parinterval 模型3和模型4并不是``准确''的模型。这两个模型会把一部分概率分配给一些根本就不存在的句子。这个问题被称作IBM模型3和模型4的{\small\bfnew{缺陷}}\index{缺陷}（Deficiency）\index{Deficiency}。说得具体一些，模型3和模型4 中并没有这样的约束：如果已经放置了某个源语言单词的位置不能再放置其他单词，也就是说句子的任何位置只能放置一个词，不能多也不能少。由于缺乏这个约束，模型3和模型4中在所有合法的词对齐上概率和不等于1。 这部分缺失的概率被分配到其他不合法的词对齐上。举例来说，如图\ref{fig:6-9}所示，``吃/早饭''和``have breakfast''之间的合法词对齐用直线表示 。但是在模型3和模型4中， 它们的概率和为$0.9<1$。 损失掉的概率被分配到像5和6这样的对齐上了（红色）。虽然IBM模型并不支持一对多的对齐，但是模型3和模型4把概率分配给这些`` 不合法''的词对齐上，因此也就产生所谓的缺陷。
+\parinterval 模型3和模型4并不是“准确”的模型。这两个模型会把一部分概率分配给一些根本就不存在的句子。这个问题被称作IBM模型3和模型4的{\small\bfnew{缺陷}}\index{缺陷}（Deficiency）\index{Deficiency}。说得具体一些，模型3和模型4 中并没有这样的约束：如果已经放置了某个源语言单词的位置不能再放置其他单词，也就是说句子的任何位置只能放置一个词，不能多也不能少。由于缺乏这个约束，模型3和模型4中在所有合法的词对齐上概率和不等于1。 这部分缺失的概率被分配到其他不合法的词对齐上。举例来说，如图\ref{fig:6-9}所示，“吃/早饭”和“have breakfast”之间的合法词对齐用直线表示 。但是在模型3和模型4中， 它们的概率和为$0.9<1$。 损失掉的概率被分配到像5和6这样的对齐上了（红色）。虽然IBM模型并不支持一对多的对齐，但是模型3和模型4把概率分配给这些“ 不合法”的词对齐上，因此也就产生所谓的缺陷。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -414,17 +414,17 @@ p_0+p_1                            & = & 1 \label{eq:6-21}

 \subsection{词对齐及对称化}

-\parinterval IBM五个模型都是基于一个词对齐的假设\ \dash \ 一个源语言单词最多只能对齐到一个目标语言单词。这个约束大大降低了建模的难度。在法英翻译中一对多的对齐情况并不多见，这个假设带来的问题也不是那么严重。但是，在像汉英翻译这样的任务中，一个汉语单词对应多个英语单词的翻译很常见，这时IBM模型的词对齐假设就表现出了明显的问题。比如在翻译`` 我/会/试一试/。''\ $\to$ \ ``I will have a try .''时，IBM模型根本不能把单词``试一试''对齐到三个单词``have a try''，因而可能无法得到正确的翻译结果。
+\parinterval IBM五个模型都是基于一个词对齐的假设\ \dash \ 一个源语言单词最多只能对齐到一个目标语言单词。这个约束大大降低了建模的难度。在法英翻译中一对多的对齐情况并不多见，这个假设带来的问题也不是那么严重。但是，在像汉英翻译这样的任务中，一个汉语单词对应多个英语单词的翻译很常见，这时IBM模型的词对齐假设就表现出了明显的问题。比如在翻译“ 我/会/试一试/。”\ $\to$ \ “I will have a try .”时，IBM模型根本不能把单词“试一试”对齐到三个单词“have a try”，因而可能无法得到正确的翻译结果。

-\parinterval 本质上，IBM模型词对齐的``不完整''问题是IBM模型本身的缺陷。解决这个问题有很多思路。一种思路是，反向训练后，合并源语言单词，然后再正向训练。这里用汉英翻译为例来解释这个方法。首先反向训练，就是把英语当作待翻译语言，而把汉语当作目标语言进行训练（参数估计）。这样可以得到一个词对齐结果（参数估计的中间结果）。在这个词对齐结果里面，一个汉语单词可对应多个英语单词。之后，扫描每个英语句子，如果有多个英语单词对应同一个汉语单词，就把这些英语单词合并成一个英语单词。处理完之后，再把汉语当作源语言而把英语当作目标语言进行训练。这样就可以把一个汉语单词对应到合并的英语单词上。虽然从模型上看，还是一个汉语单词对应一个英语``单词''，但实质上已经把这个汉语单词对应到多个英语单词上了。训练完之后，再利用这些参数进行翻译（解码）时，就能把一个中文单词翻译成多个英文单词了。但是反向训练后再训练也存在一些问题。首先，合并英语单词会使数据变得更稀疏，训练不充分。其次，由于IBM模型的词对齐结果并不是高精度的，利用它的词对齐结果来合并一些英文单词可能造成严重的错误，比如：把本来很独立的几个单词合在了一起。因此，还要考虑实际需要和问题的严重程度来决定是否使用该方法。
+\parinterval 本质上，IBM模型词对齐的“不完整”问题是IBM模型本身的缺陷。解决这个问题有很多思路。一种思路是，反向训练后，合并源语言单词，然后再正向训练。这里用汉英翻译为例来解释这个方法。首先反向训练，就是把英语当作待翻译语言，而把汉语当作目标语言进行训练（参数估计）。这样可以得到一个词对齐结果（参数估计的中间结果）。在这个词对齐结果里面，一个汉语单词可对应多个英语单词。之后，扫描每个英语句子，如果有多个英语单词对应同一个汉语单词，就把这些英语单词合并成一个英语单词。处理完之后，再把汉语当作源语言而把英语当作目标语言进行训练。这样就可以把一个汉语单词对应到合并的英语单词上。虽然从模型上看，还是一个汉语单词对应一个英语“单词”，但实质上已经把这个汉语单词对应到多个英语单词上了。训练完之后，再利用这些参数进行翻译（解码）时，就能把一个中文单词翻译成多个英文单词了。但是反向训练后再训练也存在一些问题。首先，合并英语单词会使数据变得更稀疏，训练不充分。其次，由于IBM模型的词对齐结果并不是高精度的，利用它的词对齐结果来合并一些英文单词可能造成严重的错误，比如：把本来很独立的几个单词合在了一起。因此，还要考虑实际需要和问题的严重程度来决定是否使用该方法。

-\parinterval 另一种思路是双向对齐之后进行词对齐{\small\sffamily\bfseries{对称化}}\index{对称化}（Symmetrization）\index{Symmetrization}。这个方法可以在IBM词对齐的基础上获得对称的词对齐结果。思路很简单，用正向（汉语为源语言，英语为目标语言）和反向（汉语为目标语言，英语为源语言）同时训练。这样可以得到两个词对齐结果。然后利用一些启发性方法用这两个词对齐生成对称的结果（比如，取`` 并集''、``交集''等），这样就可以得到包含一对多和多对多的词对齐结果\upcite{och2003systematic}。比如，在基于短语的统计机器翻译中已经很成功地使用了这种词对齐信息进行短语的获取。直到今天，对称化仍然是很多自然语言处理系统中的一个关键步骤。
+\parinterval 另一种思路是双向对齐之后进行词对齐{\small\sffamily\bfseries{对称化}}\index{对称化}（Symmetrization）\index{Symmetrization}。这个方法可以在IBM词对齐的基础上获得对称的词对齐结果。思路很简单，用正向（汉语为源语言，英语为目标语言）和反向（汉语为目标语言，英语为源语言）同时训练。这样可以得到两个词对齐结果。然后利用一些启发性方法用这两个词对齐生成对称的结果（比如，取“ 并集”、“交集”等），这样就可以得到包含一对多和多对多的词对齐结果\upcite{och2003systematic}。比如，在基于短语的统计机器翻译中已经很成功地使用了这种词对齐信息进行短语的获取。直到今天，对称化仍然是很多自然语言处理系统中的一个关键步骤。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------

-\subsection{``缺陷''问题}
+\subsection{“缺陷”问题}

 \parinterval IBM模型的缺陷是指翻译模型会把一部分概率分配给一些根本不存在的源语言字符串。如果用$\funp{P}(\textrm{well}|\seq{t})$表示$\funp{P}(\seq{s}| \seq{t})$在所有的正确的（可以理解为语法上正确的）$\seq{s}$上的和，即
 \begin{eqnarray}
@@ -440,9 +440,9 @@ p_0+p_1                            & = & 1 \label{eq:6-21}

 \parinterval 本质上，模型3和模型4就是对应$\funp{P}({\textrm{failure}|\seq{t}})>0$的情况。这部分概率是模型损失掉的。有时候也把这类缺陷称为{\small\bfnew{物理缺陷}}\index{物理缺陷}（Physical Deficiency\index{Physical Deficiency}）或{\small\bfnew{技术缺陷}}\index{技术缺陷}（Technical Deficiency\index{Technical Deficiency}）。还有一种缺陷被称作{\small\bfnew{精神缺陷}}（Spiritual Deficiency\index{Spiritual Deficiency}）或{\small\bfnew{逻辑缺陷}}\index{逻辑缺陷}（Logical Deficiency\index{Logical Deficiency}），它是指$\funp{P}({\textrm{well}|\seq{t}}) + \funp{P}({\textrm{ill}|\seq{t}}) = 1$ 且$\funp{P}({\textrm{ill}|\seq{t}}) > 0$的情况。模型1 和模型2 就有逻辑缺陷。可以注意到，技术缺陷只存在于模型3 和模型4 中，模型1和模型2并没有技术缺陷问题。根本原因在于模型1和模型2的词对齐是从源语言出发对应到目标语言，$\seq{t}$到$\seq{s}$ 的翻译过程实际上是从单词$s_1$开始到单词$s_m$ 结束，依次把每个源语言单词$s_j$对应到唯一一个目标语言位置。显然，这个过程能够保证每个源语言单词仅对应一个目标语言单词。但是，模型3 和模型4中对齐是从目标语言出发对应到源语言，$\seq{t}$到$\seq{s}$的翻译过程从$t_1$开始$t_l$ 结束，依次把目标语言单词$t_i$生成的单词对应到某个源语言位置上。但是这个过程不能保证$t_i$中生成的单词所对应的位置没有被其他单词占用，因此也就产生了缺陷。

-\parinterval 这里还要强调的是，技术缺陷是模型3和模型4是模型本身的缺陷造成的，如果有一个``更好''的模型就可以完全避免这个问题。而逻辑缺陷几乎是不能从模型上根本解决的，因为对于任意一种语言都不能枚举所有的句子（$\funp{P}({\textrm{ill}|\seq{t}})$实际上是得不到的）。
+\parinterval 这里还要强调的是，技术缺陷是模型3和模型4是模型本身的缺陷造成的，如果有一个“更好”的模型就可以完全避免这个问题。而逻辑缺陷几乎是不能从模型上根本解决的，因为对于任意一种语言都不能枚举所有的句子（$\funp{P}({\textrm{ill}|\seq{t}})$实际上是得不到的）。

-\parinterval IBM的模型5已经解决了技术缺陷问题。但逻辑缺陷的解决很困难，因为即使对于人来说也很难判断一个句子是不是``良好''的句子。当然可以考虑用语言模型来缓解这个问题，不过由于在翻译的时候源语言句子都是定义``良好''的句子，$\funp{P}({\textrm{ill}|\seq{t}})$对$\funp{P}(\seq{s}| \seq{t})$的影响并不大。但用输入的源语言句子$\seq{s}$的``良好性''并不能解决技术缺陷，因为技术缺陷是模型的问题或者模型参数估计方法的问题。无论输入什么样的$\seq{s}$，模型3和模型4的技术缺陷问题都存在。
+\parinterval IBM的模型5已经解决了技术缺陷问题。但逻辑缺陷的解决很困难，因为即使对于人来说也很难判断一个句子是不是“良好”的句子。当然可以考虑用语言模型来缓解这个问题，不过由于在翻译的时候源语言句子都是定义“良好”的句子，$\funp{P}({\textrm{ill}|\seq{t}})$对$\funp{P}(\seq{s}| \seq{t})$的影响并不大。但用输入的源语言句子$\seq{s}$的“良好性”并不能解决技术缺陷，因为技术缺陷是模型的问题或者模型参数估计方法的问题。无论输入什么样的$\seq{s}$，模型3和模型4的技术缺陷问题都存在。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION

--- a/Chapter7/Figures/figure-example-of-hypothesis-recombination.tex
+++ b/Chapter7/Figures/figure-example-of-hypothesis-recombination.tex
@@ -3,12 +3,12 @@
 \begin{tikzpicture}
 \begin{scope}
 {
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h0) at (0,0) {\small{null}};
+\node [anchor=north,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h0) at (0,0) {\small{null}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl0) at (h0.north west) {\scriptsize{{\color{white} \textbf{0}}}};
 \node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};

-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.2em,yshift=3.5em]h0.east) {\small{an}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h3) at ([xshift=2.2em]h2.east) {\small{apple}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.2em,yshift=3.5em]h0.east) {\small{an}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h3) at ([xshift=2.2em]h2.east) {\small{apple}};

 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl2) at (h2.north west) {\scriptsize{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl3) at (h3.north west) {\scriptsize{{\color{white} \textbf{2}}}};
@@ -20,17 +20,17 @@
 \draw [->,very thick,ublue] ([xshift=0.1em]pt2.south) -- ([xshift=-0.1em]h3.west);

 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h1) at ([xshift=7em]h0.east) {\small{an apple}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h1) at ([xshift=7em]h0.east) {\small{an apple}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl1) at (h1.north west) {\scriptsize{{\color{white} \textbf{1-2}}}};
 \node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt1) at (h1.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.5}}}};
 \draw [->,very thick,ublue] ([xshift=0.1em]pt0.south) -- ([xshift=-0.1em]h1.west);
 }
 }
 {
-\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h4) at ([yshift=-9em]h0.south west) {\small{null}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h5) at ([xshift=2.2em]h4.east) {\small{he}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h6) at ([xshift=2.2em,yshift=3.5em]h4.east) {\small{it}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h8) at ([xshift=2.2em]h6.east) {\small{is not}};
+\node [anchor=north west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h4) at ([yshift=-9em]h0.south west) {\small{null}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h5) at ([xshift=2.2em]h4.east) {\small{he}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h6) at ([xshift=2.2em,yshift=3.5em]h4.east) {\small{it}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h8) at ([xshift=2.2em]h6.east) {\small{is not}};

 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl4) at (h4.north west) {\scriptsize{{\color{white} \textbf{0}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h5.north west) {\scriptsize{{\color{white} \textbf{1}}}};
@@ -47,7 +47,7 @@
 \draw [->,very thick,ublue] ([xshift=0.1em]pt6.south) -- ([xshift=-0.1em]h8.west);

 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h7) at ([xshift=2.2em]h5.east) {\small{is not}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h7) at ([xshift=2.2em]h5.east) {\small{is not}};
 \node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt7) at (h7.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h7.north west) {\scriptsize{{\color{white} \textbf{2}}}};
 \draw [->,very thick,ublue] ([xshift=0.1em]pt5.south) -- ([xshift=-0.1em]h7.west);
@@ -66,12 +66,12 @@

 \begin{scope}[xshift = 16em, yshift = 0em]
 {
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h0) at (0,0) {\small{null}};
+\node [anchor=north,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h0) at (0,0) {\small{null}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl0) at (h0.north west) {\scriptsize{{\color{white} \textbf{0}}}};
 \node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};

-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.2em,yshift=3.5em]h0.east) {\small{an}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h3) at ([xshift=2.2em]h2.east) {\small{apple}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.2em,yshift=3.5em]h0.east) {\small{an}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h3) at ([xshift=2.2em]h2.east) {\small{apple}};

 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl2) at (h2.north west) {\scriptsize{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl3) at (h3.north west) {\scriptsize{{\color{white} \textbf{2}}}};
@@ -87,10 +87,10 @@
 }
 }
 {
-\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h4) at ([yshift=-9em]h0.south west) {\small{null}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h5) at ([xshift=2.2em]h4.east) {\small{he}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h6) at ([xshift=2.2em,yshift=3.5em]h4.east) {\small{it}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h8) at ([xshift=2.2em]h6.east) {\small{is not}};
+\node [anchor=north west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h4) at ([yshift=-9em]h0.south west) {\small{null}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h5) at ([xshift=2.2em]h4.east) {\small{he}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h6) at ([xshift=2.2em,yshift=3.5em]h4.east) {\small{it}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h8) at ([xshift=2.2em]h6.east) {\small{is not}};

 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl4) at (h4.north west) {\scriptsize{{\color{white} \textbf{0}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h5.north west) {\scriptsize{{\color{white} \textbf{1}}}};
@@ -113,14 +113,14 @@

 {
 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em,opacity=0.3] (h1) at ([xshift=7em]h0.east) {\small{an apple}};
-\node [anchor=north west,inner sep=1.0pt,fill=black,opacity=0.3] (hl1) at (h1.north west) {\scriptsize{{\color{white} \textbf{1-2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.3] (pt1) at (h1.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em,opacity=0.6] (h1) at ([xshift=7em]h0.east) {\small{an apple}};
+\node [anchor=north west,inner sep=1.0pt,fill=black,opacity=0.6] (hl1) at (h1.north west) {\scriptsize{{\color{white} \textbf{1-2}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.6] (pt1) at (h1.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.5}}}};
 }
 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em,opacity=0.3] (h7) at ([xshift=2.2em]h5.east) {\small{is not}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.3] (pt7) at (h7.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
-\node [anchor=north west,inner sep=1.0pt,fill=black,opacity=0.3] (hl5) at (h7.north west) {\scriptsize{{\color{white} \textbf{2}}}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em,opacity=0.6] (h7) at ([xshift=2.2em]h5.east) {\small{is not}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.6] (pt7) at (h7.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
+\node [anchor=north west,inner sep=1.0pt,fill=black,opacity=0.6] (hl5) at (h7.north west) {\scriptsize{{\color{white} \textbf{2}}}};
 }
 }

@@ -129,7 +129,6 @@
 \node [anchor=west] (l2) at ([xshift=1em, yshift=0.5em]h7.east) {\footnotesize{舍弃概率}};
 \node [anchor=west] (l21) at ([xshift=0em, yshift=-1em]l2.west) {\footnotesize{较低假设}};

-%\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em,opacity=0.7] (h1) at ([xshift=-1em,yshift=2em]h2.north) {重组假设};
 \node[anchor=north] (l1) at ([xshift=7.5em,yshift=-1em]h0.south) {\scriptsize{重组假设}};
 \node[anchor=north] (l2) at ([xshift=7.5em,yshift=-1em]h4.south) {\scriptsize{重组假设}};
 \node[anchor=north] (part2) at ([xshift=0em,yshift=-14em]h0.south){\scriptsize{（b）译文不同时的假设重组}};

--- a/Chapter7/Figures/figure-example-of-phrase-table.tex
+++ b/Chapter7/Figures/figure-example-of-phrase-table.tex
@@ -6,7 +6,7 @@
 \node [anchor=west] (s2) at ([yshift=-1.2em]s1.west) {\small{，悲伤 $\vert\vert\vert$ , sadness $\vert\vert\vert$ -1.946 -3.659 0 -3.709 1 0 $\vert\vert\vert$ 1 $\vert\vert\vert$ 0-0 1-1}};
 \node [anchor=west] (s3) at ([yshift=-1.2em]s2.west) {\small{，北京 等 $\vert\vert\vert$ , beijing , and other $\vert\vert\vert$ 0 -7.98 0 -3.84 1 0 $\vert\vert\vert$ 2 $\vert\vert\vert$ 0-0 1-1 2-2 2-3 2-4}};
 \node [anchor=west] (s4) at ([yshift=-1.2em]s3.west) {\small{，北京 及 $\vert\vert\vert$ , beijing , and $\vert\vert\vert$ -0.69 -1.45 -0.92 -4.80 1 0  $\vert\vert\vert$ 2 $\vert\vert\vert$ 0-0 1-1 2-2}};
-\node [anchor=west] (s5) at ([yshift=-1.2em]s4.west) {\small{一个 中国 $\vert\vert\vert$ one china $\vert\vert\vert$ 0 -1.725 0 -1.636 1 0 $\vert\vert\vert$ 2 $\vert\vert\vert$ 1-1 2-2}};
+\node [anchor=west] (s5) at ([yshift=-1.2em]s4.west) {\small{一个 世界 $\vert\vert\vert$ one world $\vert\vert\vert$ 0 -1.725 0 -1.636 1 0 $\vert\vert\vert$ 2 $\vert\vert\vert$ 1-1 2-2}};
 \node [anchor=west] (s7) at ([yshift=-1.1em]s5.west) {\small{...}};
 \node [anchor=west] (s6) at ([yshift=1.0em]s1.west) {\small{...}};
 \begin{pgfonlayer}{background}

--- a/Chapter7/Figures/figure-example-of-stack-decode.tex
+++ b/Chapter7/Figures/figure-example-of-stack-decode.tex
@@ -55,14 +55,14 @@
 \draw [->,thick,red] (h1.north).. controls +(60:0.5) and +(120:0.5) .. (h2.north);
 \draw [->,thick,red] (h2.north).. controls +(60:0.5) and +(120:0.5) .. (h3.north);
 }
-\node [anchor=south east] (wtranslabel) at ([xshift=-1.5em,yshift=-2.2em]h0.south west) {\small{\textbf{：假设堆栈}}};
+\node [anchor=south east] (wtranslabel) at ([xshift=-2.4em,yshift=-2.26em]h0.south west) {\small{\textbf{假设堆栈}}};
 \node [anchor=east,inner sep=2pt,fill=blue!10,minimum height=1em,minimum width=2em] (stacklabel) at ([xshift=-0.1em]wtranslabel.west) {};
 {
-\node [anchor=east] (line1) at ([xshift=-1.0em,yshift=0.4em]h0.west) {\small{0号栈包含空假设}};
+\node [anchor=east] (line1) at ([xshift=-1.0em,yshift=0.45em]h0.west) {\small{0号栈包含空假设}};
 }
 {
-\node [anchor=east] (line2) at ([xshift=-2.3em,yshift=0.5em]h13.west) {\small{通过假设扩展产生新的假设}};
-\node [anchor=north west] (line3) at ([yshift=0.1em]line2.south west) {\small{并不断的被存入假设堆栈中}};
+\node [anchor=east] (line2) at ([xshift=-2.3em,yshift=0.44em]h13.west) {\small{通过假设扩展产生新的假设}};
+\node [anchor=north west] (line3) at ([yshift=0.1em]line2.south west) {\small{并不断地被存入假设堆栈中}};
 }
 \begin{pgfonlayer}{background}
 {

--- a/Chapter7/Figures/figure-get-word-alignment.tex
+++ b/Chapter7/Figures/figure-get-word-alignment.tex
@@ -88,8 +88,8 @@
 \node[align=center,elementnode,minimum size=0.3cm,inner sep=0.1pt,fill=red!50] (lc4) at (c22) {};
 \node[align=center,elementnode,minimum size=0.3cm,inner sep=0.1pt,fill=blue!50] (lc5) at (c30) {};

-\node[anchor=north] (l1) at ([xshift=0.5em,yshift=-0.5em]a10.south) {\footnotesize{S - T}};
-\node[anchor=north] (l2) at ([xshift=0.5em,yshift=-0.5em]b10.south) {\footnotesize{T - S}};
+\node[anchor=north] (l1) at ([xshift=0.5em,yshift=-0.5em]a10.south) {\footnotesize{$\seq{S}$ - $\seq{T}$}};
+\node[anchor=north] (l2) at ([xshift=0.5em,yshift=-0.5em]b10.south) {\footnotesize{$\seq{T}$ - $\seq{S}$}};
 \node[anchor=north] (l3) at ([xshift=0.5em,yshift=-0.5em]c10.south) {\footnotesize{交集/并集}};

 \end{scope}

--- a/Chapter7/Figures/figure-reorder-base-distance.tex
+++ b/Chapter7/Figures/figure-reorder-base-distance.tex
@@ -34,8 +34,8 @@
 \node[anchor=north] (d1) at ([xshift=-0.1em,yshift=-0.1em]distance.south) {+4};
 \node[anchor=north] (d2) at ([yshift=-1.8em]d1.south) {-5};

-\node[anchor=north west,fill=blue!20] (m1) at ([xshift=-1em,yshift=-0.0em]t1.south west) {\small{$start_1\ \ -\ \ end_{0}\ \ -\ \ 1$\quad =\quad 5\ -\ 0\ -\ 1}};
-\node[anchor=north west,fill=blue!20] (m2) at ([xshift=-1em,yshift=-0.0em]t2.south west) {\small{$start_2\ \ -\ \  end_{1}\ \ -\ \ 1$\quad =\quad 1\ -\ 5\ -\ 1}};
+\node[anchor=north west,fill=blue!20] (m1) at ([xshift=-1em,yshift=-0.0em]t1.south west) {\small{$\rm{start}_1\ \ -\ \ \rm{end}_{0}\ \ -\ \ 1$\quad =\quad 5\ -\ 0\ -\ 1}};
+\node[anchor=north west,fill=blue!20] (m2) at ([xshift=-1em,yshift=-0.0em]t2.south west) {\small{$\rm{start}_2\ \ -\ \  \rm{end}_{1}\ \ -\ \ 1$\quad =\quad 1\ -\ 5\ -\ 1}};

 \draw[-] ([xshift=0.08in]target.south west)--([xshift=2.4in]target.south west);


--- a/Chapter7/Figures/figure-word-and-phrase-translation-regard-as-path.tex
+++ b/Chapter7/Figures/figure-word-and-phrase-translation-regard-as-path.tex
@@ -143,7 +143,7 @@
 }

 {
-\node [anchor=north west] (wtranslabel) at ([yshift=-4em]t15.south west) {\scriptsize{翻译路径（仅含有单词）}};
+\node [anchor=north west] (wtranslabel) at ([yshift=-4em]t15.south west) {\scriptsize{翻译路径（仅包含单词）}};
 \draw [->,ultra thick,red,line width=1.5pt,opacity=0.7] ([xshift=0.2em]wtranslabel.east) -- ([xshift=1.2em]wtranslabel.east);
 }


--- a/Chapter7/chapter7.tex
+++ b/Chapter7/chapter7.tex
@@ -21,7 +21,7 @@
 %	CHAPTER 7
 %----------------------------------------------------------------------------------------

-\chapter{基于短语的模型}
+\chapter{基于短语的翻译模型}

 \parinterval 机器翻译的一个基本问题是要定义翻译的基本单元是什么。比如，可以像{\chapterfive}介绍的那样，以单词为单位进行翻译，即把句子的翻译看作是单词之间对应关系的一种组合。基于单词的模型是符合人类对翻译问题的认知的，因为单词本身就是人类加工语言的一种基本单元。另一方面，在进行翻译时也可以使用一些更“复杂”的知识。比如，很多词语间的搭配需要根据语境的变化进行调整，而且对于句子结构的翻译往往需要更上层的知识，如句法知识。因此，在对单词翻译进行建模的基础上，需要探索其他类型的翻译知识，使得搭配和结构翻译等问题可以更好地被建模。

@@ -39,7 +39,7 @@
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------

-\subsection{基于词的翻译所带来的问题}
+\subsection{词的翻译带来的问题}

 \parinterval 首先，回顾一下基于单词的统计翻译模型是如何完成翻译的。图\ref{fig:7-1}展示了一个实例。其中，左侧是一个单词的“翻译表”，它记录了源语言（汉语）单词和目标语言（英语）单词之间的对应关系，以及这种对应的可能性大小（用$\funp{P}$表示）。在翻译时，会使用这些单词一级的对应，生成译文。图\ref{fig:7-1}右侧就展示了一个基于词的模型生成的翻译结果，其中$\seq{s}$和$\seq{t}$分别表示源语言和目标语言句子，单词之间的连线表示两个句子中单词一级的对应。

@@ -539,13 +539,13 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c

 \parinterval 基于距离的调序的一个基本假设是：语言的翻译基本上都是顺序的，也就是，译文单词出现的顺序和源语言单词的顺序基本上是一致的。反过来说，如果译文和源语言单词（或短语）的顺序差别很大，就认为出现了调序。

-\parinterval 基于距离的调序方法的核心思想就是度量当前翻译结果与顺序翻译之间的差距。对于译文中的第$i$个短语，令$start_i$表示它所对应的源语言短语中第一个词所在的位置，$end_i$表示它所对应的源语言短语中最后一个词所在的位置。于是，这个短语（相对于前一个短语）的调序距离为：
+\parinterval 基于距离的调序方法的核心思想就是度量当前翻译结果与顺序翻译之间的差距。对于译文中的第$i$个短语，令$\rm{start}_i$表示它所对应的源语言短语中第一个词所在的位置，$\rm{end}_i$表示它所对应的源语言短语中最后一个词所在的位置。于是，这个短语（相对于前一个短语）的调序距离为：
 \begin{eqnarray}
-dr = start_i-end_{i-1}-1
+dr = {\rm{start}}_i-{\rm{end}}_{i-1}-1
 \label{eq:7-15}
 \end{eqnarray}

-\parinterval 在图\ref{fig:7-20}的例子中，“the apple”所对应的调序距离为4，“on the table”所对应的调序距离为$-5$。显然，如果两个源语短语按顺序翻译，则$start_i = end_{i-1} + 1$，这时调序距离为0。
+\parinterval 在图\ref{fig:7-20}的例子中，“the apple”所对应的调序距离为4，“on the table”所对应的调序距离为$-5$。显然，如果两个源语短语按顺序翻译，则$\rm{start}_i = \rm{end}_{i-1} + 1$，这时调序距离为0。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -898,7 +898,7 @@ dr = start_i-end_{i-1}-1
 %----------------------------------------------------------------------------------------

 \sectionnewpage
-\section{小节及拓展阅读}\label{section-7.8}
+\section{小结及拓展阅读}\label{section-7.8}

 \parinterval 统计机器翻译模型是近三十年内自然语言处理的重要里程碑之一。其统计建模的思想长期影响着自然语言处理的研究。无论是前面介绍的基于单词的模型，还是本章介绍的基于短语的模型，甚至后面即将介绍的基于句法的模型，大家都在尝试回答：究竟应该用什么样的知识对机器翻译进行统计建模？不过，这个问题至今还没有确定的答案。但是，显而易见，统计机器翻译为机器翻译的研究提供了一种范式，即让计算机用概率化的 “知识” 描述翻译问题。这些 “ 知识” 体现在统计模型的结构和参数中，并且可以从大量的双语和单语数据中自动学习。这种建模思想在今天的机器翻译研究中仍然随处可见。


--- a/Chapter8/Figures/figure-an-example-of-phrase-system.tex
+++ b/Chapter8/Figures/figure-an-example-of-phrase-system.tex
@@ -4,19 +4,19 @@
 \begin{tikzpicture}
 \begin{scope}
 \node [anchor=east] (shead) at (0,0) {源语:};
-\node [anchor=west] (swords) at (shead.east) {澳洲\ \ 是\ \ 与\ \ 北韩\ \ 有\ \ 邦交\ \ 的\ \ 少数\ \ 国家\ \ 之一};
+\node [anchor=west] (swords) at (shead.east) {巴基斯坦\ \ 是\ \ 与\ \ 中国\ \ 有\ \ 邦交\ \ 的\ \ 多数\ \ 国家\ \ 之一};
 \node [anchor=north east] (thead) at ([yshift=-0.8em]shead.south east) {短语系统:};
-\node [anchor=west] (twords) at (thead.east) {Australia is diplomatic relations with North Korea};
-\node [anchor=north west] (twords2) at ([yshift=-0.2em]twords.south west) {is one of the few countries};
+\node [anchor=west] (twords) at (thead.east) {Pakistan  is diplomatic relations with China};
+\node [anchor=north west] (twords2) at ([yshift=-0.2em]twords.south west) {is one of the many countries};
 \node [anchor=north east] (rhead) at ([yshift=-2.2em]thead.south east) {参考译文:};
-\node [anchor=west] (rwords) at (rhead.east) {Australia is one of the few countries that have};
-\node [anchor=north west] (rwords2) at ([yshift=-0.2em]rwords.south west) {diplomatic relations with North Korea};
+\node [anchor=west] (rwords) at (rhead.east) {Pakistan  is one of the many countries that have};
+\node [anchor=north west] (rwords2) at ([yshift=-0.2em]rwords.south west) {diplomatic relations with China};

 \begin{pgfonlayer}{background}
 {
-\draw[fill=red!20,draw=white] ([xshift=-5.4em]twords.north) rectangle ([xshift=10.8em]twords.south);
-\draw[fill=blue!20,draw=white] ([xshift=-4.6em]twords2.north) rectangle ([xshift=6.1em]twords2.south);
-\node [anchor=south east,inner sep=1pt,fill=black] (l1) at ([xshift=10.8em]twords.south) {\tiny{{\color{white} 1}}};
+\draw[fill=red!20,draw=white] ([xshift=-5.1em]twords.north) rectangle ([xshift=9.1em]twords.south);
+\draw[fill=blue!20,draw=white] ([xshift=-4.9em]twords2.north) rectangle ([xshift=6.1em]twords2.south);
+\node [anchor=south east,inner sep=1pt,fill=black] (l1) at ([xshift=9.1em]twords.south) {\tiny{{\color{white} 1}}};
 \node [anchor=south east,inner sep=1pt,fill=black] (l2) at ([xshift=6.1em]twords2.south) {\tiny{{\color{white} 2}}};
 }
 \end{pgfonlayer}

--- a/Chapter8/Figures/figure-combination-of-translation-with-different-rules.tex
+++ b/Chapter8/Figures/figure-combination-of-translation-with-different-rules.tex
@@ -7,34 +7,33 @@
 \node[anchor=north] (q1) at (0,0) {\scriptsize\sffamily\bfseries{输入字符串：}};
 \node[anchor=west] (q2) at ([xshift=0em,yshift=-2em]q1.west) {\footnotesize{进口$\quad$和$\quad$出口$\quad$大幅度$\quad$下降$\quad$了}};

-%\node[anchor=north,fill=blue!20,minimum height=1em,minimum width=1em] (f1) at ([xshift=-4.1em,yshift=-0.8em]q2.south) {};

-\node[anchor=north,fill=blue!20,minimum height=4em,minimum width=1em] (f1) at ([xshift=2.2em,yshift=-0.7em]q2.south) {};
+\node[anchor=north,fill=blue!20,minimum height=4em,minimum width=1em] (f1) at ([xshift=2.7em,yshift=-0.7em]q2.south) {};

 \node[anchor=east] (n1) at ([xshift=1em,yshift=-2em]q2.west) {\scriptsize\sffamily\bfseries{匹配规则：}};

-\node[anchor=west] (n2) at ([xshift=0em,yshift=0em]n1.east) {\scriptsize{$\textrm{X} \to  \langle\ \textrm{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \textrm{X}_1\ \textrm{have}\ \textrm{drastically}\ \textrm{fallen}\ \rangle$}};
+\node[anchor=west] (n2) at ([xshift=0em,yshift=0em]n1.east) {\scriptsize{$\seq{X} \to  \langle\ \seq{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \seq{X}_1\ \textrm{have}\ \textrm{drastically}\ \textrm{fallen}\ \rangle$}};

-\node[anchor=west] (n3) at ([xshift=0em,yshift=-1.5em]n2.west) {\scriptsize{$\textrm{X} \to  \langle\ \textrm{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \textrm{X}_1\ \textrm{have}\ \textrm{fallen}\ \textrm{drastically}\ \rangle$}};
+\node[anchor=west] (n3) at ([xshift=0em,yshift=-1.5em]n2.west) {\scriptsize{$\seq{X} \to  \langle\ \seq{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \seq{X}_1\ \textrm{have}\ \textrm{fallen}\ \textrm{drastically}\ \rangle$}};

-\node[anchor=west] (n4) at ([xshift=0em,yshift=-1.5em]n3.west) {\scriptsize{$\textrm{X} \to  \langle\ \textrm{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \textrm{X}_1\ \textrm{has}\ \textrm{drastically}\ \textrm{fallen}\ \rangle$}};
+\node[anchor=west] (n4) at ([xshift=0em,yshift=-1.5em]n3.west) {\scriptsize{$\seq{X} \to  \langle\ \seq{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \seq{X}_1\ \textrm{has}\ \textrm{drastically}\ \textrm{fallen}\ \rangle$}};

 \draw[decorate,decoration={mirror,brace}]([xshift=0.5em,yshift=-1em]q2.west) --([xshift=7em,yshift=-1em]q2.west) node [xshift=0em,yshift=-1em,align=center](label1) {};	

 {\scriptsize
 \node[anchor=west] (h1) at ([xshift=1em,yshift=-15em]q2.west) {{Span[0,3]下的翻译假设：}};
-\node[anchor=west] (h2) at ([xshift=0em,yshift=-1.3em]h1.west) {{X：imports and exports}};
-\node[anchor=west] (h6) at ([xshift=0em,yshift=-1.3em]h2.west) {{S：the import and export}};
+\node[anchor=west] (h2) at ([xshift=0em,yshift=-1.3em]h1.west) {{$\seq{X}$：imports and exports}};
+\node[anchor=west] (h6) at ([xshift=0em,yshift=-1.3em]h2.west) {{$\seq{S}$：the import and export}};
 }

 {\scriptsize
-\node[anchor=west] (h21) at ([xshift=9em,yshift=5.0em]h1.east) {{替换$\textrm{X}_1$后生成的翻译假设：}};
-\node[anchor=west] (h22) at ([xshift=0em,yshift=-1.3em]h21.west) {{X：imports and exports have drastically fallen}};
-\node[anchor=west] (h23) at ([xshift=0em,yshift=-1.3em]h22.west) {{X：the import and export have drastically fallen}};
-\node[anchor=west] (h24) at ([xshift=0em,yshift=-1.3em]h23.west) {{X：imports and exports have drastically fallen}};
-\node[anchor=west] (h25) at ([xshift=0em,yshift=-1.3em]h24.west) {{X：the import and export have drastically fallen}};
-\node[anchor=west] (h26) at ([xshift=0em,yshift=-1.3em]h25.west) {{X：imports and exports has drastically fallen}};
-\node[anchor=west] (h27) at ([xshift=0em,yshift=-1.3em]h26.west) {{X：the import and export has drastically fallen}};
+\node[anchor=west] (h21) at ([xshift=9em,yshift=5.0em]h1.east) {{替换$\seq{X}_1$后生成的翻译假设：}};
+\node[anchor=west] (h22) at ([xshift=0em,yshift=-1.3em]h21.west) {{$\seq{X}$：imports and exports have drastically fallen}};
+\node[anchor=west] (h23) at ([xshift=0em,yshift=-1.3em]h22.west) {{$\seq{X}$：the import and export have drastically fallen}};
+\node[anchor=west] (h24) at ([xshift=0em,yshift=-1.3em]h23.west) {{$\seq{X}$：imports and exports have drastically fallen}};
+\node[anchor=west] (h25) at ([xshift=0em,yshift=-1.3em]h24.west) {{$\seq{X}$：the import and export have drastically fallen}};
+\node[anchor=west] (h26) at ([xshift=0em,yshift=-1.3em]h25.west) {{$\seq{X}$：imports and exports has drastically fallen}};
+\node[anchor=west] (h27) at ([xshift=0em,yshift=-1.3em]h26.west) {{$\seq{X}$：the import and export has drastically fallen}};
 }

 \node [rectangle,inner sep=0.1em,rounded corners=1pt,draw] [fit = (h1) (h2) (h6)] (gl1) {};

--- a/Chapter8/Figures/figure-different-representations-of-syntax-tree.tex
+++ b/Chapter8/Figures/figure-different-representations-of-syntax-tree.tex
@@ -15,7 +15,7 @@
 \end{scope}
 	
 \node [anchor=north west] (cap1) at (-1.5em,-1in) {{(a) 树状表示}};
-\node [anchor=west] (cap2) at ([xshift=0.5in]cap1.east) {{(b) 序列表示(缩进)}};
+\node [anchor=west] (cap2) at ([xshift=0.5in]cap1.east) {{(b) 序列表示（缩进）}};
 \node [anchor=west] (cap3) at ([xshift=0.5in]cap2.east) {{(c) 序列表示}};
 }
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter8/Figures/figure-example-of-translation-use-syntactic-structure.tex
+++ b/Chapter8/Figures/figure-example-of-translation-use-syntactic-structure.tex
@@ -8,16 +8,18 @@

 {\scriptsize

-\node[anchor=west] (ref) at (0,0) {{\sffamily\bfseries{人工翻译:}} {\red{After}} North Korea demanded concessions from U.S. again before the start of a new round of six-nation talks ...};

-\node[anchor=north west] (hifst) at ([yshift=-0.3em]ref.south west) {{\sffamily\bfseries{机器翻译:}} \blue{In}\black{} the new round of six-nation talks on North Korea again demanded that U.S. in the former promise ...};
+
+\node[anchor=west] (ref) at (0,0) {{\sffamily\bfseries{人工翻译:}} {\red{After}} the school team won the Championship of the China University Basketball Association for the first time ...};
+
+\node[anchor=north west] (hifst) at ([yshift=-0.3em]ref.south west) {{\sffamily\bfseries{机器翻译:}} \blue{In}\black{} the school team won the Chinese College Basketball League Championship for the first time ...};

 {
-\node[anchor=north west] (synhifst) at ([yshift=-0.3em]hifst.south west) {\sffamily\bfseries{更好?:}};
+\node[anchor=north west] (synhifst) at ([yshift=-0.2em]hifst.south west) {\sffamily\bfseries{更好?:}};

-\node[anchor=west, fill=red!20!white, inner sep=0.3em] (synhifstpart1) at ([xshift=-0.5em]synhifst.east) {After};
+\node[anchor=west, fill=red!20!white, inner sep=0.3em] (synhifstpart1) at ([xshift=-0.3em]synhifst.east) {After};

-\node[anchor=west, fill=blue!20!white, inner sep=0.25em] (synhifstpart2) at ([xshift=0.1em,yshift=-0.05em]synhifstpart1.east) {North Korea again demanded that U.S. promised concessions before the new round of six-nation talks};
+\node[anchor=west, fill=blue!20!white, inner sep=0.25em] (synhifstpart2) at ([xshift=0.1em,yshift=-0.05em]synhifstpart1.east) {the school team won the Championship of the China University Basketball Association for the first time};

 \node[anchor=west] (synhifstpart3) at ([xshift=-0.2em]synhifstpart2.east) {...};
 }
@@ -25,9 +27,9 @@
 \node [anchor=west] (inputlabel) at ([yshift=-0.4in]synhifst.west) {\sffamily\bfseries{输入:}};

 \node [anchor=west,minimum height=12pt] (inputseg1) at (inputlabel.east) {在$_1$ };
-\node [anchor=west,minimum height=12pt] (inputseg2) at ([xshift=0.2em]inputseg1.east) {北韩$_2$ 再度$_3$ 要求$_4$ 美国$_5$ 于$_6$ 新$_7$ 回合$_8$ 六$_9$ 国$_{10}$ 会谈$_{11}$ 前$_{12}$ 承诺$_{13}$ 让步$_{14}$};
-\node [anchor=west,minimum height=12pt] (inputseg3) at ([xshift=0.2em]inputseg2.east) {后$_{15}$};
-\node [anchor=west,minimum height=12pt] (inputseg4) at ([xshift=0.2em]inputseg3.east) {,$_{16}$};
+\node [anchor=west,minimum height=12pt] (inputseg2) at ([xshift=0.2em]inputseg1.east) {学校$_2$ 球队$_3$ 首次$_4$ 夺得$_5$ 中国$_6$ 大学生$_7$ 篮球$_8$ 联赛$_9$ 冠军$_{10}$};
+\node [anchor=west,minimum height=12pt] (inputseg3) at ([xshift=0.2em]inputseg2.east) {后$_{11}$};
+\node [anchor=west,minimum height=12pt] (inputseg4) at ([xshift=0.2em]inputseg3.east) {,$_{12}$};
 \node [anchor=west,minimum height=12pt] (inputseg5) at ([xshift=0.2em]inputseg4.east) {...};

 {
@@ -45,17 +47,17 @@
 }

 {
-\node [anchor=north east,align=left] (nolimitlabel) at (synlabel1.south west) {\tiny{短语结构树很容易捕捉}\\\tiny{这种介词短语结构}};
+\node [anchor=north east,align=left] (nolimitlabel) at (synlabel1.south west) {\scriptsize{短语结构树很容易捕捉}\\\scriptsize{这种介词短语结构}};
 }

 {
 \node [anchor=west,minimum height=12pt,fill=red!20] (inputseg1) at (inputlabel.east) {在$_1$ };
-\node [anchor=west,minimum height=12pt,fill=blue!20] (inputseg2) at ([xshift=0.2em]inputseg1.east) {北韩$_2$ 再度$_3$ 要求$_4$ 美国$_5$ 于$_6$ 新$_7$ 回合$_8$ 六$_9$ 国$_{10}$ 会谈$_{11}$ 前$_{12}$ 承诺$_{13}$ 让步$_{14}$};
+\node [anchor=west,minimum height=12pt,fill=blue!20] (inputseg2) at ([xshift=0.2em]inputseg1.east) {学校$_2$ 球队$_3$ 首次$_4$ 夺得$_5$ 中国$_6$ 大学生$_7$ 篮球$_8$ 联赛$_9$ 冠军$_{10}$};
 \node [anchor=west,minimum height=12pt,fill=red!20] (inputseg3) at ([xshift=0.2em]inputseg2.east) {后$_{15}$};

 \path [draw,->,dashed] (inputseg1.north) .. controls +(north:0.2) and +(south:0.3) ..  ([xshift=1em]synhifstpart1.south);
 \path [draw,->,dashed] (inputseg3.north) .. controls +(north:0.2) and +(south:0.6) ..  ([xshift=1em]synhifstpart1.south);
-\path [draw,->,dashed] ([xshift=-0.5in]inputseg2.north) --  ([xshift=-0.6in]synhifstpart2.south);
+\path [draw,->,dashed] ([xshift=-0.8in]inputseg2.north) --  ([xshift=1.9in]synhifstpart2.south);
 }

 }

--- a/Chapter8/Figures/figure-examples-of-translation-with-complex-ordering.tex
+++ b/Chapter8/Figures/figure-examples-of-translation-with-complex-ordering.tex
@@ -9,9 +9,9 @@

 \node[anchor=west] (ref) at (0,0) {{\sffamily\bfseries{参考答案：}} The Chinese star performance troupe presented a wonderful Peking opera as well as singing and dancing };

-\node[anchor=north west] (ref2) at (ref.south west) {{\color{white} \sffamily\bfseries{Reference:}} performance to Hong Kong audience .};
+\node[anchor=north west] (ref2) at (ref.south west) {{\color{white} \sffamily\bfseries{Reference:}} performance to the national audience .};

-\node[anchor=north west] (hifst) at (ref2.south west) {{\sffamily\bfseries{层次短语系统：}} Star troupe of China, highlights of Peking opera and dance show to the audience of Hong Kong .};
+\node[anchor=north west] (hifst) at (ref2.south west) {{\sffamily\bfseries{层次短语系统：}} Star troupe of China, highlights of Peking opera and dance show to the audience of the national .};

 \node[anchor=north west] (synhifst) at (hifst.south west) {{\sffamily\bfseries{句法系统：}} Chinese star troupe};

@@ -21,7 +21,7 @@

 \node[anchor=west, fill=red!20!white, inner sep=0.40em] (synhifstpart4) at ([xshift=0.2em]synhifstpart3.east) {to};

-\node[anchor=west, fill=purple!20!white, inner sep=0.25em] (synhifstpart5) at ([xshift=0.2em]synhifstpart4.east) {Hong Kong audience};
+\node[anchor=west, fill=purple!20!white, inner sep=0.25em] (synhifstpart5) at ([xshift=0.2em]synhifstpart4.east) {the national audience};

 \node[anchor=west] (synhifstpart6) at (synhifstpart5.east) {.};

@@ -39,7 +39,7 @@
            ]
            [.\node(tn8){PP};
                [.\node(tn9){P}; \node[fill=red!20!white](seg5){给$_{12}$}; ]
-                [.\node(tn10){NP}; \edge[roof]; \node[fill=purple!20!white](seg6){香港$_{13}$ 观众$_{14}$}; ]
+                [.\node(tn10){NP}; \edge[roof]; \node[fill=purple!20!white](seg6){全国$_{13}$ 观众$_{14}$}; ]
            ]
        ]
        [.\node(tn11){.}; ]

--- a/Chapter8/Figures/figure-execution-of-cube-pruning.tex
+++ b/Chapter8/Figures/figure-execution-of-cube-pruning.tex
@@ -3,10 +3,10 @@
 \tikzstyle{selectnode} = [rectangle,fill=green!20,minimum height=1.5em,minimum width=1.5em,inner sep=1.2pt]
 \tikzstyle{srcnode} = [rotate=45,anchor=south west]
 \begin{scope}[scale=0.85]
-\node [anchor=west] (s1) at (0,0) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from}\ \textrm{X}_1>$}};
-\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{since}\ \textrm{X}_1>$}};
-\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from the}\ \textrm{X}_1>$}};
-\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{through}\ \textrm{X}_1>$}};
+\node [anchor=west] (s1) at (0,0) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from}\ \seq{X}_1>$}};
+\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{since}\ \seq{X}_1>$}};
+\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from the}\ \seq{X}_1>$}};
+\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{through}\ \seq{X}_1>$}};

 \node [anchor=center,alignmentnode] (alig1) at ([xshift=12.0em,yshift=0em]s1.west) {};
 \node [anchor=center,alignmentnode] (alig11) at ([xshift=2.2em]alig1.center) {};
@@ -45,10 +45,10 @@

 %图2
 \begin{scope}[xshift=18.0em,scale=0.85]
-\node [anchor=west] (s1) at (0,0) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from}\ \textrm{X}_1>$}};
-\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{since}\ \textrm{X}_1>$}};
-\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from the}\ \textrm{X}_1>$}};
-\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{through}\ \textrm{X}_1>$}};
+\node [anchor=west] (s1) at (0,0) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from}\ \seq{X}_1>$}};
+\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{since}\ \seq{X}_1>$}};
+\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from the}\ \seq{X}_1>$}};
+\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{through}\ \seq{X}_1>$}};

 \node [anchor=center,alignmentnode] (alig1) at ([xshift=12.0em,yshift=0em]s1.west) {};
 \node [anchor=center,alignmentnode] (alig11) at ([xshift=2.2em]alig1.center) {};
@@ -92,10 +92,10 @@

 %图3
 \begin{scope}[yshift=-13.0em,scale=0.85]
-\node [anchor=west] (s1) at (0,0) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from}\ \textrm{X}_1>$}};
-\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{since}\ \textrm{X}_1>$}};
-\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from the}\ \textrm{X}_1>$}};
-\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{through}\ \textrm{X}_1>$}};
+\node [anchor=west] (s1) at (0,0) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from}\ \seq{X}_1>$}};
+\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{since}\ \seq{X}_1>$}};
+\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from the}\ \seq{X}_1>$}};
+\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{through}\ \seq{X}_1>$}};

 \node [anchor=center,alignmentnode] (alig1) at ([xshift=12.0em,yshift=0em]s1.west) {};
 \node [anchor=center,alignmentnode] (alig11) at ([xshift=2.2em]alig1.center) {};
@@ -143,10 +143,10 @@

 %图4
 \begin{scope}[xshift=18.0em,yshift=-13.0em,scale=0.85]
-\node [anchor=west] (s1) at (0,0) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from}\ \textrm{X}_1>$}};
-\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{since}\ \textrm{X}_1>$}};
-\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from the}\ \textrm{X}_1>$}};
-\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{through}\ \textrm{X}_1>$}};
+\node [anchor=west] (s1) at (0,0) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from}\ \seq{X}_1>$}};
+\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{since}\ \seq{X}_1>$}};
+\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from the}\ \seq{X}_1>$}};
+\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{through}\ \seq{X}_1>$}};

 \node [anchor=center,alignmentnode] (alig1) at ([xshift=12.0em,yshift=0em]s1.west) {};
 \node [anchor=center,alignmentnode] (alig11) at ([xshift=2.2em]alig1.center) {};

--- a/Chapter8/Figures/figure-hierarchical-phrase-rule-match-generate.tex
+++ b/Chapter8/Figures/figure-hierarchical-phrase-rule-match-generate.tex
@@ -7,29 +7,29 @@
 \node[anchor=north] (q1) at (0,0) {\scriptsize\sffamily\bfseries{输入字符串：}};
 \node[anchor=west] (q2) at ([xshift=0em,yshift=-2em]q1.west) {\footnotesize{进口$\quad$和$\quad$出口$\quad$大幅度$\quad$下降$\quad$了}};

-\node[anchor=north,fill=blue!20,minimum height=1em,minimum width=1em] (f1) at ([xshift=-4.1em,yshift=-0.8em]q2.south) {};
+\node[anchor=north,fill=blue!20,minimum height=1em,minimum width=1em] (f1) at ([xshift=-4.2em,yshift=-0.8em]q2.south) {};

 \node[anchor=east] (n1) at ([xshift=1em,yshift=-2em]q2.west) {\scriptsize\sffamily\bfseries{匹配规则：}};

-\node[anchor=west] (n2) at ([xshift=0em,yshift=0em]n1.east) {\scriptsize{$\textrm{X} \to  \langle\ \textrm{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \textrm{X}_1\ \textrm{have}\ \textrm{drastically}\ \textrm{fallen}\ \rangle$}};
+\node[anchor=west] (n2) at ([xshift=0em,yshift=0em]n1.east) {\scriptsize{$$\seq{X}$ \to  \langle\ $\seq{X}$_1\ \text{大幅度}\ \text{下降}\ \text{了},\ $\seq{X}$_1\ \textrm{have}\ \textrm{drastically}\ \textrm{fallen}\ \rangle$}};

 \draw[decorate,decoration={mirror,brace}]([xshift=0.5em,yshift=-1em]q2.west) --([xshift=7em,yshift=-1em]q2.west) node [xshift=0em,yshift=-1em,align=center](label1) {};	

 {\scriptsize
 \node[anchor=west] (h1) at ([xshift=1em,yshift=-7em]q2.west) {{Span[0,3]下的翻译假设：}};
-\node[anchor=west] (h2) at ([xshift=0em,yshift=-1.3em]h1.west) {{X：the imports and exports}};
-\node[anchor=west] (h3) at ([xshift=0em,yshift=-1.3em]h2.west) {{X：imports and exports}};
-\node[anchor=west] (h4) at ([xshift=0em,yshift=-1.3em]h3.west) {{X：exports and imports}};
-\node[anchor=west] (h5) at ([xshift=0em,yshift=-1.3em]h4.west) {{X：the imports and the exports}};
-\node[anchor=west] (h6) at ([xshift=0em,yshift=-1.3em]h5.west) {{S：the import and export}};
+\node[anchor=west] (h2) at ([xshift=0em,yshift=-1.3em]h1.west) {{$\seq{X}$：the imports and exports}};
+\node[anchor=west] (h3) at ([xshift=0em,yshift=-1.3em]h2.west) {{$\seq{X}$：imports and exports}};
+\node[anchor=west] (h4) at ([xshift=0em,yshift=-1.3em]h3.west) {{$\seq{X}$：exports and imports}};
+\node[anchor=west] (h5) at ([xshift=0em,yshift=-1.3em]h4.west) {{$\seq{X}$：the imports and the exports}};
+\node[anchor=west] (h6) at ([xshift=0em,yshift=-1.3em]h5.west) {{$\seq{S}$：the import and export}};
 }

 {\scriptsize
-\node[anchor=west] (h21) at ([xshift=9em,yshift=0em]h1.east) {{替换$\textrm{X}_1$后生成的翻译假设：}};
-\node[anchor=west] (h22) at ([xshift=0em,yshift=-1.3em]h21.west) {{X：the imports and exports have drastically fallen}};
-\node[anchor=west] (h23) at ([xshift=0em,yshift=-1.3em]h22.west) {{X：imports and exports have drastically fallen}};
-\node[anchor=west] (h24) at ([xshift=0em,yshift=-1.3em]h23.west) {{X：exports and imports have drastically fallen}};
-\node[anchor=west] (h25) at ([xshift=0em,yshift=-1.3em]h24.west) {{X：the imports and the exports have drastically fallen}};
+\node[anchor=west] (h21) at ([xshift=9em,yshift=0em]h1.east) {{替换$$\seq{X}$_1$后生成的翻译假设：}};
+\node[anchor=west] (h22) at ([xshift=0em,yshift=-1.3em]h21.west) {{$\seq{X}$：the imports and exports have drastically fallen}};
+\node[anchor=west] (h23) at ([xshift=0em,yshift=-1.3em]h22.west) {{$\seq{X}$：imports and exports have drastically fallen}};
+\node[anchor=west] (h24) at ([xshift=0em,yshift=-1.3em]h23.west) {{$\seq{X}$：exports and imports have drastically fallen}};
+\node[anchor=west] (h25) at ([xshift=0em,yshift=-1.3em]h24.west) {{$\seq{X}$：the imports and the exports have drastically fallen}};
 }

 \node [rectangle,inner sep=0.1em,rounded corners=1pt,draw] [fit = (h1) (h5) (h6)] (gl1) {};

--- a/Chapter8/Figures/figure-processing-of-hierarchical-phrase-system.tex
+++ b/Chapter8/Figures/figure-processing-of-hierarchical-phrase-system.tex
@@ -9,7 +9,7 @@
 \tikzstyle{decodingnode} = [minimum width=7em,minimum height=1.7em,fill=green!20,rounded corners=0.3em];

 \node [datanode,anchor=north west,minimum height=1.7em,minimum width=8em] (bitext) at (0,0) {{ \small{训练用双语数据}}};
-\node [modelnode, anchor=north west,minimum height=1.7em,minimum width=8em] (gi) at ([xshift=2em,yshift=-0.2em]bitext.south east) {{ \small{文法(规则)抽取}}};
+\node [modelnode, anchor=north west,minimum height=1.7em,minimum width=8em] (gi) at ([xshift=2em,yshift=-0.2em]bitext.south east) {{ \small{文法（规则）抽取}}};
 \node [datanode,anchor=north east,minimum height=1.7em,minimum width=8em] (birules) at ([xshift=-2em,yshift=-0.2em]gi.south west) {{ \small{同步翻译文法}}};
 \node [modelnode, anchor=north west,minimum height=1.7em,minimum width=8em] (training) at ([xshift=2em,yshift=-0.2em]birules.south east) {{ \small{特征值学习}}};
 \node [datanode,anchor=north east,minimum height=1.7em,minimum width=8em] (model) at ([xshift=-2em,yshift=-0.2em]training.south west) {{ \small{翻译模型}}};

--- a/Chapter8/Figures/figure-rule-matching-base-tree.tex
+++ b/Chapter8/Figures/figure-rule-matching-base-tree.tex
@@ -8,7 +8,7 @@

 \begin{scope}[sibling distance=2pt,level distance=20pt,grow'=up]
 \Tree[.\node(treeroot){IP};
-     [.NP [.NR 阿都拉$_1$ ]]
+     [.NP [.NR 市长$_1$ ]]
        [.\node(tn1){VP};
            [.\node(tn2){PP};
                [.\node(tn3){P}; \node(cw1){对$_2$}; ]

--- a/Chapter8/Figures/figure-translation-rule-describe-two-sentence-generation.tex
+++ b/Chapter8/Figures/figure-translation-rule-describe-two-sentence-generation.tex
@@ -7,21 +7,21 @@
 {
 % rule 1 (source)
 \node [anchor=west] (rule1s1) at (0,0) {与};
-\node [anchor=west,inner sep=2pt,fill=black] (rule1s2) at ([xshift=0.5em]rule1s1.east) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule1s2) at ([xshift=0.5em]rule1s1.east) {\scriptsize{{\color{white} $\funp{X}_1$}}};
 \node [anchor=west] (rule1s3) at ([xshift=0.5em]rule1s2.east) {有};
-\node [anchor=west,inner sep=2pt,fill=black] (rule1s4) at ([xshift=0.5em]rule1s3.east) {\scriptsize{{\color{white} $\textrm{X}_2$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule1s4) at ([xshift=0.5em]rule1s3.east) {\scriptsize{{\color{white} $\funp{X}_2$}}};

 % rule 1 (target)
 \node [anchor=west] (rule1t1) at ([xshift=0.8in]rule1s4.east) {have};
-\node [anchor=west,inner sep=2pt,fill=black] (rule1t2) at ([xshift=0.5em]rule1t1.east) {\scriptsize{{\color{white} $\textrm{X}_2$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule1t2) at ([xshift=0.5em]rule1t1.east) {\scriptsize{{\color{white} $\funp{X}_2$}}};
 \node [anchor=west] (rule1t3) at ([xshift=0.5em]rule1t2.east) {with};
-\node [anchor=west,inner sep=2pt,fill=black] (rule1t4) at ([xshift=0.5em]rule1t3.east) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule1t4) at ([xshift=0.5em]rule1t3.east) {\scriptsize{{\color{white} $\funp{X}_1$}}};
 }

 {
 % phrase 1 (source and target)
-\node [anchor=north] (phrase1s1) at ([yshift=-1em]rule1s2.south) {\footnotesize{北韩}};
-\node [anchor=north] (phrase1t1) at ([yshift=-1em]rule1t4.south) {\footnotesize{North Korea}};
+\node [anchor=north] (phrase1s1) at ([yshift=-1em]rule1s2.south) {\footnotesize{巴基斯坦}};
+\node [anchor=north] (phrase1t1) at ([yshift=-1em]rule1t4.south) {\footnotesize{Pakista}};
 }

 {
@@ -48,18 +48,18 @@

 {
 % rule 2 (source)
-\node [anchor=west,inner sep=2pt,fill=black] (rule2s1) at ([yshift=3.5em,xshift=-0.5em]rule1s1.north west) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule2s1) at ([yshift=3.5em,xshift=-0.5em]rule1s1.north west) {\scriptsize{{\color{white} $\funp{X}_1$}}};
 \node [anchor=west] (rule2s2) at ([xshift=0.5em]rule2s1.east) {的};
-\node [anchor=west,inner sep=2pt,fill=black] (rule2s3) at ([xshift=0.5em]rule2s2.east) {\scriptsize{{\color{white} $\textrm{X}_2$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule2s3) at ([xshift=0.5em]rule2s2.east) {\scriptsize{{\color{white} $\funp{X}_2$}}};

 % rule 2 (target)
-\node [anchor=west,inner sep=2pt,fill=black] (rule2t1) at ([xshift=1.8in]rule2s3.east) {\scriptsize{{\color{white} $\textrm{X}_2$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule2t1) at ([xshift=1.8in]rule2s3.east) {\scriptsize{{\color{white} $\funp{X}_2$}}};
 \node [anchor=west] (rule2t2) at ([xshift=0.5em]rule2t1.east) {that};
-\node [anchor=west,inner sep=2pt,fill=black] (rule2t3) at ([xshift=0.5em]rule2t2.east) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule2t3) at ([xshift=0.5em]rule2t2.east) {\scriptsize{{\color{white} $\funp{X}_1$}}};

 % phrase 3 (source and target)
-\node [anchor=north] (phrase3s1) at ([yshift=-0.8em]rule2s3.south) {\footnotesize{少数 国家}};
-\node [anchor=north] (phrase3t1) at ([yshift=-0.8em]rule2t1.south) {\footnotesize{the few countries}};
+\node [anchor=north] (phrase3s1) at ([yshift=-0.8em]rule2s3.south) {\footnotesize{多数 国家}};
+\node [anchor=north] (phrase3t1) at ([yshift=-0.8em]rule2t1.south) {\footnotesize{the many countries}};

 % edges (phrase 3 to rule 2 and rule1 to rule2)
 \draw [->] (phrase3s1.north) -- ([yshift=-0.1em]rule2s3.south);
@@ -78,12 +78,12 @@

 {
 % rule 3 (source)
-\node [anchor=west,inner sep=2pt,fill=black] (rule3s1) at ([yshift=2.5em,xshift=4em]rule2s1.north west) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule3s1) at ([yshift=2.5em,xshift=4em]rule2s1.north west) {\scriptsize{{\color{white} $\funp{X}_1$}}};
 \node [anchor=west] (rule3s2) at ([xshift=0.5em]rule3s1.east) {之一};

 % rule 3 (target)
 \node [anchor=west] (rule3t1) at ([xshift=1.0in]rule3s2.east) {one of};
-\node [anchor=west,inner sep=2pt,fill=black] (rule3t2) at ([xshift=0.5em]rule3t1.east) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule3t2) at ([xshift=0.5em]rule3t1.east) {\scriptsize{{\color{white} $\funp{X}_1$}}};

 % edges: rule 2 to rule 3
 \draw [->] ([xshift=-1em]rule2s.north) ..controls +(north:1.2em) and +(south:1.2em).. ([yshift=-0.1em]rule3s1.south);
@@ -100,18 +100,18 @@

 {
 % rule 4 (source)
-\node [anchor=west,inner sep=2pt,fill=black] (rule4s1) at ([yshift=3.5em,xshift=-3.5em]rule3s1.north west) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule4s1) at ([yshift=3.5em,xshift=-3.5em]rule3s1.north west) {\scriptsize{{\color{white} $\funp{X}_1$}}};
 \node [anchor=west] (rule4s2) at ([xshift=0.5em]rule4s1.east) {是};
-\node [anchor=west,inner sep=2pt,fill=black] (rule4s3) at ([xshift=0.5em]rule4s2.east) {\scriptsize{{\color{white} $\textrm{X}_2$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule4s3) at ([xshift=0.5em]rule4s2.east) {\scriptsize{{\color{white} $\funp{X}_2$}}};

 % rule 2 (target)
-\node [anchor=west,inner sep=2pt,fill=black] (rule4t1) at ([xshift=2.0in]rule4s2.east) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule4t1) at ([xshift=2.0in]rule4s2.east) {\scriptsize{{\color{white} $\funp{X}_1$}}};
 \node [anchor=west] (rule4t2) at ([xshift=0.5em]rule4t1.east) {is};
-\node [anchor=west,inner sep=2pt,fill=black] (rule4t3) at ([xshift=0.5em]rule4t2.east) {\scriptsize{{\color{white} $\textrm{X}_2$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule4t3) at ([xshift=0.5em]rule4t2.east) {\scriptsize{{\color{white} $\funp{X}_2$}}};

 % phrase 4 (source and target)
-\node [anchor=north] (phrase4s1) at ([yshift=-0.8em]rule4s1.south) {\footnotesize{澳洲}};
-\node [anchor=north] (phrase4t1) at ([yshift=-0.8em]rule4t1.south) {\footnotesize{Australia}};
+\node [anchor=north] (phrase4s1) at ([yshift=-0.8em]rule4s1.south) {\footnotesize{巴基斯坦}};
+\node [anchor=north] (phrase4t1) at ([yshift=-0.8em]rule4t1.south) {\footnotesize{Pakista}};

 % edges (phrase 4 to rule 4 and rule3 to rule4)
 \draw [->] (phrase4s1.north) -- ([yshift=-0.1em]rule4s1.south);

--- a/Chapter8/chapter8.tex
+++ b/Chapter8/chapter8.tex
@@ -68,7 +68,7 @@
 \end{figure}
 %-------------------------------------------

-\parinterval 句法树结构可以赋予机器翻译对语言进一步抽象的能力，这样，可以不需要使用连续词串，而是通过句法结构来对大范围的译文生成和调序进行建模。图\ref{fig:8-3}是一个在翻译中融入源语言（汉语）句法信息的实例。这个例子中，介词短语“在 $...$ 后”包含15个单词，因此，使用短语很难涵盖这样的片段。这时，系统会把“在 $...$ 后”错误地翻译为“In $...$”。通过句法树，可以知道“在 $...$ 后”对应着一个完整的子树结构PP（介词短语）。因此也很容易知道介词短语中“在 $...$ 后”是一个模板（红色），而“在”和“后”之间的部分构成从句部分（蓝色）。最终得到正确的译文“After $...$”。
+\parinterval 句法树结构可以赋予机器翻译对语言进一步抽象的能力，这样，可以不需要使用连续词串，而是通过句法结构来对大范围的译文生成和调序进行建模。图\ref{fig:8-3}是一个在翻译中融入源语言（汉语）句法信息的实例。这个例子中，介词短语“在 $...$ 后”包含12个单词，因此，使用短语很难涵盖这样的片段。这时，系统会把“在 $...$ 后”错误地翻译为“In $...$”。通过句法树，可以知道“在 $...$ 后”对应着一个完整的子树结构PP（介词短语）。因此也很容易知道介词短语中“在 $...$ 后”是一个模板（红色），而“在”和“后”之间的部分构成从句部分（蓝色）。最终得到正确的译文“After $...$”。

 \parinterval 使用句法信息在机器翻译中并不新鲜。在基于规则和模板的翻译模型中，就大量使用了句法等结构信息。只是由于早期句法分析技术不成熟，系统的整体效果并不突出。在数据驱动的方法中，句法可以很好地融合在统计建模中。通过概率化的句法设计，可以对翻译过程进行很好的描述。

@@ -92,7 +92,7 @@
 \caption{不同短语在训练数据中出现的频次}
 \label{tab:8-1}
 \begin{tabular}{p{12em} | r}
-短语(中文) & 训练数据中出现的频次 \\
+短语（中文） & 训练数据中出现的频次 \\
 \hline

 包含 & 3341\\
@@ -110,7 +110,7 @@

 \parinterval 显然，利用过长的短语来处理长距离的依赖并不是一种十分有效的方法。过于低频的长短语无法提供可靠的信息，而且使用长短语会导致模型体积急剧增加。

-\parinterval 再来看一个翻译实例\upcite{Chiang2012Hope}，图\ref{fig:8-4}是一个基于短语的机器翻译系统的翻译结果。这个例子中的调序有一些复杂，比如，“少数/国家/之一”和“与/北韩/有/邦交”的英文翻译都需要进行调序，分别是“one of the few countries”和“have diplomatic relations with North Korea”。基于短语的系统可以很好地处理这些调序问题，因为它们仅仅使用了局部的信息。但是，系统却无法在这两个短语（1和2）之间进行正确的调序。
+\parinterval 再来看一个翻译实例\upcite{Chiang2012Hope}，图\ref{fig:8-4}是一个基于短语的机器翻译系统的翻译结果。这个例子中的调序有一些复杂，比如，“多数/国家/之一”和“与/中国/有/邦交”的英文翻译都需要进行调序，分别是“one of the many countries”和“have diplomatic relations with China”。基于短语的系统可以很好地处理这些调序问题，因为它们仅仅使用了局部的信息。但是，系统却无法在这两个短语（1和2）之间进行正确的调序。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -133,30 +133,30 @@

 \parinterval 这里[什么\ \ 东西]和[什么\ \ 事]表示模板中的变量，可以被其他词序列替换。通常，可以把这个模板形式化描述为：
 \begin{eqnarray}
-\langle \ \text{与}\ \textrm{X}_1\ \text{有}\ \textrm{X}_2,\quad \textrm{have}\ \textrm{X}_2\ \textrm{with}\ \textrm{X}_1\ \rangle \nonumber
+\langle \ \text{与}\ \funp{X}_1\ \text{有}\ \funp{X}_2,\quad \textrm{have}\ \funp{X}_2\ \textrm{with}\ \funp{X}_1\ \rangle \nonumber
 \end{eqnarray}

-\noindent 其中，逗号分隔了源语言和目标语言部分，$\textrm{X}_1$和$\textrm{X}_2$表示模板中需要替换的内容，或者说变量。源语言中的变量和目标语言中的变量是一一对应的，比如，源语言中的$\textrm{X}_1$ 和目标语言中的$\textrm{X}_1$代表这两个变量可以“同时”被替换。假设给定短语对：
+\noindent 其中，逗号分隔了源语言和目标语言部分，$\funp{X}_1$和$\funp{X}_2$表示模板中需要替换的内容，或者说变量。源语言中的变量和目标语言中的变量是一一对应的，比如，源语言中的$\funp{X}_1$ 和目标语言中的$\funp{X}_1$代表这两个变量可以“同时”被替换。假设给定短语对：
 \begin{eqnarray}
-\langle \ \text{北韩},\quad \textrm{North Korea} \ \rangle \qquad\ \quad\quad\ \  \nonumber \\
+\langle \ \text{巴基斯坦},\quad \textrm{Pakistan} \ \rangle \qquad\ \quad\quad\ \  \nonumber \\
 \langle \ \text{邦交},\quad \textrm{diplomatic relations} \ \rangle\quad\ \ \ \nonumber
 \end{eqnarray}

-\parinterval 可以使用第一个短语替换模板中的变量$\textrm{X}_1$，得到：
+\parinterval 可以使用第一个短语替换模板中的变量$\funp{X}_1$，得到：
 \begin{eqnarray}
-\langle \ \text{与}\ \text{[北韩]}\ \text{有}\ \textrm{X}_2,\quad \textrm{have}\ \textrm{X}_2\ \textrm{with}\ \textrm{[North Korea]} \ \rangle \nonumber
+\langle \ \text{与}\ \text{[巴基斯坦]}\ \text{有}\ \funp{X}_2,\quad \textrm{have}\ \funp{X}_2\ \textrm{with}\ \textrm{[Pakistan]} \ \rangle \nonumber
 \end{eqnarray}

-\noindent 其中，$[\cdot]$表示被替换的部分。可以看到，在源语言和目标语言中，$\textrm{X}_1$被同时替换为相应的短语。进一步，可以用第二个短语替换$\textrm{X}_2$，得到：
+\noindent 其中，$[\cdot]$表示被替换的部分。可以看到，在源语言和目标语言中，$\funp{X}_1$被同时替换为相应的短语。进一步，可以用第二个短语替换$\funp{X}_2$，得到：
 \begin{eqnarray}
-\quad\langle \ \text{与}\ \text{北韩}\ \text{有}\ \text{[邦交]},\quad \textrm{have}\ \textrm{[diplomatic relations]}\ \textrm{with}\ \textrm{North Korea} \ \rangle \nonumber
+\quad\langle \ \text{与}\ \text{巴基斯坦}\ \text{有}\ \text{[邦交]},\quad \textrm{have}\ \textrm{[diplomatic relations]}\ \textrm{with}\ \textrm{Pakista} \ \rangle \nonumber
 \end{eqnarray}

 \parinterval 至此，就得到了一个完整词串的译文。类似的，还可以写出其他的翻译模板，如下：
 \begin{eqnarray}
-\langle \ \textrm{X}_1\ \text{是}\ \textrm{X}_2,\quad \textrm{X}_1\ \textrm{is}\ \textrm{X}_2 \ \rangle \qquad\qquad\ \nonumber \\
-\langle \ \textrm{X}_1\ \text{之一},\quad \textrm{one}\ \textrm{of}\ \textrm{X}_1 \ \rangle \qquad\qquad\ \nonumber \\
-\langle \ \textrm{X}_1\ \text{的}\ \textrm{X}_2,\quad \textrm{X}_2\ \textrm{that}\ \textrm{have}\ \textrm{X}_1\ \rangle\quad\ \nonumber
+\langle \ \funp{X}_1\ \text{是}\ \funp{X}_2,\quad \funp{X}_1\ \textrm{is}\ \funp{X}_2 \ \rangle \qquad\qquad\ \nonumber \\
+\langle \ \funp{X}_1\ \text{之一},\quad \textrm{one}\ \textrm{of}\ \funp{X}_1 \ \rangle \qquad\qquad\ \nonumber \\
+\langle \ \funp{X}_1\ \text{的}\ \funp{X}_2,\quad \funp{X}_2\ \textrm{that}\ \textrm{have}\ \funp{X}_1\ \rangle\quad\ \nonumber
 \end{eqnarray}

 \parinterval 使用上面这种变量替换的方式，就可以得到一个完整句子的翻译。
@@ -217,16 +217,16 @@

 \parinterval 下面是一个简单的SCFG实例：
 \begin{eqnarray}
-\textrm{S}\ &\to\ &\langle \ \textrm{NP}_1\ \text{希望}\ \textrm{VP}_2,\quad \textrm{NP}_1\ \textrm{wish}\ \textrm{to}\ \textrm{VP}_2\ \rangle \nonumber \\
+\funp{S}\ &\to\ &\langle \ \textrm{NP}_1\ \text{希望}\ \textrm{VP}_2,\quad \textrm{NP}_1\ \textrm{wish}\ \textrm{to}\ \textrm{VP}_2\ \rangle \nonumber \\
 \textrm{VP}\ &\to\ &\langle \ \text{对}\ \textrm{NP}_1\ \text{感到}\ \textrm{VP}_2,\quad \textrm{be}\ \textrm{VP}_2\ \textrm{wish}\ \textrm{NP}_1\ \rangle \nonumber \\
 \textrm{NN}\ &\to\ &\langle \ \text{强大},\quad \textrm{strong}\ \rangle \nonumber
 \end{eqnarray}

 \parinterval 这里的S、NP、VP等符号可以被看作是具有句法功能的标记，因此这个文法和传统句法分析中的CFG很像，只是CFG是单语文法，而SCFG是双语同步文法。非终结符的下标表示对应关系，比如，源语言的NP$_1$和目标语言的NP$_1$是对应的。因此，在上面这种表示形式中，两种语言间非终结符的对应关系$\sim$是隐含在变量下标中的。当然，复杂的句法功能标记并不是必须的。比如，也可以使用更简单的文法形式：
 \begin{eqnarray}
-\textrm{X}\ &\to\ &\langle \ \textrm{X}_1\ \text{希望}\ \textrm{X}_2,\quad \textrm{X}_1\ \textrm{wish}\ \textrm{to}\ \textrm{X}_2\ \rangle \nonumber \\
-\textrm{X}\ &\to\ &\langle \ \text{对}\ \textrm{X}_1\ \text{感到}\ \textrm{X}_2,\quad \textrm{be}\ \textrm{X}_2\ \textrm{wish}\ \textrm{X}_1\ \rangle \nonumber \\
-\textrm{X}\ &\to\ &\langle \ \text{强大},\quad \textrm{strong}\ \rangle \nonumber
+\funp{X}\ &\to\ &\langle \ \funp{X}_1\ \text{希望}\ \funp{X}_2,\quad \funp{X}_1\ \textrm{wish}\ \textrm{to}\ \funp{X}_2\ \rangle \nonumber \\
+\funp{X}\ &\to\ &\langle \ \text{对}\ \funp{X}_1\ \text{感到}\ \funp{X}_2,\quad \textrm{be}\ \funp{X}_2\ \textrm{wish}\ \funp{X}_1\ \rangle \nonumber \\
+\funp{X}\ &\to\ &\langle \ \text{强大},\quad \textrm{strong}\ \rangle \nonumber
 \end{eqnarray}

 \parinterval 这个文法只有一种非终结符X，因此所有的变量都可以使用任意的产生式进行推导。这就给翻译提供了更大的自由度，也就是说，规则可以被任意使用，进行自由组合。这也符合基于短语的模型中对短语进行灵活拼接的思想。基于此，层次短语系统中也使用这种并不依赖语言学句法标记的文法。在本章的内容中，如果没有特殊说明，把这种没有语言学句法标记的文法称作{\small\bfnew{基于层次短语的文法}}\index{基于层次短语的文法}（Hierarchical Phrase-based Grammar）\index{Hierarchical Phrase-based Grammar}，或简称层次短语文法。
@@ -239,10 +239,10 @@

 \parinterval 下面是一个完整的层次短语文法：
 \begin{eqnarray}
-r_1:\quad \textrm{X}\ &\to\ &\langle \ \text{进口}\ \textrm{X}_1,\quad \textrm{The}\ \textrm{imports}\ \textrm{X}_1\ \rangle \nonumber \\
-r_2:\quad \textrm{X}\ &\to\ &\langle \ \textrm{X}_1\ \text{下降}\ \textrm{X}_2,\quad \textrm{X}_2\ \textrm{X}_1\ \textrm{fallen}\ \rangle \nonumber \\
-r_3:\quad \textrm{X}\ &\to\ &\langle \ \text{大幅度},\quad \textrm{drastically}\ \rangle \nonumber \\
-r_4:\quad \textrm{X}\ &\to\ &\langle \ \text{了},\quad \textrm{have}\ \rangle \nonumber
+r_1:\quad \funp{X}\ &\to\ &\langle \ \text{进口}\ \funp{X}_1,\quad \textrm{The}\ \textrm{imports}\ \funp{X}_1\ \rangle \nonumber \\
+r_2:\quad \funp{X}\ &\to\ &\langle \ \funp{X}_1\ \text{下降}\ \funp{X}_2,\quad \funp{X}_2\ \funp{X}_1\ \textrm{fallen}\ \rangle \nonumber \\
+r_3:\quad \funp{X}\ &\to\ &\langle \ \text{大幅度},\quad \textrm{drastically}\ \rangle \nonumber \\
+r_4:\quad \funp{X}\ &\to\ &\langle \ \text{了},\quad \textrm{have}\ \rangle \nonumber
 \end{eqnarray}

 \noindent 其中，规则$r_1$和$r_2$是含有变量的规则，这些变量可以被其他规则的右部替换；规则$r_2$是调序规则；规则$r_3$和$r_4$是纯词汇化规则，表示单词或者短语的翻译。
@@ -255,11 +255,11 @@ r_4:\quad \textrm{X}\ &\to\ &\langle \ \text{了},\quad \textrm{have}\ \rangle \

 \parinterval 可以进行如下的推导（假设起始符号是X）：
 \begin{eqnarray}
-& & \langle\ \textrm{X}_1, \textrm{X}_1\ \rangle \nonumber \\
-& \xrightarrow[]{r_1} & \langle\ {\red{\textrm{进口}\ \textrm{X}_2}},\ {\red{\textrm{The imports}\ \textrm{X}_2}}\ \rangle \nonumber \\
-& \xrightarrow[]{r_2} & \langle\ \textrm{进口}\ {\red{\textrm{X}_3\ \textrm{下降}\ \textrm{X}_4}},\ \textrm{The imports}\ {\red{\textrm{X}_4\ \textrm{X}_3\ \textrm{fallen}}}\ \rangle \nonumber \\
-& \xrightarrow[]{r_3} & \langle\ \textrm{进口}\ {\red{\textrm{大幅度}}}\ \textrm{下降}\ \textrm{X}_4, \nonumber \\
-& & \ \textrm{The imports}\ \textrm{X}_4\ {\red{\textrm{drastically}}}\ \textrm{fallen}\ \rangle \nonumber \\
+& & \langle\ \funp{X}_1, \funp{X}_1\ \rangle \nonumber \\
+& \xrightarrow[]{r_1} & \langle\ {\red{\textrm{进口}\ \funp{X}_2}},\ {\red{\textrm{The imports}\ \funp{X}_2}}\ \rangle \nonumber \\
+& \xrightarrow[]{r_2} & \langle\ \textrm{进口}\ {\red{\funp{X}_3\ \textrm{下降}\ \funp{X}_4}},\ \textrm{The imports}\ {\red{\funp{X}_4\ \funp{X}_3\ \textrm{fallen}}}\ \rangle \nonumber \\
+& \xrightarrow[]{r_3} & \langle\ \textrm{进口}\ {\red{\textrm{大幅度}}}\ \textrm{下降}\ \funp{X}_4, \nonumber \\
+& & \ \textrm{The imports}\ \funp{X}_4\ {\red{\textrm{drastically}}}\ \textrm{fallen}\ \rangle \nonumber \\
 & \xrightarrow[]{r_4} & \langle\ \textrm{进口}\ \textrm{大幅度}\ \textrm{下降}\ {\red{\textrm{了}}}, \nonumber \\
 & & \ \textrm{The imports}\ {\red{\textrm{have}}}\ \textrm{drastically}\ \textrm{fallen}\ \rangle \nonumber
 \end{eqnarray}
@@ -270,7 +270,7 @@ d = {r_1} \circ {r_2} \circ {r_3} \circ {r_4}
 \label{eq:8-1}
 \end{eqnarray}

-\parinterval 在层次短语模型中，每个翻译推导都唯一地对应一个目标语译文。因此，可以用推导的概率$\textrm{P}(d)$描述翻译的好坏。同基于短语的模型是一样的（见{\chapterseven}数学建模小节），层次短语翻译的目标是：求概率最高的翻译推导$\hat{d}=\arg\max\textrm{P}(d)$。值得注意的是，基于推导的方法在句法分析中也十分常用。层次短语翻译实质上也是通过生成翻译规则的推导来对问题的表示空间进行建模。在\ref{section-8.3} 节还将看到，这种方法可以被扩展到语言学上基于句法的翻译模型中。而且这些模型都可以用一种被称作超图的结构来进行建模。从某种意义上讲，基于规则推导的方法将句法分析和机器翻译进行了形式上的统一。因此机器翻译也借用了很多句法分析的思想。
+\parinterval 在层次短语模型中，每个翻译推导都唯一地对应一个目标语译文。因此，可以用推导的概率$\funp{P}(d)$描述翻译的好坏。同基于短语的模型是一样的（见{\chapterseven}数学建模小节），层次短语翻译的目标是：求概率最高的翻译推导$\hat{d}=\arg\max\funp{P}(d)$。值得注意的是，基于推导的方法在句法分析中也十分常用。层次短语翻译实质上也是通过生成翻译规则的推导来对问题的表示空间进行建模。在\ref{section-8.3} 节还将看到，这种方法可以被扩展到语言学上基于句法的翻译模型中。而且这些模型都可以用一种被称作超图的结构来进行建模。从某种意义上讲，基于规则推导的方法将句法分析和机器翻译进行了形式上的统一。因此机器翻译也借用了很多句法分析的思想。

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -280,16 +280,16 @@ d = {r_1} \circ {r_2} \circ {r_3} \circ {r_4}

 \parinterval 由于翻译现象非常复杂，在实际系统中往往需要把两个局部翻译线性拼接到一起。在层次短语模型中，这个问题通过引入{\small\bfnew{胶水规则}}\index{胶水规则}（Glue Rule）\index{Glue Rule}来处理，形式如下：
 \begin{eqnarray}
-\textrm{S} & \to & \langle\ \textrm{S}_1\ \textrm{X}_2,\ \textrm{S}_1\ \textrm{X}_2\ \rangle \nonumber \\
-\textrm{S} & \to & \langle\ \textrm{X}_1,\ \textrm{X}_1\ \rangle \nonumber
+\funp{S} & \to & \langle\ \funp{S}_1\ \funp{X}_2,\ \funp{S}_1\ \funp{X}_2\ \rangle \nonumber \\
+\funp{S} & \to & \langle\ \funp{X}_1,\ \funp{X}_1\ \rangle \nonumber
 \end{eqnarray}

 \parinterval 胶水规则引入了一个新的非终结符S，S只能和X进行顺序拼接，或者S由X生成。如果把S看作文法的起始符，使用胶水规则后，相当于把句子划分为若干个部分，每个部分都被归纳为X。之后，顺序地把这些X拼接到一起，得到最终的译文。比如，最极端的情况，整个句子会生成一个X，之后再归纳为S，这时并不需要进行胶水规则的顺序拼接；另一种极端的情况，每个单词都是独立地被翻译，被归纳为X，之后先把最左边的X归纳为S，再依次把剩下的X依次拼到一起。这样的推导形式如下：
 \begin{eqnarray}
-\textrm{S} & \to & \langle\ \textrm{S}_1\ \textrm{X}_2,\ \textrm{S}_1\ \textrm{X}_2\ \rangle \nonumber \\
-                & \to & \langle\ \textrm{S}_3\ \textrm{X}_4\ \textrm{X}_2,\ \textrm{S}_3\ \textrm{X}_4\ \textrm{X}_2\ \rangle \nonumber \\
+\funp{S} & \to & \langle\ \funp{S}_1\ \funp{X}_2,\ \funp{S}_1\ \funp{X}_2\ \rangle \nonumber \\
+                & \to & \langle\ \funp{S}_3\ \funp{X}_4\ \funp{X}_2,\ \funp{S}_3\ \funp{X}_4\ \funp{X}_2\ \rangle \nonumber \\
                & \to & ... \nonumber \\
-                & \to & \langle\ \textrm{X}_n\ ...\ \textrm{X}_4\ \textrm{X}_2,\ \textrm{X}_n\ ...\ \textrm{X}_4\ \textrm{X}_2\ \rangle \nonumber
+                & \to & \langle\ \funp{X}_n\ ...\ \funp{X}_4\ \funp{X}_2,\ \funp{X}_n\ ...\ \funp{X}_4\ \funp{X}_2\ \rangle \nonumber
 \end{eqnarray}

 \parinterval 实际上，胶水规则在很大程度上模拟了基于短语的系统中对字符串顺序翻译的操作，而且在实践中发现，这个步骤是十分必要的。特别是对法英翻译这样的任务，由于语言的结构基本上是顺序翻译的，因此引入顺序拼接的操作符合翻译的整体规律。同时，这种拼接给翻译增加了灵活性，系统会更加健壮。
@@ -334,17 +334,17 @@ d = {r_1} \circ {r_2} \circ {r_3} \circ {r_4}
 \begin{definition} 与词对齐相兼容的层次短语规则

 {\small
-对于句对$(\seq{s},\seq{t})$和它们之间的词对齐$\seq{a}$，令$\Phi$表示在句对$(\seq{s},\seq{t})$上与$\seq{a}$相兼容的双语短语集合。则：
+对于句对$(\seq{s},\seq{t})$和它们之间的词对齐$\seq{a}$，令$\varPhi$表示在句对$(\seq{s},\seq{t})$上与$\seq{a}$相兼容的双语短语集合。则：
 \begin{enumerate}
-\item 	如果$(x,y)\in \Phi$，则$\textrm{X} \to \langle x,y,\phi \rangle$是与词对齐相兼容的层次短语规则。
-\item 	对于$(x,y)\in \Phi$，存在$m$个双语短语$(x_i,y_j)\in \Phi$，同时存在(1,$...$,$m$)上面的一个排序$\sim = \{\pi_1 , ... ,\pi_m\}$，且：
+\item 	如果$(x,y)\in \varPhi$，则$\funp{X} \to \langle x,y,\varPhi \rangle$是与词对齐相兼容的层次短语规则。
+\item 	对于$(x,y)\in \varPhi$，存在$m$个双语短语$(x_i,y_j)\in \varPhi$，同时存在(1,$...$,$m$)上面的一个排序$\sim = \{\pi_1 , ... ,\pi_m\}$，且：
 \vspace{-1.5em}
 \begin{eqnarray}
 x&=&\alpha_0 x_1 \alpha_1 x_2 ... \alpha_{m-1} x_m \alpha_m \label{eq:8-2}\\
 y&=&\beta_0 y_{\pi_1} \beta_1 y_{\pi_2} ... \beta_{m-1} y_{\pi_m} \beta_m
 \label{eq:8-3}
 \end{eqnarray}
-其中，${\alpha_0, ... ,\alpha_m}$和${\beta_0, ... ,\beta_m}$表示源语言和目标语言的若干个词串（包含空串）。则$\textrm{X} \to \langle x,y,\sim \rangle$是与词对齐相兼容的层次短语规则。这条规则包含$m$个变量，变量的对齐信息是$\sim$。
+其中，${\alpha_0, ... ,\alpha_m}$和${\beta_0, ... ,\beta_m}$表示源语言和目标语言的若干个词串（包含空串）。则$\funp{X} \to \langle x,y,\sim \rangle$是与词对齐相兼容的层次短语规则。这条规则包含$m$个变量，变量的对齐信息是$\sim$。
 \end{enumerate}
 }
 \end{definition}
@@ -388,15 +388,15 @@ y&=&\beta_0 y_{\pi_1} \beta_1 y_{\pi_2} ... \beta_{m-1} y_{\pi_m} \beta_m

 \begin{itemize}
 \vspace{0.5em}
-\item 	(h1-2)短语翻译概率（取对数），即$\textrm{log}(\textrm{P}(\alpha \mid \beta))$和$\textrm{log}(\textrm{P}(\beta \mid \alpha))$，特征的计算与基于短语的模型完全一样；
+\item 	($h_{1-2}$)短语翻译概率（取对数），即$\textrm{log}(\funp{P}(\alpha \mid \beta))$和$\textrm{log}(\funp{P}(\beta \mid \alpha))$，特征的计算与基于短语的模型完全一样；
 \vspace{0.5em}
-\item 	(h3-4)词汇化翻译概率（取对数），即$\textrm{log}(\textrm{P}_{\textrm{lex}}(\alpha \mid \beta))$和$\textrm{log}(\textrm{P}_{\textrm{lex}}(\beta \mid \alpha))$，特征的计算与基于短语的模型完全一样；
+\item 	($h_{3-4}$)词汇化翻译概率（取对数），即$\textrm{log}(\funp{P}_{\textrm{lex}}(\alpha \mid \beta))$和$\textrm{log}(\funp{P}_{\textrm{lex}}(\beta \mid \alpha))$，特征的计算与基于短语的模型完全一样；
 \vspace{0.5em}
-\item (h5)翻译规则数量，让模型自动学习对规则数量的偏好，同时避免使用过少规则造成分数偏高的现象；
+\item ($h_{5}$)翻译规则数量，让模型自动学习对规则数量的偏好，同时避免使用过少规则造成分数偏高的现象；
 \vspace{0.5em}
-\item (h6)胶水规则数量，让模型自动学习使用胶水规则的偏好；
+\item ($h_{6}$)胶水规则数量，让模型自动学习使用胶水规则的偏好；
 \vspace{0.5em}
-\item (h7)短语规则数量，让模型自动学习使用纯短语规则的偏好。
+\item ($h_{7}$)短语规则数量，让模型自动学习使用纯短语规则的偏好。
 \vspace{0.5em}
 \end{itemize}

@@ -414,7 +414,7 @@ h_i (d,\seq{t},\seq{s})=\sum_{r \in d}h_i (r)

 \parinterval 最终，模型得分被定义为：
 \begin{eqnarray}
-\textrm{score}(d,\seq{t},\seq{s})=\textrm{rscore}(d,\seq{t},\seq{s})+ \lambda_8 \textrm{log}⁡(\textrm{P}_{\textrm{lm}}(\seq{t}))+\lambda_9 \mid \seq{t} \mid
+\textrm{score}(d,\seq{t},\seq{s})=\textrm{rscore}(d,\seq{t},\seq{s})+ \lambda_8 \textrm{log}⁡(\funp{P}_{\textrm{lm}}(\seq{t}))+\lambda_9 \mid \seq{t} \mid
 \label{eq:8-6}
 \end{eqnarray}

@@ -422,9 +422,9 @@ h_i (d,\seq{t},\seq{s})=\sum_{r \in d}h_i (r)

 \begin{itemize}
 \vspace{0.5em}
-\item $\textrm{log}⁡(\textrm{P}_{\textrm{lm}}(\textrm{t}))$表示语言模型得分；
+\item $\textrm{log}⁡(\funp{P}_{\textrm{lm}}(\seq{t}))$表示语言模型得分；
 \vspace{0.5em}
-\item $\mid \textrm{t} \mid$表示译文的长度。
+\item $\mid \seq{t} \mid$表示译文的长度。
 \vspace{0.5em}
 \end{itemize}

@@ -520,7 +520,7 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q
 \vspace{0.5em}
 \item 把层次短语文法转化为乔姆斯基范式，这样可以直接使用原始的CKY方法进行分析；
 \vspace{0.5em}
-\item 对CKY方法进行改造。解码的核心任务要知道每个跨度是否能匹配规则的源语言部分。实际上，层次短语模型的文法是一种特殊的文法。这种文法规则的源语言部分最多包含两个变量，而且变量不能连续。这样的规则会对应一种特定类型的模版，比如，对于包含两个变量的规则，它的源语言部分形如$\alpha_0 \textrm{X}_1 \alpha_1 \textrm{X}_2 \alpha_2$。其中，$\alpha_0$、$\alpha_1$和$\alpha_2$表示终结符串，$\textrm{X}_1$和$\textrm{X}_2$是变量。显然，如果$\alpha_0$、$\alpha_1$和$\alpha_2$确定下来那么$\textrm{X}_1$和$\textrm{X}_2$的位置也就确定了下来。因此，对于每一个词串，都可以很容易的生成这种模版，进而完成匹配。而$\textrm{X}_1$和$\textrm{X}_2$和原始CKY中匹配二叉规则本质上是一样的。由于这种方法并不需要对CKY方法进行过多调整，因此层次短语系统中广泛使用这种改造的CKY方法进行解码。
+\item 对CKY方法进行改造。解码的核心任务要知道每个跨度是否能匹配规则的源语言部分。实际上，层次短语模型的文法是一种特殊的文法。这种文法规则的源语言部分最多包含两个变量，而且变量不能连续。这样的规则会对应一种特定类型的模版，比如，对于包含两个变量的规则，它的源语言部分形如$\alpha_0 \funp{X}_1 \alpha_1 \funp{X}_2 \alpha_2$。其中，$\alpha_0$、$\alpha_1$和$\alpha_2$表示终结符串，$\funp{X}_1$和$\funp{X}_2$是变量。显然，如果$\alpha_0$、$\alpha_1$和$\alpha_2$确定下来那么$\funp{X}_1$和$\funp{X}_2$的位置也就确定了下来。因此，对于每一个词串，都可以很容易的生成这种模版，进而完成匹配。而$\funp{X}_1$和$\funp{X}_2$和原始CKY中匹配二叉规则本质上是一样的。由于这种方法并不需要对CKY方法进行过多调整，因此层次短语系统中广泛使用这种改造的CKY方法进行解码。
 \vspace{0.5em}
 \end{itemize}

@@ -542,7 +542,7 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q

 \subsection{立方剪枝}

-\parinterval 相比于基于短语的模型，基于层次短语的模型引入了“变量”的概念。这样，可以根据变量周围的上下文信息对变量进行调序。变量的内容由其所对应的跨度上的翻译假设进行填充。图\ref{fig:8-11}展示了一个层次短语规则匹配词串的实例。可以看到，规则匹配词串之后，变量X的位置对应了一个跨度。这个跨度上所有标记为X的局部推导都可以作为变量的内容。
+\parinterval 相比于基于短语的模型，基于层次短语的模型引入了“变量”的概念。这样，可以根据变量周围的上下文信息对变量进行调序。变量的内容由其所对应的跨度上的翻译假设进行填充。图\ref{fig:8-11}展示了一个层次短语规则匹配词串的实例。可以看到，规则匹配词串之后，变量$\seq{X}$的位置对应了一个跨度。这个跨度上所有标记为X的局部推导都可以作为变量的内容。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -555,12 +555,12 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q

 \parinterval 真实的情况会更加复杂。对于一个规则的源语言端，可能会有多个不同的目标语言端与之对应。比如，如下规则的源语言端完全相同，但是译文不同：
 \begin{eqnarray}
-\textrm{X} & \to & \langle\ \textrm{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \textrm{X}_1\ \textrm{have}\ \textrm{drastically}\ \textrm{fallen}\ \rangle \nonumber \\
-\textrm{X} & \to & \langle\ \textrm{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \textrm{X}_1\ \textrm{have}\ \textrm{fallen}\ \textrm{drastically}\ \rangle \nonumber \\
-\textrm{X} & \to & \langle\ \textrm{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \textrm{X}_1\ \textrm{has}\ \textrm{drastically}\ \textrm{fallen}\ \rangle \nonumber
+\funp{X} & \to & \langle\ \funp{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \funp{X}_1\ \textrm{have}\ \textrm{drastically}\ \textrm{fallen}\ \rangle \nonumber \\
+\funp{X} & \to & \langle\ \funp{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \funp{X}_1\ \textrm{have}\ \textrm{fallen}\ \textrm{drastically}\ \rangle \nonumber \\
+\funp{X} & \to & \langle\ \funp{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \funp{X}_1\ \textrm{has}\ \textrm{drastically}\ \textrm{fallen}\ \rangle \nonumber
 \end{eqnarray}

-\parinterval 这也就是说，当匹配规则的源语言部分“$\textrm{X}_1$\ \ 大幅度\ \ 下降\ \ 了”时会有三个译文可以选择。而变量$\textrm{X}_1$部分又有很多不同的局部翻译结果。不同的规则译文和不同的变量译文都可以组合出一个局部翻译结果。图\ref{fig:8-12}展示了这种情况的实例。
+\parinterval 这也就是说，当匹配规则的源语言部分“$\funp{X}_1$\ \ 大幅度\ \ 下降\ \ 了”时会有三个译文可以选择。而变量$\funp{X}_1$部分又有很多不同的局部翻译结果。不同的规则译文和不同的变量译文都可以组合出一个局部翻译结果。图\ref{fig:8-12}展示了这种情况的实例。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -878,8 +878,8 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \noindent 可以得到一个翻译推导：
 {\footnotesize
 \begin{eqnarray}
-&& \langle\ \textrm{IP}^{[1]},\ \textrm{S}^{[1]}\ \rangle \nonumber \\
-& \xrightarrow[r_9]{\textrm{IP}^{[1]} \Leftrightarrow \textrm{S}^{[1]}} & \langle\ \textrm{IP(}\textrm{NN}^{[2]}\ \textrm{VP}^{[3]}),\ \textrm{S(}\textrm{NP}^{[2]}\ \textrm{VP}^{[3]})\ \rangle \nonumber \\
+&& \langle\ \textrm{IP}^{[1]},\ \funp{S}^{[1]}\ \rangle \nonumber \\
+& \xrightarrow[r_9]{\textrm{IP}^{[1]} \Leftrightarrow \funp{S}^{[1]}} & \langle\ \textrm{IP(}\textrm{NN}^{[2]}\ \textrm{VP}^{[3]}),\ \textrm{S(}\textrm{NP}^{[2]}\ \textrm{VP}^{[3]})\ \rangle \nonumber \\
 & \xrightarrow[r_7]{\textrm{NN}^{[2]} \Leftrightarrow \textrm{NP}^{[2]}} & \langle\ \textrm{IP(NN(进口)}\ \textrm{VP}^{[3]}),\ \textrm{S(NP(DT(the) NNS(imports))}\ \textrm{VP}^{[3]})\ \rangle \nonumber \\
 & \xrightarrow[r_8]{\textrm{VP}^{[3]} \Leftrightarrow \textrm{VP}^{[3]}} & \langle\ \textrm{IP(NN(进口)}\ \textrm{VP(AD}^{[4]}\ \textrm{VP(VV}^{[5]}\ \textrm{AS}^{[6]}))), \nonumber \\
 &                 & \ \ \textrm{S(NP(DT(the) NNS(imports))}\ \textrm{VP(VBP}^{[6]}\ \textrm{ADVP(RB}^{[4]}\ \textrm{VBN}^{[5]})))\ \rangle \hspace{4.5em} \nonumber \\
@@ -894,7 +894,7 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \end{eqnarray}
 }

-\noindent 其中，箭头$\rightarrow$表示推导之意。显然，可以把翻译看作是基于树结构的推导过程（记为$d$）。因此，与层次短语模型一样，基于语言学句法的机器翻译也是要找到最佳的推导$\hat{d} = \argmax_{d} \textrm{P} (d)$。
+\noindent 其中，箭头$\rightarrow$表示推导之意。显然，可以把翻译看作是基于树结构的推导过程（记为$d$）。因此，与层次短语模型一样，基于语言学句法的机器翻译也是要找到最佳的推导$\hat{d} = \argmax_{d} \funp{P} (d)$。

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -1317,7 +1317,7 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex

 \begin{itemize}
 \vspace{0.5em}
-\item (h1-2)短语翻译概率（取对数），即规则源语言和目标语言树覆盖的序列翻译概率。令函数$\tau(\cdot)$返回一个树片段的叶子节点序列。对于规则：
+\item ($h_{1-2}$)短语翻译概率（取对数），即规则源语言和目标语言树覆盖的序列翻译概率。令函数$\tau(\cdot)$返回一个树片段的叶子节点序列。对于规则：
 \begin{displaymath}
 \textrm{VP(}\textrm{PP}_1\ \ \textrm{VP(VV(表示)}\ \ \textrm{NN}_2\textrm{))} \rightarrow \textrm{VP(VBZ(was)}\ \ \textrm{VP(}\textrm{VBN}_2\ \ \textrm{PP}_1\textrm{))}
 \end{displaymath}
@@ -1328,7 +1328,7 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \end{eqnarray}
 \noindent 于是，可以定义短语翻译概率特征为$\log(\textrm{P(}\tau( \alpha_r )|\tau( \beta_r )))$和$\log(\textrm{P(}\tau( \beta_r )|\tau( \alpha_r )))$。它们的计算方法与基于短语的系统是完全一样的\footnote[9]{对于树到串规则，$\tau( \beta_r )$就是规则目标语言端的符号串。}；
 \vspace{0.5em}
-\item (h3-4) 词汇化翻译概率（取对数），即$\log(\textrm{P}_{\textrm{lex}}(\tau( \alpha_r )|\tau( \beta_r )))$和$\log(\textrm{P}_{\textrm{lex}}(\tau( \beta_r )|\tau( \alpha_r )))$。这两个特征的计算方法与基于短语的系统也是一样的。
+\item ($h_{3-4}$) 词汇化翻译概率（取对数），即$\log(\funp{P}_{\textrm{lex}}(\tau( \alpha_r )|\tau( \beta_r )))$和$\log(\funp{P}_{\textrm{lex}}(\tau( \beta_r )|\tau( \alpha_r )))$。这两个特征的计算方法与基于短语的系统也是一样的。
 \vspace{0.5em}
 \end{itemize}

@@ -1337,11 +1337,11 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex

 \begin{itemize}
 \vspace{0.5em}
-\item (h5)基于根节点句法标签的规则生成概率（取对数），即$\log(\textrm{P(}r|\textrm{root(}r\textrm{))})$。这里，$\textrm{root(}r)$是规则所对应的双语根节点$(\alpha_h,\beta_h)$；
+\item ($h_{5}$)基于根节点句法标签的规则生成概率（取对数），即$\log(\textrm{P(}r|\textrm{root(}r\textrm{))})$。这里，$\textrm{root(}r)$是规则所对应的双语根节点$(\alpha_h,\beta_h)$；
 \vspace{0.5em}
-\item (h6)基于源语言端的规则生成概率（取对数），即$\log(\textrm{P(}r|\alpha_r)))$，给定源语言端生成整个规则的概率；
+\item ($h_{6}$)基于源语言端的规则生成概率（取对数），即$\log(\textrm{P(}r|\alpha_r)))$，给定源语言端生成整个规则的概率；
 \vspace{0.5em}
-\item (h7)基于目标语言端的规则生成概率（取对数），即$\log(\textrm{P(}r|\beta_r)))$，给定目标语言端生成整个规则的概率。
+\item ($h_{7}$)基于目标语言端的规则生成概率（取对数），即$\log(\textrm{P(}r|\beta_r)))$，给定目标语言端生成整个规则的概率。
 \end{itemize}

 \vspace{0.5em}
@@ -1349,17 +1349,17 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex

 \begin{itemize}
 \vspace{0.5em}
-\item (h8)语言模型得分（取对数），即$\log(\textrm{P}_{\textrm{lm}}(\seq{t}))$，用于度量译文的流畅度；
+\item ($h_{8}$)语言模型得分（取对数），即$\log(\funp{P}_{\textrm{lm}}(\seq{t}))$，用于度量译文的流畅度；
 \vspace{0.5em}
-\item (h9)译文长度，即$|\seq{t}|$，用于避免模型过于倾向生成短译文（因为短译文语言模型分数高）；
+\item ($h_{9}$)译文长度，即$|\seq{t}|$，用于避免模型过于倾向生成短译文（因为短译文语言模型分数高）；
 \vspace{0.5em}
-\item (h10)翻译规则数量，学习对使用规则数量的偏好。比如，如果这个特征的权重较高，则表明系统更喜欢使用数量多的规则；
+\item ($h_{10}$)翻译规则数量，学习对使用规则数量的偏好。比如，如果这个特征的权重较高，则表明系统更喜欢使用数量多的规则；
 \vspace{0.5em}
-\item (h11)组合规则的数量，学习对组合规则的偏好；
+\item ($h_{11}$)组合规则的数量，学习对组合规则的偏好；
 \vspace{0.5em}
-\item (h12)词汇化规则的数量，学习对含有终结符规则的偏好；
+\item ($h_{12}$)词汇化规则的数量，学习对含有终结符规则的偏好；
 \vspace{0.5em}
-\item (h13)低频规则的数量，学习对训练数据中出现频次低于3的规则的偏好。低频规则大多不可靠，设计这个特征的目的也是为了区分不同质量的规则。
+\item ($h_{13}$)低频规则的数量，学习对训练数据中出现频次低于3的规则的偏好。低频规则大多不可靠，设计这个特征的目的也是为了区分不同质量的规则。
 \end{itemize}

 \vspace{0.5em}