Merge branch 'caorunzhe' into 'mengxia'

Caorunzhe: see merge request !253

Commit e5564fa2 authored Sep 23, 2020 by 孟霞
--- a/Chapter2/chapter2.tex
+++ b/Chapter2/chapter2.tex
@@ -593,10 +593,10 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 %    NEW SUBSUB-SECTION
 %----------------------------------------------------------------------------------------

-\subsubsection{2. 古德-图灵估计法}
+\subsubsection{2. 古德-图灵估计方法}

 \vspace{-0.5em}
-\parinterval {\small\bfnew{古德-图灵估计法}}\index{古德-图灵估计法}（Good-Turing Estimate）\index{Good-Turing Estimate}是Alan Turing和他的助手Irving John Good开发的，作为他们在二战期间破解德国密码机Enigma所使用方法的一部分，在1953 年Irving John Good将其发表。这一方法也是很多平滑算法的核心，其基本思路是：把非零的$n$元语法单元的概率降低，匀给一些低概率$n$元语法单元，以减小最大似然估计与真实概率之间的偏离\upcite{good1953population,gale1995good}。
+\parinterval {\small\bfnew{古德-图灵估计}}\index{古德-图灵估计}（Good-Turing Estimate）\index{Good-Turing Estimate}是Alan Turing和他的助手Irving John Good开发的，作为他们在二战期间破解德国密码机Enigma所使用方法的一部分，在1953 年Irving John Good将其发表。这一方法也是很多平滑算法的核心，其基本思路是：把非零的$n$元语法单元的概率降低，匀给一些低概率$n$元语法单元，以减小最大似然估计与真实概率之间的偏离\upcite{good1953population,gale1995good}。

 \parinterval 假定在语料库中出现$r$次的$n$-gram有$n_r$个，特别的，出现0次的$n$-gram（即未登录词及词串）有$n_0$个。语料库中全部单词的总个数为$N$，显然：
 \begin{eqnarray}

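Reviewer note on the Good-Turing hunk above: the paragraph defines $n_r$ as the number of $n$-grams that occur exactly $r$ times. The sketch below shows how those counts of counts are typically used, under the standard Good-Turing formula $r^* = (r+1)\,n_{r+1}/n_r$, which is not visible in this hunk. The function names and the fallback for $n_{r+1}=0$ are illustrative assumptions, not code from the book.

```python
from collections import Counter


def good_turing_adjusted_counts(ngram_counts):
    """Map each observed count r to r* = (r + 1) * n_{r+1} / n_r."""
    # n_r: how many distinct n-grams occur exactly r times
    count_of_counts = Counter(ngram_counts.values())
    adjusted = {}
    for r, n_r in count_of_counts.items():
        n_r_plus_1 = count_of_counts.get(r + 1, 0)
        if n_r_plus_1 == 0:
            # Assumption: keep the raw count when n_{r+1} = 0, because the
            # formula would otherwise give zero; real systems smooth n_r.
            adjusted[r] = float(r)
        else:
            adjusted[r] = (r + 1) * n_r_plus_1 / n_r
    return adjusted


def unseen_probability_mass(ngram_counts):
    """Probability mass reserved for unseen n-grams: n_1 / N."""
    n_1 = sum(1 for c in ngram_counts.values() if c == 1)
    total = sum(ngram_counts.values())  # N = sum over r of r * n_r
    return n_1 / total
```

With three n-grams seen once, two seen twice and one seen three times, the adjusted count for $r=1$ is $2 \cdot 2 / 3$, and the mass left for unseen n-grams is $3/10$.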
--- a/Chapter5/Figures/figure-calculation-formula&iterative-process-of-function.tex
+++ b/Chapter5/Figures/figure-calculation-formula&iterative-process-of-function.tex
@@ -10,8 +10,8 @@
    \node [anchor=west,inner sep=2pt] (eq1) at (0,0) {$f(s_u|t_v)$};
    \node [anchor=west] (eq2) at (eq1.east) {$=$\ };
    \draw [-] ([xshift=0.3em]eq2.east) -- ([xshift=11.6em]eq2.east);
-    \node [anchor=south west] (eq3) at ([xshift=1em]eq2.east) {$\sum_{i=1}^{N} c_{\mathbb{E}}(s_u|t_v;s^{[i]},t^{[i]})$};
-    \node [anchor=north west] (eq4) at (eq2.east) {$\sum_{s_u} \sum_{i=1}^{N} c_{\mathbb{E}}(s_u|t_v;s^{[i]},t^{[i]})$};
+    \node [anchor=south west] (eq3) at ([xshift=1em]eq2.east) {$\sum_{i=1}^{K} c_{\mathbb{E}}(s_u|t_v;s^{[i]},t^{[i]})$};
+    \node [anchor=north west] (eq4) at (eq2.east) {$\sum_{s_u} \sum_{i=1}^{K} c_{\mathbb{E}}(s_u|t_v;s^{[i]},t^{[i]})$};

   {
    \node [anchor=south] (label1) at ([yshift=-6em,xshift=3em]eq1.north west) {利用这个公式计算};

--- a/Chapter5/Figures/figure-em-algorithm-flow-chart.tex
+++ b/Chapter5/Figures/figure-em-algorithm-flow-chart.tex
@@ -6,17 +6,17 @@
 %-------------------------------------------------------------------------
 \begin{tikzpicture}
 \node [anchor=north west] (line1) at (0,0) {\small\sffamily\bfseries{IBM模型1的训练（EM算法）}};
-\node [anchor=north west] (line2) at ([yshift=-0.3em]line1.south west) {输入: 平行语料${(\seq{s}^{[1]},\seq{t}^{[1]}),...,(\seq{s}^{[N]},\seq{t}^{[N]})}$};
+\node [anchor=north west] (line2) at ([yshift=-0.3em]line1.south west) {输入: 平行语料${(\seq{s}^{[1]},\seq{t}^{[1]}),...,(\seq{s}^{[K]},\seq{t}^{[K]})}$};
 \node [anchor=north west] (line3) at ([yshift=-0.1em]line2.south west) {输出: 参数$f(\cdot|\cdot)$的最优值};
-\node [anchor=north west] (line4) at ([yshift=-0.1em]line3.south west) {1: \textbf{Function} \textsc{EM}($\{(\seq{s}^{[1]},\seq{t}^{[1]}),...,(\seq{s}^{[N]},\seq{t}^{[N]})\}$) };
+\node [anchor=north west] (line4) at ([yshift=-0.1em]line3.south west) {1: \textbf{Function} \textsc{EM}($\{(\seq{s}^{[1]},\seq{t}^{[1]}),...,(\seq{s}^{[K]},\seq{t}^{[K]})\}$) };
 \node [anchor=north west] (line5) at ([yshift=-0.1em]line4.south west) {2: \ \ Initialize $f(\cdot|\cdot)$ \hspace{5em} $\rhd$ 比如给$f(\cdot|\cdot)$一个均匀分布};
 \node [anchor=north west] (line6) at ([yshift=-0.1em]line5.south west) {3: \ \ Loop until $f(\cdot|\cdot)$ converges};
-\node [anchor=north west] (line7) at ([yshift=-0.1em]line6.south west) {4: \ \ \ \ \textbf{foreach} $k = 1$ to $N$ \textbf{do}};
+\node [anchor=north west] (line7) at ([yshift=-0.1em]line6.south west) {4: \ \ \ \ \textbf{foreach} $k = 1$ to $K$ \textbf{do}};
 \node [anchor=north west] (line8) at ([yshift=-0.1em]line7.south west) {5: \ \ \ \ \ \ \ \footnotesize{$c_{\mathbb{E}}(\seq{s}_u|\seq{t}_v;\seq{s}^{[k]},\seq{t}^{[k]}) = \sum\limits_{j=1}^{|\seq{s}^{[k]}|} \delta(s_j,s_u) \sum\limits_{i=0}^{|\seq{t}^{[k]}|} \delta(t_i,t_v) \cdot \frac{f(s_u|t_v)}{\sum_{i=0}^{l}f(s_u|t_i)}$}\normalsize{}};
-\node [anchor=north west] (line9) at ([yshift=-0.1em]line8.south west) {6: \ \ \ \ \textbf{foreach} $t_v$ appears at least one of $\{\seq{t}^{[1]},...,\seq{t}^{[N]}\}$ \textbf{do}};
-\node [anchor=north west] (line10) at ([yshift=-0.1em]line9.south west) {7: \ \ \ \ \ \ \ $\lambda_{t_v}^{'} = \sum_{s_u} \sum_{k=1}^{N} c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]})$};
-\node [anchor=north west] (line11) at ([yshift=-0.1em]line10.south west) {8: \ \ \ \ \ \ \ \textbf{foreach} $s_u$ appears at least one of $\{\seq{s}^{[1]},...,\seq{s}^{[N]}\}$ \textbf{do}};
-\node [anchor=north west] (line12) at ([yshift=-0.1em]line11.south west) {9: \ \ \ \ \ \ \ \ \ $f(s_u|t_v) = \sum_{k=1}^{N} c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]}) \cdot (\lambda_{t_v}^{'})^{-1}$};
+\node [anchor=north west] (line9) at ([yshift=-0.1em]line8.south west) {6: \ \ \ \ \textbf{foreach} $t_v$ appears at least one of $\{\seq{t}^{[1]},...,\seq{t}^{[K]}\}$ \textbf{do}};
+\node [anchor=north west] (line10) at ([yshift=-0.1em]line9.south west) {7: \ \ \ \ \ \ \ $\lambda_{t_v}^{'} = \sum_{s_u} \sum_{k=1}^{K} c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]})$};
+\node [anchor=north west] (line11) at ([yshift=-0.1em]line10.south west) {8: \ \ \ \ \ \ \ \textbf{foreach} $s_u$ appears at least one of $\{\seq{s}^{[1]},...,\seq{s}^{[K]}\}$ \textbf{do}};
+\node [anchor=north west] (line12) at ([yshift=-0.1em]line11.south west) {9: \ \ \ \ \ \ \ \ \ $f(s_u|t_v) = \sum_{k=1}^{K} c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]}) \cdot (\lambda_{t_v}^{'})^{-1}$};
 \node [anchor=north west] (line13) at ([yshift=-0.1em]line12.south west) {10: \ \textbf{return} $f(\cdot|\cdot)$};

 \begin{pgfonlayer}{background}

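Reviewer note on the EM flow chart above: a minimal Python sketch of the ten-step procedure in the figure may help when checking the renamed index $K$. It assumes a fixed number of iterations in place of the convergence test, and it uses a hypothetical `NULL` token for the empty word $t_0$. Neither choice comes from the book.

```python
from collections import defaultdict

NULL = "<null>"  # hypothetical marker for the empty target word t_0


def train_ibm_model1(corpus, iterations=10):
    """corpus: list of (source_words, target_words); returns f[(s, t)] = f(s|t)."""
    src_vocab = {s for src, _ in corpus for s in src}
    # Step 2: initialise f(.|.) uniformly
    f = defaultdict(lambda: 1.0 / len(src_vocab))
    # Step 3: a fixed iteration count stands in for "loop until converged"
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c_E(s_u|t_v), summed over the K pairs
        total = defaultdict(float)  # lambda'_{t_v} = sum over s_u of c_E(s_u|t_v)
        # Steps 4-5: accumulate expected counts over every sentence pair
        for src, tgt in corpus:
            tgt = [NULL] + list(tgt)
            for s in src:
                z = sum(f[(s, t)] for t in tgt)  # sum_i f(s_u|t_i)
                for t in tgt:
                    c = f[(s, t)] / z
                    count[(s, t)] += c
                    total[t] += c
        # Steps 6-9: renormalise per target word
        f = defaultdict(float, {(s, t): c / total[t] for (s, t), c in count.items()})
    return f
```

After each iteration the values $f(\cdot|t_v)$ sum to one for every target word $t_v$, which is a quick sanity check on steps 7 and 9.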
--- a/Chapter5/Figures/figure-processes-smt.tex
+++ b/Chapter5/Figures/figure-processes-smt.tex
@@ -15,7 +15,7 @@
 \end{pgfonlayer}
 }

-\node [anchor=west,ugreen] (P) at ([xshift=5em,yshift=-0.7em]corpus.east){P($\mathbf{t}|\mathbf{s}$)};
+\node [anchor=west,ugreen] (P) at ([xshift=5em,yshift=-0.7em]corpus.east){$\funp{P}(\seq{t}|\seq{s})$};
 \node [anchor=south] (modellabel) at (P.north) {{\color{ublue} {\scriptsize \sffamily\bfseries{翻译模型}}}};

 \begin{pgfonlayer}{background}

--- a/Chapter5/chapter5.tex
+++ b/Chapter5/chapter5.tex
@@ -50,7 +50,7 @@ IBM模型由Peter F. Brown等人于上世纪九十年代初提出\upcite{DBLP:jo
 \end{figure}
 %----------------------------------------------

-\parinterval 上面的例子反映了人在做翻译时所使用的一些知识：首先，两种语言单词的顺序可能不一致，而且译文需要符合目标语的习惯，这也就是常说的翻译的{\small\sffamily\bfseries{流畅度}}\index{流畅度}问题（Fluency）\index{Fluency}；其次，源语言单词需要准确地被翻译出来，也就是常说的翻译的{\small\sffamily\bfseries{准确性}}\index{准确性}(Accuracy)\index{Accuracy}问题和{\small\sffamily\bfseries{充分性}}\index{充分性}（Adequacy）\index{Adequacy}问题。为了达到以上目的，传统观点认为翻译过程需要包含三个步骤\upcite{parsing2009speech}：
+\parinterval 上面的例子反映了人在做翻译时所使用的一些知识：首先，两种语言单词的顺序可能不一致，而且译文需要符合目标语的习惯，这也就是常说的翻译的{\small\sffamily\bfseries{流畅度}}\index{流畅度}问题（Fluency）\index{Fluency}；其次，源语言单词需要准确地被翻译出来，也就是常说的翻译的{\small\sffamily\bfseries{准确性}}\index{准确性}（Accuracy）\index{Accuracy}问题和{\small\sffamily\bfseries{充分性}}\index{充分性}（Adequacy）\index{Adequacy}问题。为了达到以上目的，传统观点认为翻译过程需要包含三个步骤\upcite{parsing2009speech}：

 \begin{itemize}
 \vspace{0.5em}
@@ -273,13 +273,13 @@ $\seq{t}$ = machine\; \underline{translation}\; is\; a\; process\; of\; generati

 \subsubsection{3. 如何从大量的双语平行数据中进行学习？}

-\parinterval 如果有更多的句子，上面的方法同样适用。假设，有$N$个互译句对$\{(\seq{s}^{[1]},\seq{t}^{[1]})$,...,\\$(\seq{s}^{[N]},\seq{t}^{[N]})\}$。仍然可以使用基于相对频次的方法估计翻译概率$\funp{P}(x,y)$，具体方法如下:
+\parinterval 如果有更多的句子，上面的方法同样适用。假设，有$K$个互译句对$\{(\seq{s}^{[1]},\seq{t}^{[1]})$,...,\\$(\seq{s}^{[K]},\seq{t}^{[K]})\}$。仍然可以使用基于相对频次的方法估计翻译概率$\funp{P}(x,y)$，具体方法如下:
 \begin{eqnarray}
-\funp{P}(x,y)  =  \frac{{\sum_{i=1}^{N} c(x,y;\seq{s}^{[i]},\seq{t}^{[i]})}}{\sum_{i=1}^{N}{{\sum_{x',y'} c(x',y';\seq{s}^{[i]},\seq{t}^{[i]})}}}
+\funp{P}(x,y)  =  \frac{{\sum_{i=1}^{K} c(x,y;\seq{s}^{[i]},\seq{t}^{[i]})}}{\sum_{i=1}^{K}{{\sum_{x',y'} c(x',y';\seq{s}^{[i]},\seq{t}^{[i]})}}}
 \label{eq:5-4}
 \end{eqnarray}

-\parinterval 与公式\ref{eq:5-1}相比，公式\ref{eq:5-4}的分子、分母都多了一项累加符号$\sum_{i=1}^{N} \cdot$，它表示遍历语料库中所有的句对。换句话说，当计算词的共现次数时，需要对每个句对上的计数结果进行累加。从统计学习的角度，使用更大规模的数据进行参数估计可以提高结果的可靠性。计算单词的翻译概率也是一样，在小规模的数据上看，很多翻译现象的特征并不突出，但是当使用的数据量增加到一定程度，翻译的规律会很明显的体现出来。
+\parinterval 与公式\ref{eq:5-1}相比，公式\ref{eq:5-4}的分子、分母都多了一项累加符号$\sum_{i=1}^{K} \cdot$，它表示遍历语料库中所有的句对。换句话说，当计算词的共现次数时，需要对每个句对上的计数结果进行累加。从统计学习的角度，使用更大规模的数据进行参数估计可以提高结果的可靠性。计算单词的翻译概率也是一样，在小规模的数据上看，很多翻译现象的特征并不突出，但是当使用的数据量增加到一定程度，翻译的规律会很明显的体现出来。

 \parinterval 举个例子，实例\ref{eg:5-2}展示了一个由两个句对构成的平行语料库。

@@ -633,7 +633,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \end{figure}
 %----------------------------------------------

-\item 源语言单词可以翻译为空，这时它对应到一个虚拟或伪造的目标语单词$t_0$。在图\ref{fig:5-16}所示的例子中，``在''没有对应到``on the table''中的任意一个词，而是把它对应到$t_0$上。这样，所有的源语言单词都能找到一个目标语单词对应。这种设计也很好地引入了{\small\sffamily\bfseries{空对齐}}\index{空对齐}的思想，即源语言单词不对应任何真实存在的单词的情况。而这种空对齐的情况在翻译中是频繁出现的，比如虚词的翻译。
+\item 源语言单词可以翻译为空，这时它对应到一个虚拟或伪造的目标语单词$t_0$。在图\ref{fig:5-16}所示的例子中，``在''没有对应到``on the table''中的任意一个词，而是把它对应到$t_0$上。这样，所有的源语言单词都能找到一个目标语单词对应。这种设计也很好地引入了{\small\sffamily\bfseries{空对齐}}\index{空对齐}（Empty Alignment\index{Empty Alignment}）的思想，即源语言单词不对应任何真实存在的单词的情况。而这种空对齐的情况在翻译中是频繁出现的，比如虚词的翻译。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -703,7 +703,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \subsection{基于词对齐的翻译实例}

-\parinterval 用前面图\ref{fig:5-16}中例子来对公式\ref{eq:5-18}进行说明。例子中，源语言句子``在\ \ 桌子\ \ 上''目标语译文``on the table''之间的词对齐为$\seq{a}=\{\textrm{1-0, 2-3, 3-1}\}$。公式\ref{eq:5-18}的计算过程如下：
+\parinterval 用前面图\ref{fig:5-16}中例子来对公式\ref{eq:5-18}进行说明。例子中，源语言句子``在\ \ 桌子\ \ 上''目标语译文``on the table''之间的词对齐为$\seq{a}=\{\textrm{1-0, 2-3, 3-1}\}$。 公式\ref{eq:5-18}的计算过程如下：

 \begin{itemize}
 \vspace{0.5em}
@@ -720,11 +720,11 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \funp{P}(\seq{s},\seq{a}|\seq{t})\; &= & \funp{P}(m|\seq{t}) \prod\limits_{j=1}^{m} \funp{P}(a_j|a_{1}^{j-1},s_{1}^{j-1},m,\seq{t}) \funp{P}(s_j|a_{1}^{j},s_{1}^{j-1},m,\seq{t}) \nonumber \\
 &=&\funp{P}(m=3 \mid \textrm{$t_0$ on the table}){\times} \nonumber \\
 &&{\funp{P}(a_1=0 \mid \phi,\phi,3,\textrm{$t_0$ on the table}){\times} } \nonumber \\
-&&{\funp{P}(f_1=\textrm{在} \mid \textrm{\{1-0\}},\phi,3,\textrm{$t_0$ on the table}){\times} } \nonumber \\
+&&{\funp{P}(s_1=\textrm{在} \mid \textrm{\{1-0\}},\phi,3,\textrm{$t_0$ on the table}){\times} } \nonumber \\
 &&{\funp{P}(a_2=3 \mid \textrm{\{1-0\}},\textrm{在},3,\textrm{$t_0$ on the table}) {\times}} \nonumber \\
-&&{\funp{P}(f_2=\textrm{桌子} \mid \textrm{\{1-0, 2-3\}},\textrm{在},3,\textrm{$t_0$ on the table}) {\times}} \nonumber \\
+&&{\funp{P}(s_2=\textrm{桌子} \mid \textrm{\{1-0, 2-3\}},\textrm{在},3,\textrm{$t_0$ on the table}) {\times}} \nonumber \\
 &&{\funp{P}(a_3=1 \mid \textrm{\{1-0, 2-3\}},\textrm{在\ \ 桌子},3,\textrm{$t_0$ on the table}) {\times}} \nonumber \\
-&&{\funp{P}(f_3=\textrm{上} \mid \textrm{\{1-0, 2-3, 3-1\}},\textrm{在\ \ 桌子},3,\textrm{$t_0$ on the table})  }
+&&{\funp{P}(s_3=\textrm{上} \mid \textrm{\{1-0, 2-3, 3-1\}},\textrm{在\ \ 桌子},3,\textrm{$t_0$ on the table})  }
 \label{eq:5-19}
 \end{eqnarray}

@@ -1064,10 +1064,10 @@ f(s_u|t_v)=\frac{c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})}  { \sum\limits_{s_u} c
 \end{figure}
 %----------------------------------------------

-\parinterval 进一步，假设有$N$个互译的句对（称作平行语料）：
-$\{(\seq{s}^{[1]},\seq{t}^{[1]}),...,(\seq{s}^{[N]},\seq{t}^{[N]})\}$，$f(s_u|t_v)$的期望频次为：
+\parinterval 进一步，假设有$K$个互译的句对（称作平行语料）：
+$\{(\seq{s}^{[1]},\seq{t}^{[1]}),...,(\seq{s}^{[K]},\seq{t}^{[K]})\}$，$f(s_u|t_v)$的期望频次为：
 \begin{eqnarray}
-c_{\mathbb{E}}(s_u|t_v)=\sum\limits_{i=1}^{N}  c_{\mathbb{E}}(s_u|t_v;s^{[i]},t^{[i]})
+c_{\mathbb{E}}(s_u|t_v)=\sum\limits_{i=1}^{K}  c_{\mathbb{E}}(s_u|t_v;s^{[i]},t^{[i]})
 \label{eq:5-46}
 \end{eqnarray}


--- a/Chapter6/Figures/figure-expression.tex
+++ b/Chapter6/Figures/figure-expression.tex
@@ -12,16 +12,16 @@



-\node [anchor=west,inner sep=2pt,minimum height=2.5em] (eq1) at (0,0) {${\textrm{P}(\tau,\pi|\mathbf{t}) =  \prod_{i=1}^{l}\hspace{6.0em} \times \ \hspace{5.5em}\times}$};
+\node [anchor=west,inner sep=2pt,minimum height=2.5em] (eq1) at (0,0) {${\funp{P}(\tau,\pi|\seq{t}) =  \prod_{i=1}^{l}\hspace{6.0em} \times \ \hspace{5.5em}\times}$};
 \node [anchor=north west,inner sep=2pt,minimum height=2.5em] (eq2) at ([xshift=-16.06em,yshift=0.0em]eq1.south east) {${\prod_{i=0}^l{\prod_{k=1}^{\varphi_i}\hspace{9.6em}} \ \ \times}$};
 \node [anchor=north west,inner sep=2pt,minimum height=2.5em] (eq3) at ([xshift=-16.05em,yshift=0.0em]eq2.south east) {${\prod_{i=1}^l{\prod_{k=1}^{\varphi_i}}\hspace{11.5em} \times}$};
 \node [anchor=north west,inner sep=2pt,minimum height=2.5em] (eq4) at ([xshift=-17.44em,yshift=0.0em]eq3.south east) {{${\prod_{k=1}^{\varphi_0}}$}};

-\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=red!30] (part1) at ([xshift=-13.4em,yshift=0.0em]eq1.east) {{${\textrm{P}(\varphi_i|\varphi_{1}^{i-1},\mathbf{t})}$}};
-\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=blue!30] (part2) at ([xshift=-6.4em,yshift=0.0em]eq1.east) {{${\textrm{P}(\varphi_0|\varphi_{1}^{l},\mathbf{t})}$}};
-\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=green!30] (part3) at ([xshift=-11em,yshift=0.0em]eq2.east) {{${\textrm{P}(\tau_{ik}|\tau_{i1}^{k-1},\tau_{1}^{i-1},\varphi_{0}^{l},\mathbf{t} )}$}};
-\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=yellow!30] (part4) at ([xshift=-12.5em,yshift=0.0em]eq3.east) {{${\textrm{P}(\pi_{ik}|\pi_{i1}^{k-1},\pi_{1}^{i-1},\tau_{0}^{l},\varphi_{0}^{l},\mathbf{t} )}$}};
-\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=gray!30] (part5) at ([xshift=0.0em,yshift=0.0em]eq4.east) {{${\textrm{P}(\pi_{0k}|\pi_{01}^{k-1},\pi_{1}^{l},\tau_{0}^{l},\varphi_{0}^{l},\mathbf{t} )}$}};
+\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=red!30] (part1) at ([xshift=-13.4em,yshift=0.0em]eq1.east) {{${\funp{P}(\varphi_i|\varphi_{1}^{i-1},\seq{t})}$}};
+\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=blue!30] (part2) at ([xshift=-6.4em,yshift=0.0em]eq1.east) {{${\funp{P}(\varphi_0|\varphi_{1}^{l},\seq{t})}$}};
+\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=green!30] (part3) at ([xshift=-11em,yshift=0.0em]eq2.east) {{${\funp{P}(\tau_{ik}|\tau_{i1}^{k-1},\tau_{1}^{i-1},\varphi_{0}^{l},\seq{t} )}$}};
+\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=yellow!30] (part4) at ([xshift=-12.5em,yshift=0.0em]eq3.east) {{${\funp{P}(\pi_{ik}|\pi_{i1}^{k-1},\pi_{1}^{i-1},\tau_{0}^{l},\varphi_{0}^{l},\seq{t} )}$}};
+\node [anchor=west,inner sep=2pt,minimum height=2.0em,fill=gray!30] (part5) at ([xshift=0.0em,yshift=0.0em]eq4.east) {{${\funp{P}(\pi_{0k}|\pi_{01}^{k-1},\pi_{1}^{l},\tau_{0}^{l},\varphi_{0}^{l},\seq{t} )}$}};


 \end{tikzpicture}

--- a/Chapter6/Figures/figure-probability-translation-process.tex
+++ b/Chapter6/Figures/figure-probability-translation-process.tex
@@ -4,10 +4,10 @@

 {
 {
-\node [anchor=north west] (st) at (0,0) {$\mathbf{s}$};
+\node [anchor=north west] (st) at (0,0) {$\seq{s}$};
 \node [anchor=north] (taut) at ([yshift=-3em]st.south) {\sffamily\bfseries{$\tau$}};
 \node [anchor=north] (phit) at ([yshift=-3em]taut.south) {\sffamily\bfseries{$\phi$}};
-\node [anchor=north] (tt) at ([yshift=-3em]phit.south) {$\mathbf{t}$};
+\node [anchor=north] (tt) at ([yshift=-3em]phit.south) {$\seq{t}$};
 }
 {\scriptsize
 \node [anchor=west,minimum height=2.5em,minimum width=5.0em] (sf1) at ([xshift=1em]st.east) {};

--- a/Chapter6/Figures/figure-word-alignment&probability-distribution-in-ibm-model-3.tex
+++ b/Chapter6/Figures/figure-word-alignment&probability-distribution-in-ibm-model-3.tex
@@ -14,8 +14,8 @@
 \node [anchor=west] (eq2) at ([xshift=3.0em,yshift=0.0em]eq1.east) {早饭};
 \node [anchor=north] (eq3) at ([xshift=0.0em,yshift=-2.0em]eq1.south) {have};
 \node [anchor=north] (eq4) at ([xshift=0.0em,yshift=-2.0em]eq2.south) {breakfast};
-\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\mathbf{a}_{1}$};
-\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},\mathbf{a}_{1}|\mathbf{t})=0.5$};
+\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\seq{a}_{1}$};
+\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\funp{P}(\seq{s},\seq{a}_{1}|\seq{t})=0.5$};
 \draw [-,very thick](eq1.south) -- (eq3.north);
 \draw [-,very thick](eq2.south) -- (eq4.north);
 \node [anchor=west] (eq7) at ([xshift=13.1em,yshift=1.4em]eq2.east) {};
@@ -33,8 +33,8 @@
 \node [anchor=west] (eq2) at ([xshift=3.0em,yshift=0.0em]eq1.east) {早饭};
 \node [anchor=north] (eq3) at ([xshift=0.0em,yshift=-2.0em]eq1.south) {have};
 \node [anchor=north] (eq4) at ([xshift=0.0em,yshift=-2.0em]eq2.south) {breakfast};
-\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\mathbf{a}_{2}$};
-\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},\mathbf{a}_{2}|\mathbf{t})=0.1$};
+\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\seq{a}_{2}$};
+\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\funp{P}(\seq{s},\seq{a}_{2}|\seq{t})=0.1$};
 \draw [-,very thick](eq1.south) -- (eq4.north);
 \draw [-,very thick](eq2.south) -- (eq3.north);
 \end{scope}
@@ -44,8 +44,8 @@
 \node [anchor=west] (eq2) at ([xshift=3.0em,yshift=0.0em]eq1.east) {早饭};
 \node [anchor=north] (eq3) at ([xshift=0.0em,yshift=-2.0em]eq1.south) {have};
 \node [anchor=north] (eq4) at ([xshift=0.0em,yshift=-2.0em]eq2.south) {breakfast};
-\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\mathbf{a}_{3}$};
-\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},\mathbf{a}_{3}|\mathbf{t})=0.1$};
+\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\seq{a}_{3}$};
+\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\funp{P}(\seq{s},\seq{a}_{3}|\seq{t})=0.1$};
 \draw [-,very thick](eq1.south) -- (eq3.north);
 \draw [-,very thick](eq2.south) -- (eq3.north);
 \end{scope}
@@ -55,8 +55,8 @@
 \node [anchor=west] (eq2) at ([xshift=3.0em,yshift=0.0em]eq1.east) {早饭};
 \node [anchor=north] (eq3) at ([xshift=0.0em,yshift=-2.0em]eq1.south) {have};
 \node [anchor=north] (eq4) at ([xshift=0.0em,yshift=-2.0em]eq2.south) {breakfast};
-\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\mathbf{a}_{4}$};
-\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},\mathbf{a}_{4}|\mathbf{t})=0.1$};
+\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\seq{a}_{4}$};
+\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\funp{P}(\seq{s},\seq{a}_{4}|\seq{t})=0.1$};
 \draw [-,very thick](eq1.south) -- (eq4.north);
 \draw [-,very thick](eq2.south) -- (eq4.north);
 \end{scope}
@@ -66,8 +66,8 @@
 \node [anchor=west] (eq2) at ([xshift=3.0em,yshift=0.0em]eq1.east) {早饭};
 \node [anchor=north] (eq3) at ([xshift=0.0em,yshift=-2.0em]eq1.south) {have};
 \node [anchor=north] (eq4) at ([xshift=0.0em,yshift=-2.0em]eq2.south) {breakfast};
-\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\mathbf{a}_{5}$};
-\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},\mathbf{a}_{5}|\mathbf{t})=0.05$};
+\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\seq{a}_{5}$};
+\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\funp{P}(\seq{s},\seq{a}_{5}|\seq{t})=0.05$};
 \draw [-,very thick](eq1.south) -- (eq3.north);
 \draw [-,very thick](eq1.south) -- (eq4.north);
 \draw [-,very thick](eq2.south) -- (eq3.north);
@@ -81,8 +81,8 @@
 \node [anchor=west] (eq2) at ([xshift=3.0em,yshift=0.0em]eq1.east) {早饭};
 \node [anchor=north] (eq3) at ([xshift=0.0em,yshift=-2.0em]eq1.south) {have};
 \node [anchor=north] (eq4) at ([xshift=0.0em,yshift=-2.0em]eq2.south) {breakfast};
-\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\mathbf{a}_{6}$};
-\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\textrm{P}(\mathbf{s},\mathbf{a}_{6}|\mathbf{t})=0.05$};
+\node [anchor=east] (eq5) at ([xshift=-1.0em,yshift=-1.8em]eq1.west) {$\seq{a}_{6}$};
+\node [anchor=west] (eq6) at ([xshift=1.0em,yshift=-1.8em]eq2.east) {$\funp{P}(\seq{s},\seq{a}_{6}|\seq{t})=0.05$};
 \draw [-,very thick](eq1.south) -- (eq3.north);
 \draw [-,very thick](eq2.south) -- (eq4.north);
 \draw [-,very thick](eq2.south) -- (eq3.north);

--- a/Chapter6/Figures/figure-word-alignment.tex
+++ b/Chapter6/Figures/figure-word-alignment.tex
@@ -39,8 +39,8 @@
 \draw [->,thick] (s5.south) -- (t6.north);
 }

-\node [anchor=east] (ss) at ([xshift=-0.5em]s1.west) {$\mathbf{s}$};
-\node [anchor=east] (tt) at ([xshift=-0.5em]t1.west) {$\mathbf{t}$};
+\node [anchor=east] (ss) at ([xshift=-0.5em]s1.west) {$\seq{s}$};
+\node [anchor=east] (tt) at ([xshift=-0.5em]t1.west) {$\seq{t}$};

 }
 \end{tikzpicture}

--- a/Chapter7/Figures/figure-translation-option.tex
+++ b/Chapter7/Figures/figure-translation-option.tex
@@ -4,7 +4,7 @@
 \begin{tikzpicture}
 \begin{scope}[minimum height = 16pt]

-\node[anchor=east] (s0) at (-0.8em, 0) {$\textbf{s}$：};
+\node[anchor=east] (s0) at (-0.8em, 0) {$\seq{s}$：};
 \node[anchor=west] (s1) at (0, 0) {桌子};
 \node[anchor=west] (s2) at ([xshift=2em]s1.east) {上};
 \node[anchor=west] (s3) at ([xshift=2.3em]s2.east) {有};

--- a/Chapter8/Figures/figure-chinese-syntax-tree.tex
+++ b/Chapter8/Figures/figure-chinese-syntax-tree.tex
@@ -8,7 +8,7 @@
        [.NP
            [.NP
                [.DT the ]
-                [.NN import ]
+                [.NN imports ]
            ]
            [.IN in ]
            [.NP \edge[roof]; {North Korea} ]

--- a/Chapter8/Figures/figure-cky-algorithm.tex
+++ b/Chapter8/Figures/figure-cky-algorithm.tex
@@ -8,7 +8,7 @@
 \tikzstyle{srcnode} = [anchor=south west]
 \begin{scope}[scale=0.85]

-\node[srcnode] (c1) at (0,0) {\normalsize{\textbf{Function} CKY-Algorithm($\textbf{s},G$)}};
+\node[srcnode] (c1) at (0,0) {\normalsize{\textbf{Function} CKY-Algorithm($\seq{s},G$)}};
 \node[srcnode,anchor=north west] (c21) at ([xshift=1.5em,yshift=0.4em]c1.south west) {\normalsize{\textbf{for} $j=0$ to $ J - 1$}};
 \node[srcnode,anchor=north west] (c22) at ([xshift=1.5em,yshift=0.4em]c21.south west) {\normalsize{$span[j,j+1 ]$.Add($A \to a \in G$)}};
 \node[srcnode,anchor=north west] (c3) at ([xshift=-1.5em,yshift=0.4em]c22.south west) {\normalsize{\textbf{for} $l$ = 1 to $J$}};
@@ -21,7 +21,7 @@
 \node[srcnode,anchor=north west] (c7) at ([yshift=0.4em]c6.south west) {\normalsize{$span[j, j+l]$.Update($hypos$)}};
 \node[srcnode,anchor=north west] (c8) at ([xshift=-4.5em,yshift=0.4em]c7.south west) {\normalsize{\textbf{return} $span[0, J]$}};

-\node[anchor=west] (c9) at ([xshift=-3.2em,yshift=1.7em]c1.west) {\small{\textrm{参数：}\textbf{s}为输入字符串。$G$为输入CFG。$J$为待分析字符串长度。}};
+\node[anchor=west] (c9) at ([xshift=-3.2em,yshift=1.7em]c1.west) {\small{\textrm{参数：}\seq{s}为输入字符串。$G$为输入CFG。$J$为待分析字符串长度。}};
 \node[anchor=west] (c10) at ([xshift=0em,yshift=1.3em]c9.west) {\small{\textrm{输出：字符串全部可能的语法分析结果}}};
 \node[anchor=west] (c11) at ([xshift=0em,yshift=1.3em]c10.west) {\small{\textrm{输入：符合乔姆斯基范式的待分析字符串和一个上下文无关文法（CFG）}}};

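Reviewer note on the CKY figure above: the pseudo-code can be sketched as a small recogniser for a grammar in Chomsky normal form. The dictionary layout for rules is an assumption made for illustration. Unlike the figure, the sketch starts the length loop at 2 and keeps the split point strictly inside the span, because spans of length 1 are already filled by the first loop.

```python
from collections import defaultdict


def cky_parse(words, lexical_rules, binary_rules):
    """Return the set of non-terminals that cover the whole input.

    lexical_rules: {word: {A, ...}} for rules A -> a
    binary_rules:  {(B, C): {A, ...}} for rules A -> B C
    """
    J = len(words)
    span = defaultdict(set)
    # Spans of length 1: apply the lexical rules A -> a
    for j in range(J):
        span[(j, j + 1)] |= lexical_rules.get(words[j], set())
    # Longer spans: combine two adjacent sub-spans with a binary rule
    for l in range(2, J + 1):            # span length
        for j in range(0, J - l + 1):    # span start
            for k in range(j + 1, j + l):  # split point
                for B in span[(j, k)]:
                    for C in span[(k, j + l)]:
                        span[(j, j + l)] |= binary_rules.get((B, C), set())
    return span[(0, J)]
```

A sentence is accepted when the start symbol appears in the returned set for `span[(0, J)]`.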

--- a/Chapter8/Figures/figure-content-of-chart-in-tree-based-decoding.tex
+++ b/Chapter8/Figures/figure-content-of-chart-in-tree-based-decoding.tex
@@ -64,8 +64,8 @@

 \node[anchor=west](k2) at ([xshift=0.2em,yshift=-1.7em]k1.west){{[0,1]}};
 \node[anchor=west](k3) at ([xshift=0em,yshift=-1.5em]k2.west){{[1,2]}};
-\node[anchor=west](k4) at ([xshift=0em,yshift=-1.5em]k3.west){{[2,5]}};
-\node[anchor=west](k5) at ([xshift=0em,yshift=-1.5em]k4.west){{[3,6]}};
+\node[anchor=west](k4) at ([xshift=0em,yshift=-1.5em]k3.west){{[2,3]}};
+\node[anchor=west](k5) at ([xshift=0em,yshift=-1.5em]k4.west){{[3,4]}};
 \node[anchor=west](k6) at ([xshift=0em,yshift=-1.5em]k5.west){{[0,2]}};
 \node[anchor=west](k7) at ([xshift=0em,yshift=-1.5em]k6.west){{[1,3]}};
 \node[anchor=west](k8) at ([xshift=0em,yshift=-1.5em]k7.west){{[2,4]}};

--- a/Chapter8/Figures/figure-example-of-cky-algorithm-execution.tex
+++ b/Chapter8/Figures/figure-example-of-cky-algorithm-execution.tex
@@ -224,7 +224,7 @@
 {
 \node [anchor=center] (n6) at ([yshift=-4em]n5.center) {\scriptsize{6}};
 \node [anchor=center] (k6) at ([yshift=-4em]k5.center) {\scriptsize{[{\blue 0},{\blue 2}]}};
-\node [anchor=west] (t6) at ([xshift=0.2em,yshift=-4em]t5.west) {\scriptsize{none}};
+\node [anchor=west] (t6) at ([xshift=0.2em,yshift=-4.2em]t5.west) {\scriptsize{none}};
 \node [anchor=center,selectnode,fill=red!20] (alig22) at (cell22.center) {\tiny{}};
 }

@@ -337,7 +337,7 @@
 \node [anchor=center] (sep1) at ([yshift=-1.7em]n7.center) {\scriptsize{...}};
 \node [anchor=center] (n8) at ([yshift=-3.4em]n7.center) {\scriptsize{15}};
 \node [anchor=center] (k8) at ([yshift=-3.4em]k7.center) {\scriptsize{[{\blue 0},{\blue 5}]}};
-\node [anchor=west] (t8) at ([yshift=-3.4em]t7.west) {\tiny{S $\to$ AB}};
+\node [anchor=west] (t8) at ([yshift=-3.4em]t7.west) {\scriptsize{S $\to$ AB}};

 \node [anchor=center,selectnode,fill=red!20] (alig33) at (cell33.center) {\tiny{}};
 \node [anchor=center,selectnode,fill=red!20] (alig42) at (cell42.center) {\tiny{}};

--- a/Chapter8/Figures/figure-tree-binarization.tex
+++ b/Chapter8/Figures/figure-tree-binarization.tex
@@ -31,7 +31,7 @@
 \node [anchor=north west] (rule2) at (rule1t.south west) {NP(NNP$_1$ NN(总统) NN(乔治) NN(华盛顿))};
 \node [anchor=north west] (rule2t) at ([yshift=0.2em]rule2.south west) {$\to$ NNP$_1$ President Trump};
 \node [anchor=north west] (rulelabel2) at ([yshift=-0.3em]rule2t.south west) {{{\red{不能}}抽取到的规则：}};
-\node [anchor=north west] (rule3) at (rulelabel2.south west) {NP(NN(乔治) NN(华盛顿)) $\to$ Trump};
+\node [anchor=north west] (rule3) at (rulelabel2.south west) {NP(NN(乔治) NN(华盛顿)) $\to$ Washington};

 \end{scope}
 }

--- a/Chapter8/chapter8.tex
+++ b/Chapter8/chapter8.tex
@@ -25,7 +25,7 @@

 人类的语言是具有结构的，这种结构往往体现在句子的句法信息上。比如，人们进行翻译时会将待翻译句子的主干确定下来，之后得到译文的主干，最后形成完整的译文。一个人学习外语时，也会先学习外语句子的基本构成，比如，主语、谓语等，之后用这种句子结构知识生成外语句子。

-使用句法分析可以很好的处理翻译中的结构调序、远距离依赖等问题。因此，基于句法的机器翻译模型长期受到研究者关注。比如，早期基于规则的方法里就大量使用了句法信息来定义翻译规则。进入统计机器翻译时代，句法信息的使用同样是领域主要研究方向之一。这也产生了很多基于句法的机器翻译模型及方法，而且在很多任务上取得非常出色的结果。本章将对这些模型和方法进行介绍，内容涉及机器翻译中句法信息的表示、基于句法的翻译建模、句法翻译规则的学习等。
+使用句法分析可以很好地处理翻译中的结构调序、远距离依赖等问题。因此，基于句法的机器翻译模型长期受到研究者关注。比如，早期基于规则的方法里就大量使用了句法信息来定义翻译规则。进入统计机器翻译时代，句法信息的使用同样是领域主要研究方向之一。这也产生了很多基于句法的机器翻译模型及方法，而且在很多任务上取得非常出色的结果。本章将对这些模型和方法进行介绍，内容涉及机器翻译中句法信息的表示、基于句法的翻译建模、句法翻译规则的学习等。

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -68,9 +68,9 @@
 \end{figure}
 %-------------------------------------------

-\parinterval 句法树结构可以赋予机器翻译对语言进一步抽象的能力，这样，可以不需要使用连续词串，而是通过句法结构来对大范围的译文生成和调序进行建模。图\ref{fig:8-3}是一个在翻译中融入源语言（汉语）句法信息的实例。这个例子中，介词短语包含15个单词，因此，使用短语很难涵盖“在 $...$ 后”这样的片段。这时，系统会把“在 $...$ 后”错误地翻译为“In $...$”。通过句法树，可以知道“在 $...$ 后”对应着一个完整的子树结构PP（介词短语）。因此也很容易知道介词短语中“在 $...$ 后”是一个模板（红色），而“在”和“后”之间的部分构成从句部分（蓝色）。最终得到正确的译文“After $...$”。
+\parinterval 句法树结构可以赋予机器翻译对语言进一步抽象的能力，这样，可以不需要使用连续词串，而是通过句法结构来对大范围的译文生成和调序进行建模。图\ref{fig:8-3}是一个在翻译中融入源语言（汉语）句法信息的实例。这个例子中，介词短语“在 $...$ 后”包含15个单词，因此，使用短语很难涵盖这样的片段。这时，系统会把“在 $...$ 后”错误地翻译为“In $...$”。通过句法树，可以知道“在 $...$ 后”对应着一个完整的子树结构PP（介词短语）。因此也很容易知道介词短语中“在 $...$ 后”是一个模板（红色），而“在”和“后”之间的部分构成从句部分（蓝色）。最终得到正确的译文“After $...$”。

-\parinterval 使用句法信息在机器翻译中不新鲜。在基于规则和模板的翻译模型中，就大量使用了句法等结构信息。只是由于早期句法分析技术不成熟，系统的整体效果并不突出。在数据驱动的方法中，句法可以很好地融合在统计建模中。通过概率化的文法设计，可以对翻译过程进行很好的描述。
+\parinterval 使用句法信息在机器翻译中并不新鲜。在基于规则和模板的翻译模型中，就大量使用了句法等结构信息。只是由于早期句法分析技术不成熟，系统的整体效果并不突出。在数据驱动的方法中，句法可以很好地融合在统计建模中。通过概率化的句法设计，可以对翻译过程进行很好的描述。

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -180,7 +180,7 @@

 \subsection{同步上下文无关文法}

-\parinterval {\small\bfnew{基于层次短语的模型}}\index{基于层次短语的模型}（Hierarchical Phrase-based Model）\index{Hierarchical Phrase-based Model}是一个经典的统计机器翻译模型\upcite{chiang2005a,chiang2007hierarchical}。这个模型可以很好地解决短语系统对翻译中长距离调序建模不足的问题。基于层次短语的系统也在多项机器翻译比赛中取得了很好的成绩。这项工作也获得了自然处理领域顶级会议ACL2015的最佳论文奖。
+\parinterval {\small\bfnew{基于层次短语的模型}}\index{基于层次短语的模型}（Hierarchical Phrase-based Model）\index{Hierarchical Phrase-based Model}是一个经典的统计机器翻译模型\upcite{chiang2005a,chiang2007hierarchical}。这个模型可以很好地解决短语系统对翻译中长距离调序建模不足的问题。基于层次短语的系统也在多项机器翻译比赛中取得了很好的成绩。这项工作也获得了自然语言处理领域顶级会议ACL2015的最佳论文奖。

 \parinterval 层次短语模型的核心是把翻译问题归结为两种语言词串的同步生成问题。实际上，词串的生成问题是自然语言处理中的经典问题，早期的研究更多的是关注单语句子的生成，比如，如何使用句法树描述一个句子的生成过程。层次短语模型的创新之处是把传统单语词串的生成推广到双语词串的同步生成上。这使得机器翻译可以使用类似句法分析的方法进行求解。

@@ -270,7 +270,7 @@ d = {r_1} \circ {r_2} \circ {r_3} \circ {r_4}
 \label{eq:8-1}
 \end{eqnarray}

-\parinterval 在层次短语模型中，每个翻译推导都唯一的对应一个目标语译文。因此，可以用推导的概率$\textrm{P}(d)$描述翻译的好坏。同基于短语的模型是一样的（见{\chapterseven}数学建模小节），层次短语翻译的目标是：求概率最高的翻译推导$\hat{d}=\arg\max\textrm{P}(d)$。值得注意的是，基于推导的方法在句法分析中也十分常用。层次短语翻译实质上也是通过生成翻译规则的推导来对问题的表示空间进行建模。在\ref{section-8.3} 节还将看到，这种方法可以被扩展到语言学上基于句法的翻译模型中。而且这些模型都可以用一种被称作超图的结构来进行建模。从某种意义上讲，基于规则推导的方法将句法分析和机器翻译进行了形式上的统一。因此机器翻译也借用了很多句法分析的思想。
+\parinterval 在层次短语模型中，每个翻译推导都唯一地对应一个目标语译文。因此，可以用推导的概率$\textrm{P}(d)$描述翻译的好坏。同基于短语的模型是一样的（见{\chapterseven}数学建模小节），层次短语翻译的目标是：求概率最高的翻译推导$\hat{d}=\arg\max\textrm{P}(d)$。值得注意的是，基于推导的方法在句法分析中也十分常用。层次短语翻译实质上也是通过生成翻译规则的推导来对问题的表示空间进行建模。在\ref{section-8.3} 节还将看到，这种方法可以被扩展到语言学上基于句法的翻译模型中。而且这些模型都可以用一种被称作超图的结构来进行建模。从某种意义上讲，基于规则推导的方法将句法分析和机器翻译进行了形式上的统一。因此机器翻译也借用了很多句法分析的思想。

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -284,7 +284,7 @@ d = {r_1} \circ {r_2} \circ {r_3} \circ {r_4}
 \textrm{S} & \to & \langle\ \textrm{X}_1,\ \textrm{X}_1\ \rangle \nonumber
 \end{eqnarray}

-\parinterval 胶水规则引入了一个新的非终结符S，S只能和X进行顺序拼接，或者S由X生成。如果把S看作文法的起始符，使用胶水规则后，相当于把句子划分为若干个部分，每个部分都被归纳为X。之后，顺序地把这些X拼接到一起，得到最终的译文。比如，最极端的情况，整个句子会生成一个X，之后再归纳为S，这时并不需要进行胶水规则的顺序拼接；另一种极端的情况，每个单词都是独立的被翻译，被归纳为X，之后先把最左边的X归纳为S，再依次把剩下的X依次拼到一起。这样的推导形式如下：
+\parinterval 胶水规则引入了一个新的非终结符S，S只能和X进行顺序拼接，或者S由X生成。如果把S看作文法的起始符，使用胶水规则后，相当于把句子划分为若干个部分，每个部分都被归纳为X。之后，顺序地把这些X拼接到一起，得到最终的译文。比如，最极端的情况，整个句子会生成一个X，之后再归纳为S，这时并不需要进行胶水规则的顺序拼接；另一种极端的情况，每个单词都是独立地被翻译，被归纳为X，之后先把最左边的X归纳为S，再依次把剩下的X依次拼到一起。这样的推导形式如下：
 \begin{eqnarray}
 \textrm{S} & \to & \langle\ \textrm{S}_1\ \textrm{X}_2,\ \textrm{S}_1\ \textrm{X}_2\ \rangle \nonumber \\
                & \to & \langle\ \textrm{S}_3\ \textrm{X}_4\ \textrm{X}_2,\ \textrm{S}_3\ \textrm{X}_4\ \textrm{X}_2\ \rangle \nonumber \\
@@ -382,7 +382,7 @@ y&=&\beta_0 y_{\pi_1} \beta_1 y_{\pi_2} ... \beta_{m-1} y_{\pi_m} \beta_m

 \subsection{翻译特征}

-\parinterval 在层次短语模型中，每个翻译推导都有一个模型得分$\textrm{score}(d,\seq{s},\seq{t})$。$\textrm{score}(d,\seq{s},\seq{t})$是若干特征的线性加权之和：$\textrm{score}(d,\seq{t},\seq{s})=\sum_{i=1}^M\lambda_i\cdot h_i (d,\seq{t},\seq{s})$，其中$\lambda_i$是特征权重，$h_i (d,\seq{t},\seq{s})$是特征函数。层次短语模型的特征包括与规则相关的特征和语言模型特征，如下：
+\parinterval 在层次短语模型中，每个翻译推导都有一个模型得分$\textrm{score}(d,\seq{t},\seq{s})$。$\textrm{score}(d,\seq{t},\seq{s})$是若干特征的线性加权之和：$\textrm{score}(d,\seq{t},\seq{s})=\sum_{i=1}^M\lambda_i\cdot h_i (d,\seq{t},\seq{s})$，其中$\lambda_i$是特征权重，$h_i (d,\seq{t},\seq{s})$是特征函数。层次短语模型的特征包括与规则相关的特征和语言模型特征，如下：

 \parinterval 对于每一条翻译规则LHS$\to \langle \alpha, \beta ,\sim \rangle$，有：

@@ -390,7 +390,7 @@ y&=&\beta_0 y_{\pi_1} \beta_1 y_{\pi_2} ... \beta_{m-1} y_{\pi_m} \beta_m
 \vspace{0.5em}
 \item 	(h1-2)短语翻译概率（取对数），即$\textrm{log}(\textrm{P}(\alpha \mid \beta))$和$\textrm{log}(\textrm{P}(\beta \mid \alpha))$，特征的计算与基于短语的模型完全一样；
 \vspace{0.5em}
-\item 	(h3-4)词汇化翻译概率（取对数），即$\textrm{log}(\textrm{P}_{\textrm{lex}}(\alpha \mid \beta))$和$\textrm{log}(\textrm{P}(\beta \mid \alpha))$，特征的计算与基于短语的模型完全一样；
+\item 	(h3-4)词汇化翻译概率（取对数），即$\textrm{log}(\textrm{P}_{\textrm{lex}}(\alpha \mid \beta))$和$\textrm{log}(\textrm{P}_{\textrm{lex}}(\beta \mid \alpha))$，特征的计算与基于短语的模型完全一样；
 \vspace{0.5em}
 \item (h5)翻译规则数量，让模型自动学习对规则数量的偏好，同时避免使用过少规则造成分数偏高的现象；
 \vspace{0.5em}
@@ -438,7 +438,7 @@ h_i (d,\seq{t},\seq{s})=\sum_{r \in d}h_i (r)

 \parinterval 层次短语模型解码的目标是找到模型得分最高的推导，即：
 \begin{eqnarray}
-\hat{d} = \argmax_{d}\ \textrm{score}(d,\seq{s},\seq{t})
+\hat{d} = \argmax_{d}\ \textrm{score}(d,\seq{t},\seq{s})
 \label{eq:8-7}
 \end{eqnarray}

@@ -451,7 +451,7 @@ h_i (d,\seq{t},\seq{s})=\sum_{r \in d}h_i (r)

 \parinterval 由于层次短语规则本质上就是CFG规则，因此公式\eqref{eq:8-7}代表了一个典型的句法分析过程。需要做的是，用模型源语言端的CFG对输入句子进行分析，同时用模型目标语言端的CFG生成译文。基于CFG的句法分析是自然语言处理中的经典问题。一种广泛使用的方法是：首先把CFG转化为$\varepsilon$-free的{\small\bfnew{乔姆斯基范式}}\index{乔姆斯基范式}（Chomsky Normal Form）\index{Chomsky Normal Form}\footnote[5]{能够证明任意的CFG都可以被转换为乔姆斯基范式，即文法只包含形如A$\to$BC或A$\to$a的规则。这里，假设文法中不包含空串产生式A$\to\varepsilon$，其中$\varepsilon$表示空字符串。}，之后采用CKY方法进行分析。

-\parinterval CKY是形式语言中一种常用的句法分析方法\upcite{cocke1969programming,younger1967recognition,kasami1966efficient}。它主要用于分析符合乔姆斯基范式的句子。由于乔姆斯基范式中每个规则最多包含两叉（或者说两个变量），因此CKY方法也可以被看作是基于二叉规则的一种分析方法。对于一个待分析的字符串，CKY方法从小的“范围”开始，不断扩大分析的“范围”，最终完成对整个字符串的分析。在CKY方法中，一个重要的概念是{\small\bfnew{跨度}}\index{跨度}（Span）\index{Span}，所谓跨度表示了一个符号串的范围。这里可以把跨度简单的理解为从一个起始位置到一个结束位置中间的部分。
+\parinterval CKY是形式语言中一种常用的句法分析方法\upcite{cocke1969programming,younger1967recognition,kasami1966efficient}。它主要用于分析符合乔姆斯基范式的句子。由于乔姆斯基范式中每个规则最多包含两叉（或者说两个变量），因此CKY方法也可以被看作是基于二叉规则的一种分析方法。对于一个待分析的字符串，CKY方法从小的“范围”开始，不断扩大分析的“范围”，最终完成对整个字符串的分析。在CKY方法中，一个重要的概念是{\small\bfnew{跨度}}\index{跨度}（Span）\index{Span}，所谓跨度表示了一个符号串的范围。这里可以把跨度简单地理解为从一个起始位置到一个结束位置中间的部分。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -573,7 +573,7 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q

 \parinterval 假设有$n$个规则源语言端相同，规则中每个变量可以被替换为$m$个结果，对于只含有一个变量的规则，一共有$nm$种不同的组合。如果规则含有两个变量，这种组合的数量是$n{m}^2$。由于翻译中会进行大量的规则匹配，如果每个匹配的源语言端都考虑所有$n{m}^2$种译文的组合，解码速度会很慢。

-\parinterval 在层次短语系统中，会进一步对搜索空间剪枝。简言之，此时并不需要对所有$n{m}^2$种组合进行遍历，而是只考虑其中的一部分组合。这种方法也被称作{\small\bfnew{立方剪枝}}\index{立方剪枝}（Cube Pruning）\index{Cube Pruning}。所谓“ 立方”是指组合译文时的三个维度：规则的目标语端、第一个变量所对应的翻译候选、第二个变量所对应的翻译候选。立方剪枝假设所有的译文候选都经过排序，比如，按照短语翻译概率排序。这样，每个译文都对应一个坐标，比如，$(i,j,k)$就表示第$i$个规则目标语端、第二个变量的第$j$个翻译候选、第三个变量的第$k$个翻译候选的组合。于是，可以把每种组合看作是一个三维空间中的一个点。在立方剪枝中，开始的时候会看到$(0,0,0)$这个翻译假设，并把这个翻译假设放入一个优先队列中。之后每次从这个优先队里中弹出最好的结果，之后沿着三个维度分别将坐标加1，比如，如果优先队列弹出$(i,j,k)$，则会生成$(i+1,j,k)$、$(i,j+1,k)$和$(i,j,k+1)$这三个新的翻译假设。之后，计算出它们的模型得分，并压入优先队列。这个过程不断被执行，直到达到终止条件，比如，扩展次数达到一个上限。
+\parinterval 在层次短语系统中，会进一步对搜索空间剪枝。简言之，此时并不需要对所有$n{m}^2$种组合进行遍历，而是只考虑其中的一部分组合。这种方法也被称作{\small\bfnew{立方剪枝}}\index{立方剪枝}（Cube Pruning）\index{Cube Pruning}。所谓“立方”是指组合译文时的三个维度：规则的目标语端、第一个变量所对应的翻译候选、第二个变量所对应的翻译候选。立方剪枝假设所有的译文候选都经过排序，比如，按照短语翻译概率排序。这样，每个译文都对应一个坐标，比如，$(i,j,k)$就表示第$i$个规则目标语端、第一个变量的第$j$个翻译候选、第二个变量的第$k$个翻译候选的组合。于是，可以把每种组合看作是一个三维空间中的一个点。在立方剪枝中，开始的时候会看到$(0,0,0)$这个翻译假设，并把这个翻译假设放入一个优先队列中。之后每次从这个优先队列中弹出最好的结果，然后沿着三个维度分别将坐标加1，比如，如果优先队列弹出$(i,j,k)$，则会生成$(i+1,j,k)$、$(i,j+1,k)$和$(i,j,k+1)$这三个新的翻译假设。之后，计算出它们的模型得分，并压入优先队列。这个过程不断被执行，直到达到终止条件，比如，扩展次数达到一个上限。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -617,7 +617,7 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q
 \vspace{0.5em}
 \end{itemize}

-\parinterval 实际上，基于层次短语的方法可以被看作是介于基于短语的方法和基于语言学句法的方法之间的一种折中。它的优点在于，具备短语模型简单、灵活的优点，同时，由于同步翻译文法可以对句子的层次结构进行表示，因此也能够处理一些较长距离的调序问题。但是，另一方面，层次短语模型并不是一种“精细”的句法模型，当翻译需要复杂的结构信息时，这种模型可能会无能为力。
+\parinterval 实际上，基于层次短语的方法可以被看作是介于基于短语的方法和基于语言学句法的方法之间的一种折中。它的优点在于，保留了短语模型简单、灵活的特点，同时，由于同步翻译文法可以对句子的层次结构进行表示，因此也能够处理一些较长距离的调序问题。但是，另一方面，层次短语模型并不是一种“精细”的句法模型，当翻译需要复杂的结构信息时，这种模型可能会无能为力。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -751,7 +751,7 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q

 \subsection{基于树结构的文法}

-\parinterval 基于句法的翻译模型的一个核心问题是要对树结构进行建模，进而完成树之间或者树和串之间的转换。在计算机领域中，所谓树就是由一些节点组成的层次关系的集合。计算机领域的树和自然世界中的树没有任何关系，只是借用了相似的概念，因为这种层次结构很像一个倒过来的树。在使用树时，经常会把树的层次结构转化为序列结构，称为树结构的{\small\bfnew{序列化}}\index{序列化}或者{\small\bfnew{线性化}}\index{线性化}（Linearization）\index{Linearization}。
+\parinterval 基于句法的翻译模型的一个核心问题是要对树结构进行建模，进而完成树之间或者树和串之间的转换。在计算机领域中，所谓树就是由一些节点组成的层次关系的集合。计算机领域的树和自然世界中的树没有任何关系，只是借用了相似的概念，因为这种层次结构很像一棵倒过来的树。在使用树时，经常会把树的层次结构转化为序列结构，称为树结构的{\small\bfnew{序列化}}\index{序列化}或者{\small\bfnew{线性化}}\index{线性化}（Linearization）\index{Linearization}。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -863,7 +863,7 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q
 \end{figure}
 %-------------------------------------------

-\parinterval 规则的推导对应了一种源语言和目标语言树结构的同步生成的过程。比如，使用下面的规则集：
+\parinterval 规则的推导对应了一种源语言和目标语言树结构的同步生成过程。比如，使用下面的规则集：
 {
 \begin{eqnarray}
 r_3: \quad \textrm{AD(大幅度)} \rightarrow \textrm{RB(drastically)}\qquad\qquad\qquad\qquad\qquad\qquad\qquad\ \nonumber \\
@@ -902,7 +902,7 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex

 \subsubsection{3. 树到串翻译规则}

-\parinterval 基于树结构的文法可以很好的表示两个树片段之间的对应关系，即树到树翻译规则。那树到串翻译规则该如何表示呢？实际上，基于树结构的文法也同样适用于树到串模型。比如，图\ref{fig:8-22}是一个树片段到串的映射，它可以被看作是树到串规则的一种表示。
+\parinterval 基于树结构的文法可以很好地表示两个树片段之间的对应关系，即树到树翻译规则。那树到串翻译规则该如何表示呢？实际上，基于树结构的文法也同样适用于树到串模型。比如，图\ref{fig:8-22}是一个树片段到串的映射，它可以被看作是树到串规则的一种表示。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -1025,7 +1025,7 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \end{definition}
 %-------------------------------------------

-\parinterval 可信节点表示这个树节点$node$和树中的其他部分（不包括$node$的祖先和孩子）没有任何词对齐上的歧义。也就是说，这个节点可以完整的对应到目标语言句子的一个连续范围，不会出现在这个范围中的词对应到其他节点的情况。如果节点不是可信节点，则表示它会引起词对齐的歧义，因此不能作为树到串规则中源语言树片段的根节点或者变量部分。图\ref{fig:8-24}给出了一个可信节点的实例。
+\parinterval 可信节点表示这个树节点$node$和树中的其他部分（不包括$node$的祖先和孩子）没有任何词对齐上的歧义。也就是说，这个节点可以完整地对应到目标语言句子的一个连续范围，不会出现在这个范围中的词对应到其他节点的情况。如果节点不是可信节点，则表示它会引起词对齐的歧义，因此不能作为树到串规则中源语言树片段的根节点或者变量部分。图\ref{fig:8-24}给出了一个可信节点的实例。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -1177,7 +1177,7 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \textrm{VP(P(对)}\ \ \textrm{NP(NN(局势))}\ \ \textrm{VP}_1) \rightarrow \textrm{VP}_1\ \ \textrm{about}\ \ \textrm{the}\ \ \textrm{situation} \nonumber
 \end{eqnarray}

-\parinterval 而这条规则需要组合三条最小规则才能得到，但是在SPMT中可以直接得到。相比规则组合的方法，SPMT方法可以更有效的抽取包含短语的规则。
+\parinterval 而这条规则需要组合三条最小规则才能得到，但是在SPMT中可以直接得到。相比规则组合的方法，SPMT方法可以更有效地抽取包含短语的规则。

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -1199,7 +1199,7 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \end{figure}
 %-------------------------------------------

-\parinterval 图\ref{fig:8-32}给出了一个实例，其中的名词短语（NP），包含四个词，都在同一层树结构中。由于“唐纳德$\ $特朗普”并不是一个独立的句法结构，因此无法抽取类似于下面这样的规则：
+\parinterval 图\ref{fig:8-32}给出了一个实例，其中的名词短语（NP），包含四个词，都在同一层树结构中。由于“乔治$\ $华盛顿”并不是一个独立的句法结构，因此无法抽取类似于下面这样的规则：
 \begin{eqnarray}
 \textrm{NP(NN(乔治)}\ \textrm{NN(华盛顿))} \rightarrow \textrm{Washington} \nonumber
 \end{eqnarray}
@@ -1287,7 +1287,7 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex

 \subsubsection{2. 基于对齐矩阵的规则抽取}

-\parinterval 同词对齐一样，节点对齐也会存在错误，这样就不可避免的造成规则抽取的错误。既然单一的对齐中含有错误，那能否让系统看到更多样的对齐结果，进而提高正确规则被抽取到的几率呢？答案是肯定的。实际上，在基于短语的模型中就有基于多个词对齐（如$n$-best词对齐）进行规则抽取的方法\upcite{liu2009weighted}，这种方法可以在一定程度上提高了短语的召回率。在树到树规则抽取中也可以使用多个节点对齐结果进行规则抽取。但是，简单使用多个对齐结果会使系统运行代价线性增长，而且即使是$n$-best对齐，也无法保证涵盖到正确的对齐结果。对于这个问题，另一种思路是使用对齐矩阵进行规则的“软”抽取。
+\parinterval 同词对齐一样，节点对齐也会存在错误，这样就不可避免地造成规则抽取的错误。既然单一的对齐中含有错误，那能否让系统看到更多样的对齐结果，进而提高正确规则被抽取到的几率呢？答案是肯定的。实际上，在基于短语的模型中就有基于多个词对齐（如$n$-best词对齐）进行规则抽取的方法\upcite{liu2009weighted}，这种方法可以在一定程度上提高短语的召回率。在树到树规则抽取中也可以使用多个节点对齐结果进行规则抽取。但是，简单使用多个对齐结果会使系统运行代价线性增长，而且即使是$n$-best对齐，也无法保证涵盖到正确的对齐结果。对于这个问题，另一种思路是使用对齐矩阵进行规则的“软”抽取。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -1298,7 +1298,7 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \end{figure}
 %-------------------------------------------

-\parinterval 所谓对齐矩阵，是描述两个句法树节点之间对应强度的数据结构。矩阵的每个单元中都是一个0到1之间的数字。规则抽取时，可以认为所有节点之间都存在对齐，这样可以抽取出很多$n$-best对齐中无法覆盖的规则。图\ref{fig:8-36}展示了一个用对齐矩阵的进行规则抽取的实例。其中矩阵1（Matrix 1）表示的标准的1-best节点对齐，矩阵2（Matrix 2）表示的是一种概率化的对齐矩阵。可以看到使用矩阵2可以抽取到更多样的规则。另外，值得注意的是，基于对齐矩阵的方法也同样适用于短语和层次短语规则的抽取。关于对齐矩阵的生成可以参考相关论文的内容\upcite{xiao2013unsupervised,liu2009weighted,sun2010exploring,DBLP:conf/coling/SunZT10}。
+\parinterval 所谓对齐矩阵，是描述两个句法树节点之间对应强度的数据结构。矩阵的每个单元中都是一个0到1之间的数字。规则抽取时，可以认为所有节点之间都存在对齐，这样可以抽取出很多$n$-best对齐中无法覆盖的规则。图\ref{fig:8-36}展示了一个用对齐矩阵进行规则抽取的实例。其中矩阵1（Matrix 1）表示的是标准的1-best节点对齐，矩阵2（Matrix 2）表示的是一种概率化的对齐矩阵。可以看到使用矩阵2可以抽取到更多样的规则。另外，值得注意的是，基于对齐矩阵的方法也同样适用于短语和层次短语规则的抽取。关于对齐矩阵的生成可以参考相关论文的内容\upcite{xiao2013unsupervised,liu2009weighted,sun2010exploring,DBLP:conf/coling/SunZT10}。

 \parinterval 此外，在基于句法的规则抽取中，一般会对规则进行一些限制，以避免规则数量过大，系统无法处理。比如，可以限制树片段的深度、变量个数、规则组合的次数等等。这些限制往往需要根据具体任务进行设计和调整。

@@ -1433,7 +1433,7 @@ d_1 = {d'} \circ {r_5}
 \end{figure}
 %-------------------------------------------

-\parinterval 如图\ref{fig:8-38}所示，覆盖相同跨度的节点会被放入同一个表格单元，但是不同句法标记的节点会被看作是不同的项（Item）。这种组织方式建立了一个索引，通过索引可以很容易的访问同一个跨度下的所有推导。比如，如果采用自下而上的分析，可以从小跨度的表格单元开始，构建推导，并填写表格单元。这个过程中，可以访问之前的表格单元来获得所需的局部推导（类似于前面提到的$d'$）。该过程重复执行，直到处理完最大跨度的表格单元。而最后一个表格单元就保存了完整推导的根节点。通过回溯的方式，能够把所有推导都生成出来。
+\parinterval 如图\ref{fig:8-38}所示，覆盖相同跨度的节点会被放入同一个表格单元，但是不同句法标记的节点会被看作是不同的项（Item）。这种组织方式建立了一个索引，通过索引可以很容易地访问同一个跨度下的所有推导。比如，如果采用自下而上的分析，可以从小跨度的表格单元开始，构建推导，并填写表格单元。这个过程中，可以访问之前的表格单元来获得所需的局部推导（类似于前面提到的$d'$）。该过程重复执行，直到处理完最大跨度的表格单元。而最后一个表格单元就保存了完整推导的根节点。通过回溯的方式，能够把所有推导都生成出来。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -1574,7 +1574,7 @@ d_1 = {d'} \circ {r_5}
 \textrm{VP}_1\ \ \textrm{NP}_2 &\rightarrow& \textrm{V103(}\ \ \textrm{VP}_1\ \ \textrm{NP}_2 ) \nonumber
 \end{eqnarray}

-\noindent 可以看到，这两条新的规则源语言端只有两个部分，代表两个分叉。V103是一个新的标签，它没有任何句法含义。不过，为了保证二叉化后规则目标语部分的连续性，需要考虑源语言和目标语二叉化的同步性\upcite{DBLP:conf/naacl/ZhangHGK06,Tong2009Better}。这样的规则与CKY方法一起使用完成解码，具体内容可以参考\ref{section-8.2.4}节的内容。
+\noindent 可以看到，这两条新的规则中源语言端只有两个部分，代表两个分叉。V103是一个新的标签，它没有任何句法含义。不过，为了保证二叉化后规则目标语部分的连续性，需要考虑源语言和目标语二叉化的同步性\upcite{DBLP:conf/naacl/ZhangHGK06,Tong2009Better}。这样的规则与CKY方法一起使用完成解码，具体内容可以参考\ref{section-8.2.4}节的内容。
 \vspace{0.5em}
 \end{itemize}

@@ -1586,15 +1586,15 @@ d_1 = {d'} \circ {r_5}
 \sectionnewpage
 \section{小结及深入阅读}

-\parinterval 自基于规则的方法开始，如何使用句法信息就是机器翻译研究人员关注的热点。在统计机器翻译时代，句法信息与机器翻译的结合成为了最具时态特色的研究方向之一。句法结构具有高度的抽象性，因此可以缓解基于词串方法不善于处理句子上层结构的问题。
+\parinterval 自基于规则的方法开始，如何使用句法信息就是机器翻译研究人员关注的热点。在统计机器翻译时代，句法信息与机器翻译的结合成为了最具时代特色的研究方向之一。句法结构具有高度的抽象性，因此可以缓解基于词串方法不善于处理句子上层结构的问题。

 \parinterval 本章对基于句法的机器翻译模型进行了介绍，并重点讨论了相关的建模、翻译规则抽取以及解码问题。从某种意义上说，基于句法的模型与基于短语的模型都同属一类模型，因为二者都假设：两种语言间存在由短语或者规则构成的翻译推导，而机器翻译的目标就是找到最优的翻译推导。但是，由于句法信息有其独特的性质，因此也给机器翻译带来了新的问题。有几方面问题值得关注：

 \begin{itemize}
 \vspace{0.5em}
-\item 从建模的角度看，早期的统计机器翻译模型已经涉及到了树结构的表示问题\upcite{DBLP:conf/acl/AlshawiBX97,DBLP:conf/acl/WangW98}。不过，基于句法的翻译模型的真正崛起还源自同步文法的提出。初期的工作大多集中在反向转录文法和括号转录文法方面\upcite{DBLP:conf/acl-vlc/Wu95,wu1997stochastic,DBLP:conf/acl/WuW98}，这类方法也被用于短语获取\upcite{ja2006obtaining,DBLP:conf/acl/ZhangQMG08}。进一步，研究者提出了更加通用的层次模型来描述翻译过程\upcite{chiang2005a,DBLP:conf/coling/ZollmannVOP08,DBLP:conf/acl/WatanabeTI06}，本章介绍的层次短语模型就是其中典型的代表。之后，使用语言学句法的模型也逐渐兴起。最具代表性的是在单语言端使用语言学句法信息的模型\upcite{DBLP:conf/naacl/GalleyHKM04,galley2006scalable,marcu2006spmt,DBLP:conf/naacl/HuangK06,DBLP:conf/emnlp/DeNeefeKWM07,DBLP:conf/wmt/LiuG08,liu2006tree}，即：树到串翻译模型和串到树翻译模型。值得注意的是，除了直接用句法信息定义翻译规则，也有研究者将句法信息作为软约束改进层次短语模型\upcite{DBLP:conf/wmt/ZollmannV06,DBLP:conf/acl/MartonR08}。这类方法具有很大的灵活性，既保留了层次短语模型比较健壮的特点，同时也兼顾了语言学句法对翻译的指导作用。在同一时期，也有研究者提出同时使用双语两端的语言学句法树对翻译进行建模，比较有代表性的工作是使用同步树插入文法（Synchronous Tree-Insertion Grammars）和同步树替换文法（Synchronous Tree-Substitution Grammars）进行树到树翻译的建模\upcite{Nesson06inductionof,Zhang07atree-to-tree,liu2009improving}。不过，树到树翻译假设两种语言间的句法结构能够相互转换，而这个假设并不总是成立。因此树到树翻译系统往往要配合一些技术，如树二叉化，来提升系统的健壮性。
+\item 从建模的角度看，早期的统计机器翻译模型已经涉及到了树结构的表示问题\upcite{DBLP:conf/acl/AlshawiBX97,DBLP:conf/acl/WangW98}。不过，基于句法的翻译模型的真正崛起是在同步文法提出之后。初期的工作大多集中在反向转录文法和括号转录文法方面\upcite{DBLP:conf/acl-vlc/Wu95,wu1997stochastic,DBLP:conf/acl/WuW98}，这类方法也被用于短语获取\upcite{ja2006obtaining,DBLP:conf/acl/ZhangQMG08}。进一步，研究者提出了更加通用的层次模型来描述翻译过程\upcite{chiang2005a,DBLP:conf/coling/ZollmannVOP08,DBLP:conf/acl/WatanabeTI06}，本章介绍的层次短语模型就是其中典型的代表。之后，使用语言学句法的模型也逐渐兴起。最具代表性的是在单语言端使用语言学句法信息的模型\upcite{DBLP:conf/naacl/GalleyHKM04,galley2006scalable,marcu2006spmt,DBLP:conf/naacl/HuangK06,DBLP:conf/emnlp/DeNeefeKWM07,DBLP:conf/wmt/LiuG08,liu2006tree}，即：树到串翻译模型和串到树翻译模型。值得注意的是，除了直接用句法信息定义翻译规则，也有研究者将句法信息作为软约束改进层次短语模型\upcite{DBLP:conf/wmt/ZollmannV06,DBLP:conf/acl/MartonR08}。这类方法具有很大的灵活性，既保留了层次短语模型比较健壮的特点，同时也兼顾了语言学句法对翻译的指导作用。在同一时期，也有研究者提出同时使用双语两端的语言学句法树对翻译进行建模，比较有代表性的工作是使用同步树插入文法（Synchronous Tree-Insertion Grammars）和同步树替换文法（Synchronous Tree-Substitution Grammars）进行树到树翻译的建模\upcite{Nesson06inductionof,Zhang07atree-to-tree,liu2009improving}。不过，树到树翻译假设两种语言间的句法结构能够相互转换，而这个假设并不总是成立。因此树到树翻译系统往往要配合一些技术，如树二叉化，来提升系统的健壮性。
 \vspace{0.5em}
-\item 在基于句法的模型中，常常会使用句法分析器完成句法分析树的生成。由于句法分析器会产生错误，因此这些错误会对机器翻译系统产生影响。对于这个问题，一种解决办法是同时考虑更多的句法树，这样增加正确句法分析结果被使用到的概率。其中，比较典型的方式基于句法森林的方法\upcite{DBLP:conf/acl/MiHL08,DBLP:conf/emnlp/MiH08}，比如，在规则抽取或者解码阶段使用句法森林，而不是仅仅使用一棵单独的句法树。另一种思路是，对句法结构进行松弛操作，即在翻译的过程中并不严格遵循句法结构\upcite{zhu2011improving,DBLP:conf/emnlp/ZhangZZ11}。实际上，前面提到的基于句法软约束的模型也是这类方法的一种体现\upcite{DBLP:conf/wmt/ZollmannV06,DBLP:conf/acl/MartonR08}。实际上，机器翻译领域的长期存在一个问题：使用什么样的句法结构是最适合机器翻译？因此，有研究者尝试对比不同的句法分析结果对机器翻译系统的影响\upcite{DBLP:conf/wmt/PopelMGZ11,DBLP:conf/coling/XiaoZZZ10}。也有研究者面向机器翻译任务自动归纳句法结构\upcite{DBLP:journals/tacl/ZhaiZZZ13}，而不是直接使用从单语小规模树库学习到的句法分析器，这样可以提高系统的健壮性。
+\item 在基于句法的模型中，常常会使用句法分析器完成句法分析树的生成。由于句法分析器会产生错误，因此这些错误会对机器翻译系统产生影响。对于这个问题，一种解决办法是同时考虑更多的句法树，从而增加正确句法分析结果被使用到的概率。其中，比较典型的是基于句法森林的方法\upcite{DBLP:conf/acl/MiHL08,DBLP:conf/emnlp/MiH08}，比如，在规则抽取或者解码阶段使用句法森林，而不是仅仅使用一棵单独的句法树。另一种思路是，对句法结构进行松弛操作，即在翻译的过程中并不严格遵循句法结构\upcite{zhu2011improving,DBLP:conf/emnlp/ZhangZZ11}。实际上，前面提到的基于句法软约束的模型也是这类方法的一种体现\upcite{DBLP:conf/wmt/ZollmannV06,DBLP:conf/acl/MartonR08}。事实上，机器翻译领域长期存在一个问题：使用什么样的句法结构最适合机器翻译？因此，有研究者尝试对比不同的句法分析结果对机器翻译系统的影响\upcite{DBLP:conf/wmt/PopelMGZ11,DBLP:conf/coling/XiaoZZZ10}。也有研究者面向机器翻译任务自动归纳句法结构\upcite{DBLP:journals/tacl/ZhaiZZZ13}，而不是直接使用从单语小规模树库学习到的句法分析器，这样可以提高系统的健壮性。
 \vspace{0.5em}
 \item 本章所讨论的模型大多基于短语结构树。另一个重要的方向是使用依存树进行翻译建模\upcite{DBLP:journals/mt/QuirkM06,DBLP:conf/wmt/XiongLL07,DBLP:conf/coling/Lin04}。依存树比短语结构树有更简单的结构，而且依存关系本身也是对“语义”的表征，因此也可以捕捉到短语结构树所无法涵盖的信息。同其它基于句法的模型类似，基于依存树的模型大多也需要进行规则抽取、解码等步骤，因此这方面的研究工作大多涉及翻译规则的抽取、基于依存树的解码等\upcite{DBLP:conf/acl/DingP05,DBLP:conf/coling/ChenXMJL14,DBLP:conf/coling/SuLMZLL10,DBLP:conf/coling/XieXL14,DBLP:conf/emnlp/LiWL15}。此外，基于依存树的模型也可以与句法森林结构相结合，对系统性能进行进一步提升\upcite{DBLP:conf/acl/MiL10,DBLP:conf/coling/TuLHLL10}。
 \vspace{0.5em}

--- a/bibliography.bib
+++ b/bibliography.bib
@@ -2312,13 +2312,12 @@ year = {2012}
 }
 @article{shannon1949communication,
  title ={Communication theory of secrecy systems},
-  author ={Shannon, Claude E},
+  author ={Claude E. Shannon},
  journal ={Bell system technical journal},
  volume ={28},
  number ={4},
  pages ={656--715},
-  year ={1949},
-  publisher ={Wiley Online Library}
+  year ={1949}
 }
 @inproceedings{DBLP:conf/acl/Moore04,
  author    = {Robert C. Moore},
@@ -2352,8 +2351,8 @@ year = {2012}
 }
 @article{1998Grammar,
  title={Grammar Inference and Statistical Machine Translation},
-  author={Ye-Yi Wang and Jaime Carbonell},
-  year={1998},
+  author={Ye-Yi Wang and Wayne Ward},
+  year={1999},
  publisher={Carnegie Mellon University}
 }

@@ -3227,12 +3226,10 @@ year = {2012}

 @inproceedings{2014Dynamic,
  title={Dynamic Phrase Tables for Machine Translation in an Interactive Post-editing Scenario},
-  author={Germann, Ulrich},
+  author={Ulrich Germann},
  publisher = {Association for Machine Translation in the Americas},
  year={2014},
 }
-
-
 %%%%% chapter 7------------------------------------------------------
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

@@ -3249,7 +3246,7 @@ year = {2012}
 }
 @article{chiang2007hierarchical,
    title={Hierarchical Phrase-Based Translation},
-    author ={Chiang David},
+    author ={David Chiang},
    journal ={Computational Linguistics},
    volume ={33},
    number ={2},
@@ -3258,8 +3255,7 @@ year = {2012}
 }
 @book{cocke1969programming,
  title ={Programming Languages and Their Compilers: Preliminary Notes},
-  author ={Cocke, J. and Schwartz, J.T.},
-  lccn ={76374279},
+  author ={Cocke, John and Schwartz, J.T.},
  year ={1970},
  publisher ={Courant Institute of Mathematical Sciences, New York University}
 }
@@ -3273,7 +3269,7 @@ year = {2012}
  year      = {1967}
 }
 @article{kasami1966efficient,
-  author ={Kasami, Tadao},
+  author ={Tadao Kasami},
  title ={An efficient recognition and syntax-analysis algorithm for context-free languages},
  journal ={Coordinated Science Laboratory Report no. R-257},
  year ={1966}
@@ -3298,14 +3294,14 @@ year = {2012}
 }
 @inproceedings{huang2006statistical,
  title ={Statistical syntax-directed translation with extended domain of locality},
-  author ={Huang, Liang and Knight, Kevin and Joshi, Aravind},
+  author ={Liang Huang and Kevin Knight and Aravind Joshi},
  pages ={66--73},
  year ={2006},
  publisher ={Computationally Hard Problems \& Joint Inference in Speech \& Language Processing}
 }
 @inproceedings{galley2004s,
  title ={What’s in a translation rule?},
-  author ={Galley, Michel and Hopkins, Mark and Knight, Kevin and Marcu, Daniel},
+  author ={Michel Galley and Mark Hopkins and Kevin Knight and Daniel Marcu},
  publisher={Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
  pages ={273--280},
  year ={2004}
@@ -3655,7 +3651,10 @@ year = {2012}
 }
 @article{Zhai2012Treebased,
  title={Treebased translation without using parse trees},
-  author={Zhai, Feifei and Zhang, Jiajun and Zhou, Yu and Zong, Chengqing},
+  author    = {Feifei Zhai and
+               Jiajun Zhang and
+               Yu Zhou and
+               Chengqing Zong},
  publisher = {International Conference on Computational Linguistics},
  year={2012},
 }
@@ -3771,7 +3770,7 @@ year = {2012}
 }
 @inproceedings{bangalore2001computing,
  title ={Computing consensus translation from multiple machine translation systems},
-  author ={Bangalore, B and Bordel, German and Riccardi, Giuseppe},
+  author ={Srinivas Bangalore and German Bordel and Giuseppe Riccardi},
  pages ={351--354},
  year ={2001},
  organization ={The Institute of Electrical and Electronics Engineers}
@@ -3790,7 +3789,7 @@ year = {2012}
 }
 @article{xiao2013bagging,
  title ={Bagging and boosting statistical machine translation systems},
-  author ={Xiao, Tong and Zhu, Jingbo and Liu, Tongran},
+  author ={Tong Xiao and Jingbo Zhu and Tongran Liu},
  publisher ={Artificial Intelligence},
  volume ={195},
  pages ={496--527},